
Exit poll for June 2017 election (UK)

2017-06-11

[Image: tweet from David Spiegelhalter, 2017-06-09]

It has been a while since I posted anything here, but I can’t resist this one.

Let me just give three numbers.  The first two are:

  • 314, the number of seats predicted for the largest party (Conservatives) in the UK House of Commons, at 10pm on Thursday (i.e., before even a single vote had been counted) from the exit poll commissioned jointly by broadcasters BBC, ITV and Sky.
  • 318, the actual number of seats that were won by the Conservatives, now that all the votes have been counted.

That highly accurate prediction changed the whole story on election night: most of the pre-election voting intention polls had predicted a substantial Conservative majority.  (And certainly that’s what Theresa May had expected to achieve when she made the mistake of calling a snap election, 3 years early.)  But the exit poll prediction made it pretty clear that the Conservatives would either not achieve a majority (for which 326 seats would be needed), or at best would be returned with a very small majority such as the one they held before the election.  Media commentary turned quickly to how a government might be formed in the seemingly likely event of a hung Parliament, and what the future might be for Mrs May.  The financial markets moved quite substantially, too, in the moments after 10pm.
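
The arithmetic behind those figures is small enough to spell out. The following is only an illustrative sketch in Python, using the three seat counts quoted above; the variable names are made up for the example.

```python
# Seat counts quoted in the post (variable names are illustrative only).
predicted_seats = 314     # exit poll prediction at 10pm
actual_seats = 318        # final result for the Conservatives
majority_threshold = 326  # seats needed for a majority

# How far off the exit poll was, in seats.
prediction_error = actual_seats - predicted_seats

# How far short of a majority the result fell.
majority_shortfall = majority_threshold - actual_seats

print(f"Exit poll error: {prediction_error} seats")
print(f"Shortfall from a majority: {majority_shortfall} seats")
```

Both the prediction and the outcome sit below the majority threshold, which is why the exit poll changed the story as soon as it was published.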

For more details on the exit poll, its history, and the methods used to achieve that kind of predictive accuracy, see Exit Polling Explained.

The third number I want to mention here is

  • 2.1.0

That’s the version of R that I had at the time of the 2005 General Election, when I completed the development of a fairly extensive set of R functions to use in connection with the exit poll (which at that time was done for BBC and ITV jointly).  Amazingly (to me!) the code that I wrote back in 2001–2005 still works fine.  My friend and former colleague Jouni Kuha, who stepped in as election-day statistician for the BBC when I gave it up after 2005, told me today that (with some tweaks, I presume!) it all works brilliantly still, as the basis for an extremely high-pressure data analysis on election day/night.  Very pleasing indeed; and strong testimony to the heroic efforts of the R Core Development Team, to keep everything stable with a view to the long term.

As suggested by that kind tweet reproduced above from the RSS President, David Spiegelhalter: Thursday’s performance was quite a triumph for the practical art and science of Statistics.  [And I think I am allowed to say this, since on this occasion I was not even there!  The credit for Thursday’s work goes to Jouni Kuha, along with John Curtice, Steve Fisher and the rest of the academic team of analysts who worked in the secret exit-poll “bunker” on 8 June.]

[Image: Power-to-the-nerds-2017-06-08]

R and citations

2011-06-25

We’re hosting the international useR! conference at Warwick this summer, and I thought it might be interesting to try to get some data on how the use of R is growing. I decided to look at scholarly citations to R, mainly because I know where to find the relevant information.

I have access to the ISI Web of Knowledge, as well as to Google Scholar. The data below comes from the ISI Web of Knowledge database, which counts (mainly?) citations found in academic journals.

Background: How R is cited
Since version 0.90.0 of R, which was released in November 1999, the distributed software has included a FAQ document containing (among many other things) information on how to cite R. Initially (in 1999) the instruction given in the FAQ was to cite the 1996 article by Ihaka and Gentleman in Journal of Computational and Graphical Statistics.

When R version 1.8.1 was released in November 2003 the advice on citing R changed: people using R in published work were asked to cite R Development Core Team (2003), R: A Language and Environment for Statistical Computing.

The “2003” part of the citation advice has changed with each passing year; for example when R 1.9.1 was released (in June 2004) it was updated to “2004”.

ISI Web of Knowledge: Getting the data
Finding the citation counts by searching the ISI database directly does not work, because:

  1. the ISI database does not index Journal of Computational and Graphical Statistics as far back as 1996; and
  2. the “R Development Core Team” citations are (rightly) not counted as citations to journal articles, so they also are not directly indexed.

So here is what I did: I looked up published papers in the ISI index which I knew would cite R correctly. [This was easy; for example my friend Achim Zeileis has published many papers of this kind, so a lot of the results were delivered through a search for his name as an author.] For each such paper, the citation of interest would appear in its references. I then asked the Web of Knowledge search engine for all other papers which cited the same source, with the resulting counts tabulated by year of publication.

It seems that the ISI database aims to associate a unique identifier with each cited item, including items that are not themselves indexed as journal articles in the database. This is what made the approach described above possible.

There’s a hitch, though! It seems that, for some cited items, more than one identifier gets used. Thus it is hard to be sure that the counts below include all of the citations to R: indeed, as I mention further below, I am pretty sure that my search will have missed some citations to R, where the identifier assigned by ISI was not their “normal” one. (This probably seems a bit cryptic, but should become clearer from the table below.)

Citation counts
As extracted from the ISI Web of Knowledge on 25 June 2011:

ISI identifier                                1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  Total
IHAKA R, J COMPUTATIONAL GRAP 5 : 299 1996       5    15    18    43   131   290   472   528   435   419   449   378   396   3579
*R DEV COR TEAM, R LANG ENV STAT COMP : 2003     -     -     -     -     -     -    39   123    91    57    39    25    14    388
*R DEV COR TEAM, R LANG ENV STAT COMP : 2004     -     -     -     -     -     -    16   235   421   327   289   187   126   1601
*R DEV COR TEAM, R LANG ENV STAT COMP : 2005     -     -     -     -     -     -     -    42   397   531   511   445   366   2292
*R DEV COR TEAM, LANG ENV STAT COMP : 2005       -     -     -     -     -     -     -     5    39    75    41    25    10    195
*R DEV COR TEAM, R LANG ENV STAT COMP : 2006     -     -     -     -     -     -     -     -    55   438   849   656   461   2459
*R DEV COR TEAM, R LANG ENV STAT COMP : 2007     -     -     -     -     -     -     -     -     -    92   714   962   733   2501
*R DEV COR TEAM, R LANG ENV STAT COMP : 2008     -     -     -     -     -     -     -     -     -     -   208  1402  1906   3516
*R DEV COR TEAM, LANG ENV STAT COMP : 2008       -     -     -     -     -     -     -     -     -     -     7    21    44     72
*R DEV COR TEAM, R LANG ENV STAT COMP : 2009     -     -     -     -     -     -     -     -     -     -     -   172  1363   1535
*R DEV COR TEAM, R LANG ENV STAT COMP : 2010     -     -     -     -     -     -     -     -     -     -     -     -   205    205
*R DEV COR TEAM, R LANG ENV STAT COMP :          -     -     -     -     -     -     1    12    14    25    36    81    93    262
Total                                            5    15    18    43   131   290   528   945  1452  1964  3143  4354  5717  18605

For the “R Development Core Team (year)” citations, the peak appears about 2 years after the year concerned. This presumably reflects journal review and backlog times.
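
The two-year lag can be checked directly against the table. The sketch below is only illustrative: it is written in Python, the counts are copied by hand from the main "R LANG ENV STAT COMP : year" rows (the less common identifier variants are ignored), and the names `counts` and `peak_lag` are made up for the example. Editions from 2008 onward are left out because their counts are still rising in 2010, the last year tabulated, so no peak is visible yet.

```python
# Yearly citation counts for each "R Development Core Team (year)" edition,
# copied from the table above. Every series ends in 2010.
LAST_YEAR = 2010
counts = {
    2003: [39, 123, 91, 57, 39, 25, 14],       # 2004..2010
    2004: [16, 235, 421, 327, 289, 187, 126],  # 2004..2010
    2005: [42, 397, 531, 511, 445, 366],       # 2005..2010
    2006: [55, 438, 849, 656, 461],            # 2006..2010
    2007: [92, 714, 962, 733],                 # 2007..2010
}


def peak_lag(edition, series):
    """Years between the cited edition and the year with the most citations."""
    first_year = LAST_YEAR - len(series) + 1
    peak_year = first_year + series.index(max(series))
    return peak_year - edition


lags = {edition: peak_lag(edition, series) for edition, series in counts.items()}
for edition, lag in lags.items():
    print(f"R Development Core Team ({edition}): peak {lag} years later")
```

For each of the five editions included, the largest yearly count falls two years after the edition year.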

There are almost certainly some ISI identifiers missing from the above table (and, as a result, almost certainly some citations not yet counted by me). For example, the number of citations found above to R Development Core Team (2009) is lower than might be expected given the general rate of growth that is evident in the table: there is probably at least one other identifier by which such citations are labelled in the ISI database (I just haven’t found it/them yet!). If anyone reading this can help with finding the “missing” identifiers and associated citation counts, I would be grateful.
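
One way to see the shortfall for R Development Core Team (2009) is to compare, across editions, the number of citations found in the year immediately after the edition year. The Python sketch below does this with counts copied by hand from the main "R LANG ENV STAT COMP : year" rows of the table; the names `following_year` and `dips` are illustrative, and the comparison itself is just one simple choice of check, not a formal test.

```python
# Citations found in the year after each edition year
# (for example, citations to the 2008 edition that appeared in 2009),
# copied from the main identifier rows of the table above.
following_year = {
    2003: 39,
    2004: 235,
    2005: 397,
    2006: 438,
    2007: 714,
    2008: 1402,
    2009: 1363,
}

editions = sorted(following_year)

# Editions whose following-year count is lower than the previous edition's.
dips = [
    current
    for previous, current in zip(editions, editions[1:])
    if following_year[current] < following_year[previous]
]

print("Editions that break the growth pattern:", dips)
```

Only the 2009 edition falls below its predecessor on this measure, which is consistent with some of its citations being filed under an identifier not yet found.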

The graph below shows the citations found within each year since 1998.

© David Firth, June 2011

To cite this entry:
Firth, D (2011). R and citations. Weblog entry at URL https://statgeek.wordpress.com/2011/06/25/r-and-citations/.

[Graph: citations found within each year since 1998.]

Citations to Ihaka and Gentleman (1996) and to R Development Core Team (any year) are distinguished in the graph, and the total count of the two kinds of citation is also shown.