Archive for the ‘R’ Category

Simple maths of a fairer USS deal

2018-03-16

In yesterday’s post I showed a graph, followed by some comments to suggest that future USS proposals with a flatter (or even increasing) “percent lost” curve would be fairer (and, as I argued earlier in my Robin Hood post, more affordable at the same time).

It’s now clear to me that my suggestion seemed a bit cryptic to many (maybe most!) who read it yesterday.  So here I will try to show more specifically how to achieve a flat curve.  (This is not because I think flat is optimal.  It’s mainly because it’s easy to explain.  As already mentioned, it might not be a bad idea if the curve was actually to increase a bit as salary levels increase; that would allow those with higher salaries to feel happy that they are doing their bit towards the sustainable future of USS.)

Flattening the curve

The graph below is the same as yesterday’s but with a flat (blue, dashed) line drawn at the level of 4% lost across all salary levels.

[Figure: percent of salary lost, by salary, with a dashed blue line at the 4% level]

I drew the line at 4% here just as an example, to illustrate the calculation.  The actual level needed (i.e., the “affordable” level for universities) would need to be determined by negotiation; but the maths is essentially the same, whatever the level (within reason).

Let’s suppose we want to adjust the USS contribution and benefits parameters to achieve just such a flat “percent lost” curve, at the 4% level.  How is that done?

I will assume here the same adjustable parameters that UUK and UCU appear to have in mind, namely:

  • employee contribution rate E (as percentage of salary — currently 8; was 8.7 in the 12 March proposal; was 8 in the January proposal)
  • threshold salary T, over which defined benefit (DB) pension entitlement ceases (which is currently £55.55k; was £42k in the 12 March proposal; and was £0 in the January proposal)
  • accrual rate A, in the DB pension.  Expressed here in percentage points (currently 100/75; was 100/85 in the 12 March proposal; and not relevant to the January proposal).
  • employer contribution rate (%) to the defined contribution (DC) part of USS pension.  Let’s allow different rates C_1 and C_2 for, respectively, salaries between T and £55.55k, and salaries over £55.55k. (Currently C_1 is irrelevant, and C_2 is 13 (max); these were both set at 12 in the 12th March proposal; and were both 13.25 in the January proposal.)

I will assume also, as all the recent proposals do, that the 1% USS match possibility is lost to all members.

Then, to get to 4% lost across the board, we need simply to solve the following linear equations.  (To see where these came from, please see this earlier post.)

For salary up to T:

 (E - 8) + 19(100/75 - A) + 1 = 4.

For salary between T and £55.55k:

  -8 + 19(100/75) - C_1 + 1 = 4.

For salary over £55.55k:

 13 - C_2 = 4.

Solving those last two equations is simple, and results in

 C_1 = 14.33, \qquad C_2 = 9.

The first equation above clearly allows more freedom: it’s just one equation, with two unknowns, so there are many solutions available.  Three example solutions, still based on the illustrative 4% loss level across all salary levels, are:

 E=8, \qquad A = 1.175 = 100/85.1

 E = 8.7, \qquad A = 1.212 = 100/82.5

 E = 11, \qquad A = 100/75.
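
As a quick check, the three example solutions can be substituted back into the first equation. The following is a minimal sketch in base R; the function name loss_below_T is only an illustrative label for the left-hand side of that equation, and the values of A are the rounded ones, so the results should come out close to 4 rather than exactly 4.

```r
## Left-hand side of the equation for salary up to T: the percent lost,
## given employee contribution E (%) and accrual rate A (percentage points).
## The function name is illustrative only.
loss_below_T <- function(E, A) {
    (E - 8) + 19 * (100/75 - A) + 1
}

## The three example solutions, using rounded values of A
check <- data.frame(E = c(8, 8.7, 11),
                    A = c(100/85.1, 1.21, 100/75))
check$loss <- loss_below_T(check$E, check$A)
print(check)
```

Any small departure from 4 in the first two rows is due only to the rounding of A.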

At the end here I’ll give code in R to do the above calculation quite generally, i.e., for any desired percentage loss level.  First let me just make a few remarks relating to all this.

Remarks

Choice of threshold

Note that the value of T does not enter into the above calculation.  Clearly there will be (negotiable) interplay between T and the required percentage loss, though, for a given level of affordability.

Choice of C_2

Much depends on the value of C_2.

The calculation above gives the value of C_2 needed for a flat “percent lost” curve, at any given level for the percent lost (which was 4% in the example above).

To achieve an increasing “percent lost” curve, we could simply reduce the value of C_2 further than the answer given by the above calculation.  Alternatively, as suggested in my earlier Robin Hood post, USS could apply a lower value of C_2 only for salaries above some higher threshold — i.e., in much the same spirit as progressive taxation of income.
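
To illustrate the first of those options, here is a rough sketch in base R of how the “percent lost” curve behaves when C_2 is set below the flat-curve value. The function name percent_lost_curve and the value C_2 = 7 are purely hypothetical choices made for this illustration; the 55.55 threshold and the 13% baseline are as in the equations above.

```r
## Percent of total salary lost (salary in thousands), assuming a marginal
## loss of `low` percent on salary up to the 55.55k threshold and an
## employer DC rate of C2 percent on salary above it.
## (Illustrative function; the 13 is the current maximum employer DC rate.)
percent_lost_curve <- function(salary, low, C2, threshold = 55.55) {
    below <- pmin(salary, threshold)
    above <- pmax(salary - threshold, 0)
    (below * low + above * (13 - C2)) / salary
}

salaries <- c(30, 55.55, 80, 120)
flat <- percent_lost_curve(salaries, low = 4, C2 = 9)        # flat at 4%
increasing <- percent_lost_curve(salaries, low = 4, C2 = 7)  # hypothetical lower C2
print(data.frame(salaries, flat, increasing))
```

With C_2 = 9 the curve stays at 4% throughout; with the lower value it stays at 4% up to the threshold and then rises with salary.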

Just as with income tax, it would be important not to set C_2 too small, otherwise the highest-paid members would quite likely want to leave USS.  There is clearly a delicate balance to be struck, at the top end of the salary spectrum.

But it is clear that if the higher-paid were to sacrifice at least as much as everyone else, in proportion to their salary, then that would allow the overall level of “percent lost” to be appreciably reduced, which would benefit the vast majority of USS members.

Determination of the overall “percent lost”

Everything written here constitutes a methodology to help with finding a good solution.  As mentioned at the top here, the actual solution — and in particular, the actual level of USS member pain (if any) deemed to be necessary to keep USS afloat — will be a matter for negotiation.  The maths here can help inform that negotiation, though.

Code for solving the above equations


## Function to compute the USS parameters needed for a
## flat "percent lost" curve
##
## Function arguments are:
## loss: in percentage points, the constant loss desired
## E: employee contribution, in percentage points
## A: the DB accrual rate
##
## Exactly one of E and A must be specified (ie, not NULL).
##
## Example calls:
## flatcurve(4.0, A = 100/75)
## flatcurve(2.0, E = 10.5)
## flatcurve(1.0, A = 100/75)  # status quo, just 1% "match" lost

flatcurve <- function(loss, E = NULL, A = NULL){

    if (is.null(E) && is.null(A)) {
        stop("E and A can't both be NULL")}
    if (!is.null(E) && !is.null(A)) {
        stop("one of {E, A} must be NULL")}

    c1 <- 19 * (100/75) - (7 + loss)
    c2 <- 13 - loss

    if (is.null(E)) {
        E <- 7 + loss - (19 * (100/75 - A))
    }

    if (is.null(A)) {
        A <- (E - 7 - loss + (19 * 100/75)) / 19
    }

    return(list(loss_percent = loss,
                employee_contribution_percent = E,
                accrual_reciprocal = 100/A,
                DC_employer_rate_below_55.55k = c1,
                DC_employer_rate_above_55.55k = c2))
}

The above function will run in base R.

Here are three examples of its use (copied from an interactive session in R):


###  Specify 4% loss level, 
###  still using the current USS DB accrual rate

> flatcurve(4.0, A = 100/75)
$loss_percent
[1] 4

$employee_contribution_percent
[1] 11

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 14.33333

$DC_employer_rate_above_55.55k
[1] 9

#------------------------------------------------------------
###  This time for a smaller (2%) loss, 
###  with specified employee contribution

> flatcurve(2.0, E = 10.5)
$loss_percent
[1] 2

$employee_contribution_percent
[1] 10.5

$accrual_reciprocal
[1] 70.80745

$DC_employer_rate_below_55.55k
[1] 16.33333

$DC_employer_rate_above_55.55k
[1] 11

#------------------------------------------------------------
### Finally, my personal favourite:
### --- status quo with just the "match" lost

> flatcurve(1, A = 100/75)
$loss_percent
[1] 1

$employee_contribution_percent
[1] 8

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 17.33333

$DC_employer_rate_above_55.55k
[1] 12
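
For comparison across several loss levels at once, the same formulae can be evaluated on a vector. This is only a sketch: it repeats the arithmetic used inside flatcurve rather than calling it, keeps the accrual rate at the current 100/75, and the choice of loss levels 1 to 5 is arbitrary.

```r
## Flat-curve parameters for several loss levels, with the accrual
## rate A held at its current value of 100/75
loss <- 1:5
A <- 100/75
params <- data.frame(
    loss_percent = loss,
    employee_contribution_percent = 7 + loss - 19 * (100/75 - A),
    DC_employer_rate_below_55.55k = 19 * (100/75) - (7 + loss),
    DC_employer_rate_above_55.55k = 13 - loss
)
print(round(params, 2))
```

Each extra percentage point of loss adds one point to the employee contribution and removes one point from each of the two employer DC rates.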

© David Firth, March 2018

To cite this entry:
Firth, D (2018). Simple maths of a fairer USS deal. Weblog entry at URL https://statgeek.net/2018/03/16/simple-maths-of-a-fairer-uss-deal/

USS proposals: Tail wagging the dog?

2018-03-15

Update on 16 March: There’s now a follow-up post to this one, which gives more detail on how (mathematically) to achieve a fairer sharing-out of whatever level of USS member pain might ultimately be deemed necessary.  See Simple maths of a fairer USS deal (but ideally only after reading the necessary background, below!).


In response to my previous post, “Latest USS proposal: Who would lose most?”, someone asked me about doing the same calculation for the USS JNC-supported proposals from January.  For a summary of those January proposals and my comments about their fairness, please see my earlier post “USS pension scheme and fairness”.

Anyway, the calculation is quite simple, and it led to the following graph.  The black curve is as in my previous post, and the red one is from the same calculation done for the January USS proposal.

[Figure: percent of salary lost, by salary, under the March proposal (black curve) and the January proposal (red curve)]

The red curve shows just over 5% effective loss of salary for those below the current £55.55k USS threshold, and then a fairly sharp decline to less than 2% lost at the salaries of the very highest-paid professors, managers and administrators.  Under the January proposals, higher-paid staff would contribute proportionately less to the “rescue package” for USS — less, even, than under the March proposals.  (And if the salary axis were to be extended indefinitely, the red curve would actually cross the zero-line: that’s because in the January proposals the defined-contribution rate from employers would actually have increased from (max) 13% to 13.25%.)

In terms of unequal sharing of the “pain”, then, the January proposal was even worse than the March one.

At the bottom here I’ll give the R code and a few words of explanation for the calculation of the red curve above.

But the main topic of this post arises from a remarkable feature of the above graph! At the current USS threshold salary of £55.55k, the amount lost is the same — it’s 5.08% under both proposals.  Which led me to wonder: is that a coincidence, or was it actually a (pretty weird!) constraint used in the recent UUK-UCU negotiations?  And then to wonder: might the best solution (i.e., for the same cost) be to do something that gives a better graph than either of the two proposals seen so far?

Tail wagging the dog?

The fact that the loss under the March proposal tops out at 5.08%, exactly (to 2 decimals, anyway) the same as in the January proposal, seems unlikely to be a coincidence?

If it’s not a coincidence, then a plausible route to the March proposal, at the UUK-UCU negotiating table, could have been along the lines of:

How can we re-work the January proposal to

  • retain defined benefit, up to some (presumably reduced) threshold and with some (presumably reduced) accrual rate,

while at the same time

  • nobody loses more than the maximum 5.08% that’s in the January proposal
  • the employer contribution rate to the DC pots of high earners is not reduced below the current standard (i.e., without the “match”) level of 12%

?

Those constraints, coupled with total cost to employers, would lead naturally to a family of solutions indexed by just two adjustable constants, namely

  • the threshold salary up to which DB pension applies (previously £55.55k)
  • the DB accrual rate (previously 1/75)

— and it seems plausible that the suggested (12 March 2018) new threshold of £42k and accrual rate of 1/85 were simply selected as the preferred candidate (among many such potential solutions) to offer to UUK and UCU members.

But the curve ought to be flat, or even increasing!

The two constraints listed as second and third bullets in the above essentially fix the position of the part of the black curve that applies to salaries over £55.55k.  That’s what I mean by “tail wagging the dog”.  Those constraints inevitably result in a solution that implies substantial losses for those with low or moderate incomes.

Once this is recognised, it becomes natural to ask: what should the shape of that “percentage loss” curve be?

The answer is surely a matter of opinion.

Those wishing to preserve substantial pension contributions at high salary levels, at the expense of those at lower salary levels, would want a curve that decreases to the right — as seen in the above curves for the January and March proposals.

For myself, I would argue the opposite: The “percent lost” curve should either be roughly constant, or might reasonably even increase as salary increases.  (The obvious parallel being progressive rates of income tax: those who can afford to pay more, pay more.)

I had made a specific suggestion along these lines, in this earlier post:

The details of any solution that satisfies the “percent loss roughly constant, or even increasing” requirement clearly would need to depend on data that’s not so widely available (mainly, the distribution of all salaries for USS members).

But first the principle of fairness needs to be recognised.  And once that is accepted, the constraints underlying future UUK-UCU negotiations would need to change radically — i.e., definitely away from those last two bullets in the above display.

Calculation of the red curve

In the previous post I gave R code for the black curve.  Here is the corresponding calculation behind the red curve:

sacrifice.Jan <- function(salary) { # salary in thousands
    old_threshold <- 55.55
    s <- salary

## sacrifice arising from income up to old_threshold
    s2 <- min(s, old_threshold)
    r2 <- s2 * (19/75 + 1/100 - (13.25 + 8)/100)

## sacrifice (max) arising from income over the old threshold
## -- note that this is negative
    r3 <- (s > old_threshold) * (s - old_threshold) * 
                (13 - 13.25)/100

    return(r2 + r3)
}

## A vector of salary values up to £150k
salaries <- (1:1500) / 10

## Compute percent of salary that would be lost, 
## at each salary level
sacrifices <- 100 * sapply(salaries, sacrifice.Jan) / salaries

In essence:

  • salary under £55.55k would lose the defined benefit (that’s the 19/75 part) and the 1% “match”, and in its place would get 21.25% as defined contribution.  The sum of these parts is the computed loss r2.
  • salary over £55.55k would gain the difference between potential 13% employer contribution and the proposed new rate of 13.25% (that’s the negative value r3 in the code).
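
The two bullet points above also determine where the red curve would eventually cross zero. The sketch below is a vectorised rewrite of the same calculation (the name sacrifice_jan and the search interval passed to uniroot are choices made only for this illustration); it reports the percent lost at the threshold and the salary at which the total loss reaches zero.

```r
## Vectorised form of the January-proposal calculation (salary in thousands)
sacrifice_jan <- function(salary, old_threshold = 55.55) {
    r2 <- pmin(salary, old_threshold) * (19/75 + 1/100 - (13.25 + 8)/100)
    r3 <- pmax(salary - old_threshold, 0) * (13 - 13.25)/100
    r2 + r3
}

## Percent of salary lost at the current threshold
at_threshold <- 100 * sacrifice_jan(55.55) / 55.55

## Salary (in thousands) at which the total loss falls to zero;
## the upper end of the interval is an arbitrary large value
zero_crossing <- uniroot(sacrifice_jan, interval = c(55.55, 5000))$root

print(round(at_threshold, 2))
print(round(zero_crossing))
```

The crossing point is at a salary well over £1 million, so it is of theoretical interest only.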

© David Firth, March 2018

To cite this entry:
Firth, D (2018). USS proposals: Tail wagging the dog?. Weblog entry at URL https://statgeek.net/2018/03/15/uss-proposals-tail-wagging-the-dog/

 

Latest USS proposal: Who would lose most?

2018-03-13

Update on 16 March: After reading this post, you might perhaps be interested in these follow-ups:

Update, 14 March: Some details in the original post yesterday were not quite right, and so the graph/numbers that appear in the now-corrected version below are different in detail from yesterday’s.  But the overall picture is unchanged.  (If you really want to know about those changes in the detail, please see my note in Appendix 2 at the bottom of the post about that.)


Yesterday (March 12th) the UUK/UCU negotiations at ACAS concluded with an agreement document.

In this post I’ll look at the numbers in those proposed interim changes to the Universities Superannuation Scheme, to work out how much money would effectively be lost by USS members at each salary level.

This is inevitably a fairly rough calculation, but its results don’t really demand more precision.  The picture is very clear: the cost of “saving” USS would be felt most by USS members with low or moderate incomes.

The effective marginal rates at which money is lost by members are (as calculated below):

  • 4.7% on salary up to £42k
  • 6.3% on salary between £42k and the current USS threshold salary of £55.55k
  • 1.0% (at most) on salary over £55.55k
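
For readers who want to see the arithmetic straight away, the three marginal rates can be computed in a few lines of base R. This is just a condensed sketch of the calculation explained in full further down, using the HMRC multiplier of 19; the variable names are illustrative.

```r
## Effective marginal loss rates (in percent) for each salary band:
## extra contribution + defined-benefit value lost + 1% "match" lost
band1 <- (8.7 - 8) + 100 * 19 * (1/75 - 1/85) + 1
band2 <- (8.7 - 8) + (100 * 19/75 - (12 + 8.7)) + 1
band3 <- 1
rates <- c(up_to_42k = band1, from_42k_to_55.55k = band2, over_55.55k = band3)
print(round(rates, 1))
```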

This translates into the following relationship between salary and the percentage of total salary lost:

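[Figure: percent of salary lost under the UUK-UCU agreement of 2018-03-12, by salary (thousands), with pay grades 6 to 9 marked in green]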

The two “kinks” in that graph reflect the discontinuities in marginal rates, at £42k and at £55.55k.

The vertical lines drawn in green are current full-time pay grades at a typical university (with no London allowance or other extras): grade 6 is the pay of many Research Associates and Teaching Fellows, for example; grade 7 is the pay of most Lecturers; grade 8 is the pay of Senior Lecturers and Readers; and grade 9 is the pay of Professors and other senior staff.  (I have mentioned only academic and research staff here, but the same grades apply also to administrative and technical staff in UK universities.)

The long decay to the right continues indefinitely, ultimately approaching an asymptote at 1% lost, i.e., for those with absolutely stratospheric salaries (if such people are actually members of USS, still, that is — though I would guess that many are not).

In the rest of this post I’ll give the details of the calculation that leads to the above numbers and graph.  (For people who prefer a list of numbers to a graphical display, I have also added the numbers as an Appendix at the bottom of this post.)

Just here, though, let me again comment on how unfair this “remedy” would be.  The unfairness should be obvious from the above graph: those who are paid most, and would stand to benefit most from being in USS, would contribute least, in percentage terms, in this proposed move towards the future sustainability of USS.  For a more general view on this unfairness, see also my previous two posts in this “USS” category:

The calculation

It suffices to consider salaries in three distinct bands.  In each salary band, we can calculate how much is lost, per unit of salary.

The following code in R reproduces the graph drawn above.  A brief explanation is then given, beneath the displayed code.


## This code runs in base R.

## Function to compute the amount that would be lost annually (£k)
## at any given salary level
sacrifice <- function(salary) { # salary in thousands
    old_threshold <- 55.55
    new_threshold <- 42
    s <- salary

## sacrifice arising from income up to the new threshold
    r1 <- min(s, new_threshold) * ((8.7 - 8)/100 +
                                    19 * (1/75 - 1/85) +
                                    1/100)

## sacrifice arising from income between the thresholds
    s2 <- (s > new_threshold) * (min(s, old_threshold) - 
                                            new_threshold)
    r2 <- s2 * ((8.7 - 8)/100 + (19/75 - (12 + 8.7)/100) + 1/100)

## sacrifice (max) arising from income over the old threshold
    r3 <- (s > old_threshold) * (s - old_threshold) * (1/100)

    return(r1 + r2 + r3)
}

## A vector of salary values up to £150k
salaries <- (1:1500) / 10

## Compute percent of salary that would be lost, 
## at each salary level
sacrifices <- 100 * sapply(salaries, sacrifice) / salaries

## Plot the result
svg(file = "lost.svg", width = 8, height = 4)
plot(salaries, sacrifices, type = "l",
 xlab = "salary (thousands)", ylab = "percent lost",
 main = "Percent of salary lost under UUK-UCU agreement 2018-03-12")
abline(v = c(29, 39, 48, 61), col = "green")
text(x = c(34, 44, 54, 75), y = 2.8,
 labels = c("6", "7", "8", "9"), col = "green")
dev.off()

Band 1: Salary up to £42k

Most contributions from this part of salary go to the “defined benefit” part of USS. The new proposal would see 8.7% of a member’s salary up to £42k going in to this, as opposed to 8.0% at present. The return (i.e., the value of the defined-benefit pension) can readily be calculated using the standard HMRC formula, the one that is used for Annual Allowance purposes. Under current USS, the value of this part is 19 times (s/75), where s is either £42k or the member’s salary if the salary is less than £42k. Under yesterday’s proposals, the value of this part would fall to 19 times (s/85). Under yesterday’s proposals, USS members would also lose the possibility to add 1% “matching” employer contribution to an additional, defined-contribution pension pot. The amount lost to each member, relating to salary in this first band, is then the sum of the additional contribution made and the amount of pension value lost: that is r1 in the above code.

Band 2:  Salary between £42k and £55.55k

Now, for salaries greater than £42k, let s2 be the smaller of (salary minus £42k) and (£55.55k minus £42k). Then current USS has members contributing 8% of s2 in the defined-benefit part, for a return of 19 times s2/75. Yesterday’s proposal would change the contribution to 8.7% of s2, for a return of s2 times (12% + 8.7%). And again, the possibility of 1% matching employer contribution to the defined-contribution pot would be lost. The amount lost to each member, relating to salary in this second band, is again just the sum of the additional contribution made and the amount of pension value lost: that is r2 in the above code.

Band 3: Salary over £55.55k

Relating to salary above the current £55.55k threshold, the loss would be limited to loss of the 1% matching employer contribution.  This is computed as r3 in the above code. (In practice this will be an upper bound on what is lost.  Those USS members with the very highest salaries are likely also to face issues relating to the HMRC Annual Allowance and Lifetime Allowance limits, in which case the loss of the matching employer contribution could be worth substantially less than 1% to them.)

Conclusion

I have reproduced the full calculation here, with code, because I found the result of the calculation so shocking!  If anyone reading this thinks I have made a mistake in the calculation, please do let me know. If it is correct — and right now I have no reason to suspect otherwise — then I confess I’m alarmed that this is actually being proposed as a potential solution, even as an interim solution for the next 3 years, to the perceived problems with USS.  It shakes my faith in those who have been involved in negotiating it.  With seemingly intelligent people on both sides of the table, how could they possibly come up with something as bad as this?

© David Firth, March 2018

To cite this entry: Firth, D (2018). Latest USS proposal: Who would lose most?  Weblog entry at URL https://statgeek.net/2018/03/13/latest-uss-proposal-who-would-lose-most/.


Appendix 1: A tabular view of what’s in the graph

## Make a table for anyone who wants more detail than the graph
salary <- c(10:55, 55.55, 56:100, 150)
percent_lost <- round(100 * sapply(salary, sacrifice) / salary, 2)
salary <- 1000 * salary
my_table <- data.frame(salary, percent_lost)

That’s the code for making a little table, showing the same numbers as those in the above graph.

Here is the resulting table:

salary    %
 10000 4.68 -- I started the table at £10k for no good reason
 11000 4.68
 ...
 41000 4.68
 42000 4.68 -- the proposed new threshold
 43000 4.72
 44000 4.76
 45000 4.79
 46000 4.82
 47000 4.86
 48000 4.89
 49000 4.92
 50000 4.94
 51000 4.97
 52000 5.00
 53000 5.02
 54000 5.05
 55000 5.07
 55550 5.08 -- current USS threshold, highest % of salary lost
 56000 5.05
 57000 4.98
 58000 4.91
 59000 4.84
 60000 4.78
 61000 4.72
 62000 4.66
 63000 4.60
 64000 4.54
 65000 4.49
 66000 4.44
 67000 4.39
 68000 4.34
 69000 4.29
 70000 4.24
 71000 4.19
 72000 4.15
 73000 4.11
 74000 4.07
 75000 4.02
 76000 3.98
 77000 3.95
 78000 3.91
 79000 3.87
 80000 3.84
 81000 3.80
 82000 3.77
 83000 3.73
 84000 3.70
 85000 3.67
 86000 3.64
 87000 3.61
 88000 3.58
 89000 3.55
 90000 3.52
 91000 3.49
 92000 3.47
 93000 3.44
 94000 3.41
 95000 3.39
 96000 3.36
 97000 3.34
 98000 3.31
 99000 3.29
100000 3.27
150000 2.51 -- possibly there are even some salaries this high?!

Appendix 2: Details of the update made on 14 March

Many thanks to all who gave feedback on the original posting, yesterday (13 March).

In response to that feedback, I made two substantive changes to the calculation.  This Appendix gives details of those changes, for those who are interested (and for the record).

Neither change affects the story qualitatively: only the detailed numbers have changed a bit.

Change 1: Use of HMRC multiplier 19 rather than 23

The HMRC calculations for Annual Allowance and Lifetime Allowance purposes are different in detail: the former uses a multiplier of 19 times pension to value USS defined benefits, while the latter uses 23 (i.e., in place of 19).  In yesterday’s post I had used 23.  The updated figures calculated above use multiplier 19 instead.

Mainly I decided to use the smaller figure as it’s a bit more conservative, in relation to the value lost through the proposed reduction of defined benefits.  (I certainly don’t want to be accused of bias in the other direction, through having picked the larger multiplier.)

The effect on the calculated numbers is mainly to reduce the height of the “spike” that appears in the graph, around the £55k salary level.  The spike is still there; it’s just a bit smaller.
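
The size of that effect can be seen by making the multiplier an argument of the calculation. The following is a sketch only: the function name percent_lost_m is made up for this illustration, and it is a vectorised restatement of the three-band calculation given earlier in the post.

```r
## Percent of salary lost (salary in thousands), with the HMRC
## multiplier for defined-benefit pension supplied as an argument
percent_lost_m <- function(salary, multiplier = 19) {
    m <- multiplier
    band1 <- pmin(salary, 42) * (0.7/100 + m * (1/75 - 1/85) + 1/100)
    band2 <- pmax(pmin(salary, 55.55) - 42, 0) *
        (0.7/100 + (m/75 - (12 + 8.7)/100) + 1/100)
    band3 <- pmax(salary - 55.55, 0) * (1/100)
    100 * (band1 + band2 + band3) / salary
}

## Height of the "spike" at the current threshold, for each multiplier
peak <- c(m19 = percent_lost_m(55.55, 19), m23 = percent_lost_m(55.55, 23))
print(round(peak, 2))
```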

My friend Jon commented that the actual value of a defined-benefit pension is harder to quantify than the HMRC formula would suggest — and that it’s likely to be dependent on age and perhaps other factors.  This is undoubtedly true, and certainly I would not suggest that anyone should use the above numbers for their own financial planning!  Rather, the aim here was (only) to show through a simple, transparent calculation how the losses arising from current proposals would differ — in rough, average terms — between pay levels.

Since writing my post yesterday I found that I am not alone in having done a calculation like this: see also http://brianosmith.blogspot.co.uk/ (and maybe there are others too?).

Change 2: Inclusion of the USS “Match” at all salary levels

Several people pointed out to me that the USS “Match” possibility is available at all salary levels.  So it’s a benefit that would be lost at all salary levels, under the 12 March agreement.  In yesterday’s post I had taken it into account only at salaries over £55.55k: that (relatively minor) error is now corrected, in the revised figures shown above.


 

Exit poll for June 2017 election (UK)

2017-06-11

[Image: tweet by David Spiegelhalter, 9 June 2017]

It has been a while since I posted anything here, but I can’t resist this one.

Let me just give three numbers.  The first two are:

  • 314, the number of seats predicted for the largest party (Conservatives) in the UK House of Commons, at 10pm on Thursday (i.e., before even a single vote had been counted) from the exit poll commissioned jointly by broadcasters BBC, ITV and Sky.
  • 318, the actual number of seats that were won by the Conservatives, now that all the votes have been counted.

That highly accurate prediction changed the whole story on election night: most of the pre-election voting intention polls had predicted a substantial Conservative majority.  (And certainly that’s what Theresa May had expected to achieve when she made the mistake of calling a snap election, 3 years early.)  But the exit poll prediction made it pretty clear that the Conservatives would either not achieve a majority (for which 326 seats would be needed), or at best would be returned with a very small majority such as the one they held before the election.  Media commentary turned quickly to how a government might be formed in the seemingly likely event of a hung Parliament, and what the future might be for Mrs May.  The financial markets moved quite substantially, too, in the moments after 10pm.

For more details on the exit poll, its history, and the methods used to achieve that kind of predictive accuracy, see Exit Polling Explained.

The third number I want to mention here is

  • 2.1.0

That’s the version of R that I had at the time of the 2005 General Election, when I completed the development of a fairly extensive set of R functions to use in connection with the exit poll (which at that time was done for BBC and ITV jointly).  Amazingly (to me!) the code that I wrote back in 2001–2005 still works fine.  My friend and former colleague Jouni Kuha, who stepped in as election-day statistician for the BBC when I gave it up after 2005, told me today that (with some tweaks, I presume!) it all works brilliantly still, as the basis for an extremely high-pressure data analysis on election day/night.  Very pleasing indeed; and strong testimony to the heroic efforts of the R Core Development Team, to keep everything stable with a view to the long term.

As suggested by that kind tweet reproduced above from the RSS President, David Spiegelhalter: Thursday’s performance was quite a triumph for the practical art and science of Statistics.  [And I think I am allowed to say this, since on this occasion I was not even there!  The credit for Thursday’s work goes to Jouni Kuha, along with John Curtice, Steve Fisher and the rest of the academic team of analysts who worked in the secret exit-poll “bunker” on 8 June.]

[Image: “Power to the nerds”, 8 June 2017]

R and citations

2011-06-25

We’re hosting the international useR! conference at Warwick this summer, and I thought it might be interesting to try to get some data on how the use of R is growing. I decided to look at scholarly citations to R, mainly because I know where to find the relevant information.

I have access to the ISI Web of Knowledge, as well as to Google Scholar. The data below comes from the ISI Web of Knowledge database, which counts (mainly?) citations found in academic journals.

Background: How R is cited
Since version 0.90.0 of R, which was released in November 1999, the distributed software has included a FAQ document containing (among many other things) information on how to cite R. Initially (in 1999) the instruction given in the FAQ was to cite Ihaka and Gentleman (1996), the paper in the Journal of Computational and Graphical Statistics.

When R version 1.8.1 was released in November 2003 the advice on citing R changed: people using R in published work were asked to cite R Development Core Team (2003).

The “2003” part of the citation advice has changed with each passing year; for example when R 1.9.1 was released (in June 2004) it was updated to “2004”.

ISI Web of Knowledge: Getting the data
Finding the citation counts by searching the ISI database directly does not work, because:

  1. the ISI database does not index Journal of Computational and Graphical Statistics as far back as 1996; and
  2. the “R Development Core Team” citations are (rightly) not counted as citations to journal articles, so they also are not directly indexed.

So here is what I did: I looked up published papers in the ISI index which I knew would cite R correctly. [This was easy; for example my friend Achim Zeileis has published many papers of this kind, so a lot of the results were delivered through a search for his name as an author.] For each such paper, the citation of interest would appear in its references. I then asked the Web of Knowledge search engine for all other papers which cited the same source, with the resulting counts tabulated by year of publication.

It seems that the ISI database aims to associate a unique identifier with each cited item, including items that are not themselves indexed as journal articles in the database. This is what made the approach described above possible.

There’s a hitch, though! It seems that, for some cited items, more than one identifier gets used. Thus it is hard to be sure that the counts below include all of the citations to R: indeed, as I mention further below, I am pretty sure that my search will have missed some citations to R, where the identifier assigned by ISI was not their “normal” one. (This probably seems a bit cryptic, but should become clearer from the table below.)

Citation counts
As extracted from the ISI Web of Knowledge on 25 June 2011:

ISI identifier                                 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Total
IHAKA R, J COMPUTATIONAL GRAP 5 : 299 1996        5   15   18   43  131  290  472  528  435  419  449  378  396  3579
*R DEV COR TEAM, R LANG ENV STAT COMP : 2003                                   39  123   91   57   39   25   14   388
*R DEV COR TEAM, R LANG ENV STAT COMP : 2004                                   16  235  421  327  289  187  126  1601
*R DEV COR TEAM, R LANG ENV STAT COMP : 2005                                        42  397  531  511  445  366  2292
*R DEV COR TEAM, LANG ENV STAT COMP : 2005                                           5   39   75   41   25   10   195
*R DEV COR TEAM, R LANG ENV STAT COMP : 2006                                             55  438  849  656  461  2459
*R DEV COR TEAM, R LANG ENV STAT COMP : 2007                                                  92  714  962  733  2501
*R DEV COR TEAM, R LANG ENV STAT COMP : 2008                                                      208 1402 1906  3516
*R DEV COR TEAM, LANG ENV STAT COMP : 2008                                                          7   21   44    72
*R DEV COR TEAM, R LANG ENV STAT COMP : 2009                                                           172 1363  1535
*R DEV COR TEAM, R LANG ENV STAT COMP : 2010                                                                205   205
*R DEV COR TEAM, R LANG ENV STAT COMP :                                         1   12   14   25   36   81   93   262
Total                                             5   15   18   43  131  290  528  945 1452 1964 3143 4354 5717 18605

For the “R Development Core Team (year)” citations, the peak appears about 2 years after the year concerned. This presumably reflects journal review and backlog times.
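
That two-year lag can be checked directly from the table. The sketch below uses the yearly counts for the main “R LANG ENV STAT COMP” identifiers for 2003 to 2007 only, since citations to later versions may not yet have peaked in data that stop at 2010; the object names are illustrative.

```r
## Yearly citation counts (columns are 2004 to 2010), copied from the table;
## NA marks years before any citations were recorded for that identifier
years <- 2004:2010
counts <- rbind(
    "2003" = c(39, 123, 91, 57, 39, 25, 14),
    "2004" = c(16, 235, 421, 327, 289, 187, 126),
    "2005" = c(NA, 42, 397, 531, 511, 445, 366),
    "2006" = c(NA, NA, 55, 438, 849, 656, 461),
    "2007" = c(NA, NA, NA, 92, 714, 962, 733)
)
colnames(counts) <- years

## Year of peak citations for each identifier, and the lag in years
peak_year <- years[apply(counts, 1, which.max)]
lag <- peak_year - as.numeric(rownames(counts))
print(data.frame(cited_year = rownames(counts), peak_year, lag))
```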

There are almost certainly some ISI identifiers missing from the above table (and, as a result, almost certainly some citations not yet counted by me). For example, the number of citations found above to R Development Core Team (2009) is lower than might be expected given the general rate of growth that is evident in the table: there is probably at least one other identifier by which such citations are labelled in the ISI database (I just haven’t found it/them yet!). If anyone reading this can help with finding the “missing” identifiers and associated citation counts, I would be grateful.

The graph below shows the citations found within each year since 1998.

© David Firth, June 2011

To cite this entry:
Firth, D (2011). R and citations. Weblog entry at URL https://statgeek.wordpress.com/2011/06/25/r-and-citations/.

[Figure: citations to R found within each year since 1998]

Citations to Ihaka and Gentleman (1996) and to R Development Core Team (any year) are distinguished in the graph, and the total count of the two kinds of citation is also shown.