About heading soccer balls, and memory loss


This new research paper, by a group led from the University of Stirling, made the national news in the UK last week:

Di Virgilio, T.G., et al., Evidence for Acute Electrophysiological and Cognitive Changes Following Routine Soccer Heading, EBioMedicine (2016), http://dx.doi.org/10.1016/j.ebiom.2016.10.029

My declaration of interest:

I will write some notes below about my reading of the paper.  But first I should make clear that I am not a completely disinterested scientist when it comes to this topic.  For quite some years now, my son and I have been avid supporters of West Bromwich Albion FC, where calls for better research on the long-term effects of heading footballs have been made following the death of former Albion and England centre forward Jeff Astle in 2002 (aged 59).  Jeff Astle was a prolific goalscorer for West Brom, well known for his outstanding ability as a header of the ball.  The Coroner’s verdict in 2002 was “death by industrial disease”, and his report on Jeff Astle’s death included the comment that “The trauma caused to the front of his brain [by heading the ball] is likely to have had a considerable effect on the cause of death.”  There was even an adjournment debate in the House of Commons on this subject, following Jeff Astle’s death.

The background to my notes below is that I strongly support the case for better research on the long-term effects of heading a football: it seems clear that not enough is known about the health risks, and such questions as whether heading the ball is safer now that footballs have become lighter.

Some of the news headlines from last week:

Heading footballs ‘affects memory’ (BBC Scotland, 2016-10-24)

Heading footballs affects memory and brain function, study finds (ITV News, 2016-10-24)

Study finds heading a football has immediate effect on the brain (The Guardian, 2016-10-24)

Heading a Soccer Ball Affects Memory Function (Wall Street Journal video, 2016-10-24)

Calls for more research as football headers linked to memory loss (Sky News, 2016-10-24).  (This features a short video clip of Dawn Astle, Jeff’s daughter, talking persuasively about the need for thorough, longitudinal research.)

Here are the original press release and an online article by some of the authors of the original research paper:

Heading a football causes instant changes to the brain (NIHR press release, 2016-10-24)

How we discovered that heading a football causes impairment of brain function (The Conversation, 2016-10-24)

And on the same day, the story was reported also on the public news website of the UK’s National Health Service:

Heading footballs may cause short-term brain changes (NHS Choices, 2016-10-24)

My reading of the original research paper

The research reported in the paper is a small, before-and-after experiment.  Data are analysed from 19 amateur footballers who took part in a controlled heading-practice session, with various measurements made before and after the session (immediately before, immediately after, and at three later time-points up to 2 weeks after).

The paper’s main findings are based on before-to-after differences in three of the measurements made, these three having apparently achieved statistical significance in the experiment, with reported p-values less than the pre-assigned threshold of 0.05.  The three “statistically significant” differences found were:

  1. The “primary outcome measure cSP” — a measure of corticomotor inhibition — was found to have increased for 14 of the 19 participants when measured immediately after the heading practice session.  The reported p-value, for the apparent increase in response time that was seen on average, is 0.049.  [Section 3.1 of the paper]
  2. The “Spatial Working Memory” (SWM) test scores showed an increased error rate on average (on the log scale the change was from 0.79 before to 1.00 after the heading session).  The reported p-value for this apparent difference is 0.03.  [Section 3.2 of the paper]
  3. The “Paired Associates Learning” (PAL) test scores also showed an increased error rate on average (on the log scale the change was from 0.38 before to 0.65 after).  The reported p-value for this apparent difference is 0.007.  [Section 3.2 of the paper]

How to interpret those apparent effects and their p-values?

I was prompted to think a bit about this by the reported p-value of 0.049 for the primary outcome measure: that’s only just less than the pre-assigned threshold of 0.05.  So if it’s agreed that p equal to 0.05 is the largest value that can reasonably count as “statistically significant” evidence, the value of p=0.049 found for this apparent increase in cSP time should probably be labelled “almost insignificant”!  (This is in agreement with the “14 out of 19” finding mentioned already above, for the number of subjects whose cSP time had shown any increase at all; a simple sign test is enough to tell us that 14 out of 19 is not quite significant at the 0.05 level.)

But was 0.05 a reasonable threshold to use, anyway?   A computed p-value of 0.05, or even 0.03, should really be considered very weak evidence when quite a large number of different measurements are being recorded and tested, as was the case in this study.    As Table 2 of the paper shows, there were 8 different measurements taken, each done on four occasions after the heading session: that’s a lot of chances to find some “significant” differences.  The much-used threshold of p<0.05, which is designed to limit the chance of a spuriously significant finding to 5% when conducting a single test, is much more likely to throw up spuriously significant results when several hypotheses are being tested.  A crude Bonferroni correction based on 8 tested differences, for example, would result in the threshold of 0.05 being reduced to 0.05/8 = 0.006, as a much more stringent criterion to apply in order to be sure that the chance of a spuriously significant finding is still less than 5%.
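To see why multiplicity matters here, suppose for illustration that the 8 tests were independent and all 8 null hypotheses were true.  The family-wise chance of at least one spuriously “significant” result at the 0.05 level, and the Bonferroni-corrected threshold, are then:

```python
alpha, m = 0.05, 8  # per-test significance threshold; number of outcomes tested

# Chance of at least one spurious "significant" finding across the 8 tests
# (independence assumed, purely for illustration):
fwer = 1 - (1 - alpha) ** m
print(f"family-wise error rate = {fwer:.3f}")  # about 0.337

# Bonferroni-corrected per-test threshold:
print(f"Bonferroni threshold = {alpha / m:.5f}")  # 0.00625
```

So a reported p-value of 0.03, or even 0.007, looks much less impressive once the number of tests is taken into account.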

Of the paper’s three main findings, then, only the third one — the increased average error rate in the PAL test scores — seems to be an apparent effect that might demand further attention.  The paper mentions [in Section 3.2] that an increased error rate in the PAL test is compatible with reduced long-term memory function.  (But note that if we do take the route of a Bonferroni correction, to allow for the fact that 8 different measurements were tested — while still neglecting the number of occasions on which post-session measurements were made — the reported p-value of 0.007 still would fail to reach significance at the traditional 5% level.)

Some methodological quibbles and questions

Q1.  The big one: Causality

The press release mentioned above, and hence the media coverage of this research paper, reports an apparently causal link found between routine heading of footballs and outcomes such as long-term memory function.  Such a causal link does seem rather plausible, a priori.  But the research reported in this paper does not (for me, at any rate) firmly establish that cause.  The study design leaves open the possibility of an alternative explanation (for an increase in PAL test error scores, for example).  The paper’s authors allude to this problem in their Discussion section [Section 4 of the paper], where they write: “Future work should include a control activity such as body movement without head impact”.  I do agree; and careful design of the control regime is essential if causality is to be established compellingly.

What sort of alternative explanation(s) might there be?  Well, the problem is that heading the football was not the only thing that happened to each of the 19 experimental subjects between the first two sets of measurements.  Some other things that might conceivably have produced an effect are:

  • the passing of time (e.g., time since breakfast?)
  • the order in which measurements were taken (the research paper is not altogether clear on this, actually — there seem to be conflicting statements in Section 2.2 there)
  • the thrill of taking repeated shots at a goal (which might have been just as great had the shots been volley-kicks instead of headers?)

I am not suggesting here that any of these possible alternative causes is the reality; only that if we want to establish that heading the ball is a cause of something, then other potential causes must be eliminated as part of the study design or through external scientific knowledge.

Q2.  Missing data?

In Section 2.1 of the paper it is mentioned that there were originally 23 study participants recruited.  This got reduced to 19 for the analysis, because “Data from one participant could not be analyzed and three more participants withdrew from the study for personal reasons”.  It would have been good to have some more detail on this.  In particular:

  • what was it about one participant’s data that meant they could not be analyzed?
  • at what point in the study did each of the other three participants withdraw?

Q3.  Size of effect, misleadingly reported?

Update, 2016-11-04:  I have now heard back from one of the paper’s authors about this question.  Dr Magdalena Ietswaart has kindly told me that the 67% figure “is based on raw scores rather than log transformed values”.  In which case, my conjecture below about the origin of that figure was false — and for that reason I have now struck through it, and I humbly apologise to the paper’s authors for having guessed wrongly about this particular point.  (The 67% increase still does strike me as startling, though!)

As discussed above, of all the various effect sizes that are considered in the paper it is the increase in the PAL test error rate that might merit further attention, since that is the one effect for which the experimental evidence might be viewed as statistically significant.  In the paper [Section 3.2], it is stated that the error score on the PAL task immediately after heading increased by 67%.  But this rather startling statement of the effect size — which appeared also in much of the press coverage, and indeed is the single number that prompted me to read the paper in full — appears to be wrong.    If my calculations are correct, the increase found in the PAL test error rate is in fact more like 26% (which is still a fairly substantial increase, of course).

The reason for the discrepancy is that the 67% figure appears to have been calculated directly from Table 2 in the paper, where the increase in the PAL test error rate is measured by averaging the logarithms of the raw error rates.  The ratio (0.65 – 0.38)/0.38, calculated from Table 2, is roughly 0.67 to within the effects of rounding error, i.e., it corresponds to a 67% increase in the logarithm.  But it makes no sense at all to calculate a ratio of logarithms — the numerical value of such a ratio in the present context is completely meaningless.  (This will be obvious to the reader who understands how logarithms work, but probably not otherwise!  The key point, mathematically, is that while a percentage increase in error rate — or in anything else, for that matter — does not depend at all on the units of measurement, the same is not the case after logarithmic transformation.  A ratio of logarithms will depend in a completely arbitrary way on the original units of measurement used, and so will be meaningless.)
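The unit-dependence of a ratio of logarithms is easy to demonstrate numerically (a toy illustration only; the error scores below are made up):

```python
from math import log

before, after = 10.0, 12.6  # hypothetical raw error scores, in arbitrary units

# A percentage increase does not depend on the units of measurement:
print(round(after / before - 1, 2))                # 0.26 on any scale
print(round((after * 100) / (before * 100) - 1, 2))  # still 0.26

# A ratio of logarithms does depend on the units, and so is meaningless here:
print(round(log(after) / log(before), 3))          # about 1.1
print(round(log(after * 100) / log(before * 100), 3))  # about 1.033 -- changed by rescaling
```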

How did I get the 26% figure mentioned above as the correct percentage increase?  Well it’s actually something of a guess, because I do not have access to the raw data.  (I have asked the paper’s authors if they will share the data with me; but right now I only have Table 2 to work from.)  It’s probably not such a bad guess, though.  I made the working assumption that the distributions underlying the figures in Table 2 are normal, which seems reasonable given the rationale for logarithmic transformation that is given in Section 2.8 of the paper.  With that assumption, the ratio of means is calculated (by standard properties of the log-normal distribution) as

       exp(0.65 + 0.29^2/2) / exp(0.38 + 0.41^2/2) = 1.26

I should emphasise, though, that this is largely guesswork.  In particular, it is possible that I have misunderstood how the 67% figure, quoted in Section 3.2 of the paper, was arrived at.
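For the record, the log-normal calculation above runs as follows (using the means and SDs on the log scale as reported in Table 2 of the paper, and the standard fact that if log X is Normal(mu, sigma^2) then E(X) = exp(mu + sigma^2/2)):

```python
from math import exp

# PAL error scores on the log scale, from Table 2 of the paper: mean and SD
mu_before, sd_before = 0.38, 0.41
mu_after, sd_after = 0.65, 0.29

# Mean of a log-normal variable is exp(mu + sigma^2 / 2):
mean_before = exp(mu_before + sd_before ** 2 / 2)
mean_after = exp(mu_after + sd_after ** 2 / 2)

print(f"ratio of mean error rates = {mean_after / mean_before:.2f}")  # 1.26
```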


Let me restate here what I said near the top: I strongly support the case for better research on the long-term effects of heading a football.  With that in mind, I am disappointed that the paper that I have read and discussed here does not provide very convincing evidence to help our understanding.

One rather positive aspect of the paper is that it did get a lot of media coverage, which helped to bring the issue (and accompanying memories of Jeff Astle!) to wider public attention, at least for a day or two.

But, as Dawn Astle so eloquently argued in the Sky News interview that’s linked above: there is still a clear need for good research on the matter of “routine”, but potentially long-lasting, brain injuries in football.

© David Firth, November 2016

To cite this entry:
Firth, D (2016). About heading soccer balls, and memory loss. Weblog entry at URL https://statgeek.wordpress.com/2016/11/03/about-heading-soccer-balls-and-memory-loss/.

RSS discussion paper on model-based ranking of journals, using citation data


This paper has been around on arXiv for quite some time.  Now, having survived various rounds of review — and having grown quite a bit as a result of reviewers’ requests! — it will be discussed at an Ordinary Meeting of the Royal Statistical Society on 13 May 2015 (just follow this link to the recent Allstat announcement, for instructions on how to contribute to the RSS discussion either in person or in writing).

Here is the link to the preprint on arXiv.org:

Statistical modelling of citation exchange between statistics journals by Cristiano Varin, Manuela Cattelan and David Firth.

(Note that the more ‘official’ version, made public at the RSS website, is an initial, uncorrected printer’s proof of the paper for JRSS-A.  It contains plenty of typos!  Those obviously will be eliminated before the paper appears in the Journal.)

The paper has associated online supplementary material (zip file, 0.4MB) comprising datasets used in the paper, and full R code to help potential discussants and other readers to replicate and/or experiment with the reported analyses.


Figure 4 from the paper (a ranking of statistics journals based on the Bradley-Terry model)

The paper’s Summary is as follows:

Rankings of scholarly journals based on citation data are often met with skepticism by the scientific community. Part of the skepticism is due to disparity between the common perception of journals’ prestige and their ranking based on citation counts. A more serious concern is the inappropriate use of journal rankings to evaluate the scientific influence of authors. This paper focuses on analysis of the table of cross-citations among a selection of Statistics journals. Data are collected from the Web of Science database published by Thomson Reuters. Our results suggest that modelling the exchange of citations between journals is useful to highlight the most prestigious journals, but also that journal citation data are characterized by considerable heterogeneity, which needs to be properly summarized. Inferential conclusions require care in order to avoid potential over-interpretation of insignificant differences between journal ratings. Comparison with published ratings of institutions from the UK’s Research Assessment Exercise shows strong correlation at aggregate level between assessed research quality and journal citation ‘export scores’ within the discipline of Statistics.

Facts checked: The UK general election, 7-way TV debate between Westminster party leaders


I was part way through doing some of this myself, when I found that fullfact.org has done a great job already with checking some of the main numbers quoted in last night’s televised debate:


Just moved to WordPress


This blog was previously hosted at Warwick Blogs. I have moved everything here today, mainly because WordPress has such a strong feature set and supportive community. I hope I didn’t forget anything. The old blog posts will remain in place at Warwick but will not be developed: all future posts will be made here.

R and citations


We’re hosting the international useR! conference at Warwick this summer, and I thought it might be interesting to try to get some data on how the use of R is growing. I decided to look at scholarly citations to R, mainly because I know where to find the relevant information.

I have access to the ISI Web of Knowledge, as well as to Google Scholar. The data below comes from the ISI Web of Knowledge database, which counts (mainly?) citations found in academic journals.

Background: How R is cited
Since version 0.90.0 of R, which was released in November 1999, the distributed software has included a FAQ document containing (among many other things) information on how to cite R. Initially (in 1999) the instruction given in the FAQ was to cite the journal article Ihaka and Gentleman (1996).

When R version 1.8.1 was released in November 2003 the advice on citing R changed: people using R in published work were asked to cite R Development Core Team (2003).

The “2003” part of the citation advice has changed with each passing year; for example when R 1.9.1 was released (in June 2004) it was updated to “2004”.

ISI Web of Knowledge: Getting the data
Finding the citation counts by searching the ISI database directly does not work, because:

  1. the ISI database does not index Journal of Computational and Graphical Statistics as far back as 1996; and
  2. the “R Development Core Team” citations are (rightly) not counted as citations to journal articles, so they also are not directly indexed.

So here is what I did: I looked up published papers in the ISI index which I knew would cite R correctly. [This was easy; for example my friend Achim Zeileis has published many papers of this kind, so a lot of the results were delivered through a search for his name as an author.] For each such paper, the citation of interest would appear in its references. I then asked the Web of Knowledge search engine for all other papers which cited the same source, with the resulting counts tabulated by year of publication.

It seems that the ISI database aims to associate a unique identifier with each cited item, including items that are not themselves indexed as journal articles in the database. This is what made the approach described above possible.

There’s a hitch, though! It seems that, for some cited items, more than one identifier gets used. Thus it is hard to be sure that the counts below include all of the citations to R: indeed, as I mention further below, I am pretty sure that my search will have missed some citations to R, where the identifier assigned by ISI was not their “normal” one. (This probably seems a bit cryptic, but should become clearer from the table below.)

Citation counts
As extracted from the ISI Web of Knowledge on 25 June 2011:

ISI identifier  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  Total
—                  5    15    18    43   131   290   472   528   435   419   449   378   396   3579
—                                                    39   123    91    57    39    25    14    388
—                                                    16   235   421   327   289   187   126   1601
—                                                          42   397   531   511   445   366   2292
—                                                           5    39    75    41    25    10    195
—                                                                55   438   849   656   461   2459
—                                                                      92   714   962   733   2501
—                                                                           208  1402  1906   3516
—                                                                             7    21    44     72
—                                                                                 172  1363   1535
—                                                                                        205    205
—                                                     1    12    14    25    36    81    93    262
Total              5    15    18    43   131   290   528   945  1452  1964  3143  4354  5717  18605

For the “R Development Core Team (year)” citations, the peak appears about 2 years after the year concerned. This presumably reflects journal review and backlog times.

There are almost certainly some ISI identifiers missing from the above table (and, as a result, almost certainly some citations not yet counted by me). For example, the number of citations found above to R Development Core Team (2009) is lower than might be expected given the general rate of growth that is evident in the table: there is probably at least one other identifier by which such citations are labelled in the ISI database (I just haven’t found it/them yet!). If anyone reading this can help with finding the “missing” identifiers and associated citation counts, I would be grateful.

The graph below shows the citations found within each year since 1998.  [Click on the graph to view it at a larger size.]  Citations to Ihaka and Gentleman (1996) and to R Development Core Team (any year) are distinguished in the graph, and the total count of the two kinds of citation is also shown.

© David Firth, June 2011

To cite this entry:
Firth, D (2011). R and citations. Weblog entry at URL https://statgeek.wordpress.com/2011/06/25/r-and-citations/.

Have rail fares gone up this year?


Why does that big number end in 8?
I have to go to London tomorrow, so I thought I’d check how much the price of my normal rail ticket has increased in the new year. I didn’t ask for the first class fare, but they told me it anyway. Having picked myself up off the floor, I’m a bit curious about that last digit. (Click on the image to see it more clearly.)

RAE 2008: How much weight did research outputs actually get?


In the 2008 UK Research Assessment Exercise each subject-area assessment panel specified and published in advance the weight to be given to each of the three parts of the assessment, namely “research outputs”, “research environment” and “esteem”. The quality “sub-profiles” for those three parts were then combined into an overall quality profile for each department assessed, by using the published weights. The overall quality profiles have since been used in HEFCE funding allocations, and in various league tables published by newspapers and others.

For example, RAE Panel F (Pure and Applied Maths, Statistics, Operational Research, Computer Science and Informatics) specified the following weights:

  • Research outputs: 70%
  • Research environment: 20%
  • Esteem: 10%

The weight specified for research outputs varied from 50% (for RAE Panel G, engineering disciplines) to 80% (for RAE Panel N, humanities disciplines).

When the RAE sub-profiles were published in spring 2009, it became clear that the assessments for the three parts were often quite different from one another. For example, some of the assessment panels awarded many more 4* (“world leading”) grades for research environment and esteem than for research outputs. These seemingly systematic differences naturally prompt the question: to what extent are the agreed and published weights for the three parts reflected in funding allocations, league tables, etc.?

Let’s leave the consideration of league tables for another time. Here we’ll calculate the actual relative weights of the three parts in terms of their effect on funding outcomes, and compare those with the weights that were published and used by RAE assessment panels.

The formula used by HEFCE in 2009 awarded quality-related research funding to departments in proportion to

\displaystyle 7 p_{4d} + 3 p_{3d} + p_{2d}

where the p’s come from the department’s overall RAE profile (being the percentages at quality levels 4*, 3* and 2*). Now, from the published sub-profile for research outputs, it can also be calculated how much of any department’s allocated funding came from the research outputs component, in the obvious way. The actual weight accorded to research outputs in the 2009 funding outcomes by a given RAE Sub-panel is then

\displaystyle{\sum_d(\textrm{funding from RAE research outputs profile for department } d) \over\sum_d(\textrm{funding from overall RAE profile for department } d)}

where the summations are over all of the departments d assessed by the Sub-panel. (In the calculation here I have used the un-rounded overall profiles, not the crudely rounded ones used by HEFCE in their 2009 funding allocation. I’ll write more about that in a later post. Rounded or un-rounded doesn’t really affect the main point here, though.)
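The mechanism can be sketched for a single department as follows (the three sub-profiles below are invented purely for illustration, and Panel F’s published 70/20/10 weights are assumed):

```python
# Shares (p4, p3, p2) at the 4*, 3* and 2* quality levels, for one
# hypothetical department's three sub-profiles:
outputs = (0.10, 0.40, 0.35)
environment = (0.40, 0.40, 0.20)  # more 4* than for outputs
esteem = (0.30, 0.50, 0.20)
published_weights = (0.7, 0.2, 0.1)  # outputs, environment, esteem (Panel F)

def funding_rate(p4, p3, p2):
    """HEFCE 2009 funding formula: proportional to 7*p4 + 3*p3 + p2."""
    return 7 * p4 + 3 * p3 + p2

# The overall profile is the weighted average of the three sub-profiles:
overall = tuple(
    sum(w * p for w, p in zip(published_weights, parts))
    for parts in zip(outputs, environment, esteem)
)

# Actual share of funding that arrives via the outputs sub-profile:
actual = 0.7 * funding_rate(*outputs) / funding_rate(*overall)
print(f"actual weight of outputs = {actual:.1%}")  # well below the published 70%
```

Because the environment and esteem sub-profiles in this toy example carry more 4* than the outputs sub-profile does, the outputs component ends up determining well under 70% of the funding: that is exactly the mechanism behind the negative discrepancies tabulated for the real Sub-panels.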

For 2010 it seems that the HEFCE funding rates will be in the ratio 9:3:1 rather than 7:3:1, i.e., proportionately more funds will be given to departments with a high percentage of work assessed at the 4* quality level. The table below lists the discrepancies between the actual and intended weight given to Outputs, by RAE Sub-panel, using the 2009 and 2010 HEFCE funding rates. For example, the RAE Sub-panel J41 (Sociology) decided that 75% of the weight should go to Outputs, but the reality in 2009 was that only 56.6% of the HEFCE “QR” funding to Sociology departments came via their Outputs sub-profiles; the corresponding figure that appears in the table below is 56.6 – 75 = -18.4. An alternative view of the same numbers is that the Sociology Sub-panel intended to give combined weight 25% to “research environment” and “esteem”, but those two parts of the assessment actually accounted for a very much larger 43.4% of the 2009 funding allocation to Sociology departments (and with the new funding rates for 2010 that will increase to 45.4%).

RAE Panel RAE Sub-panel name 2009 2010
A Cardiovascular Medicine -2.9 -3.4
A Cancer Studies -3.8 -4.3
A Infection and Immunology -7.7 -9.0
A Other Hospital Based Clinical Subjects -13.4 -16.1
A Other Laboratory Based Clinical Subjects -4.5 -4.9
B Epidemiology and Public Health -10.7 -12.5
B Health Services Research -9.2 -10.6
B Primary Care and Other Community Based Clinical Subjects -5.0 -5.4
B Psychiatry, Neuroscience and Clinical Psychology -6.4 -7.2
C Dentistry 0.2 -0.5
C Nursing and Midwifery -2.3 -3.6
C Allied Health Professions and Studies -1.9 -2.5
C Pharmacy -4.1 -5.3
D Biological Sciences -5.5 -6.0
D Pre-clinical and Human Biological Sciences -4.9 -5.6
D Agriculture, Veterinary and Food Science -7.0 -8.2
E Earth Systems and Environmental Sciences -4.9 -5.6
E Chemistry -3.5 -3.9
E Physics -3.2 -4.1
F Pure Mathematics -3.8 -4.2
F Applied Mathematics -6.0 -7.2
F Statistics and Operational Research -3.2 -3.5
F Computer Science and Informatics 1.8 1.6
G Electrical and Electronic Engineering 2.0 1.8
G General Engineering and Mineral & Mining Engineering 3.2 2.7
G Chemical Engineering 7.1 7.8
G Civil Engineering 1.2 0.6
G Mechanical, Aeronautical and Manufacturing Engineering 3.7 3.7
G Metallurgy and Materials 2.6 2.2
H Architecture and the Built Environment -2.8 -3.7
H Town and Country Planning -3.2 -3.3
H Geography and Environmental Studies -3.8 -4.5
H Archaeology -10.9 -12.6
I Economics and Econometrics -1.2 -1.1
I Accounting and Finance -4.0 -4.7
I Business and Management Studies -5.3 -6.2
I Library and Information Management -8.9 -10.0
J Law -13.4 -14.7
J Politics and International Studies -9.6 -10.0
J Social Work and Social Policy & Administration -8.4 -9.7
J Sociology -18.4 -20.4
J Anthropology -18.6 -21.2
J Development Studies -11.5 -12.9
K Psychology -3.6 -4.6
K Education -5.9 -7.5
K Sports-Related Studies -4.5 -5.5
L American Studies and Anglophone Area Studies -13.3 -14.7
L Middle Eastern and African Studies -14.3 -16.0
L Asian Studies -14.1 -15.5
L European Studies -11.1 -13.1
M Russian, Slavonic and East European Languages -11.1 -12.8
M French -6.9 -7.6
M German, Dutch and Scandinavian Languages -4.5 -5.3
M Italian -8.6 -9.7
M Iberian and Latin American Languages -12.2 -13.8
M Celtic Studies -9.8 -11.3
M English Language and Literature -10.9 -12.7
M Linguistics -7.9 -9.1
N Classics, Ancient History, Byzantine and Modern Greek Studies -8.8 -10.1
N Philosophy -11.9 -14.1
N Theology, Divinity and Religious Studies -9.6 -11.4
N History -8.4 -9.5
O Art and Design -13.9 -14.8
O History of Art, Architecture and Design -1.0 -1.0
O Drama, Dance and Performing Arts -2.8 -2.9
O Communication, Cultural and Media Studies 0.7 0.8
O Music -4.5 -4.6

RAE 2008: relation between 2009 and 2010 funding rates
Most of the discrepancies are negative: the actual weight given to research outputs, in terms of funding, is less than was apparently intended by most of the assessment panels. Some of the discrepancies are very large indeed — more than 20 percentage points in the cases of Sociology and Anthropology, under the HEFCE funding rates that will be applied in 2010.

Click on the image for a graphical view of the relationship between the discrepancies for 2009 (funding rates 7:3:1:0:0 for the five RAE quality levels) and 2010 (funding rates 9:3:1:0:0).

In RAE 2008 the agreed and published weights were the result of much discussion and public consultation, most of which centred on the perceived relative importance of the three components (research outputs, research environment, esteem) in different research disciplines. The discrepancies that are evident here arise from the weighted averaging of three separate profiles without (it seems) careful consideration of the differences of distribution between them. In the case of funding, it’s (mainly) differences in the usage of the 4* quality level that matter: if 4* is a relatively rare assessment for research outputs but is much more common for research environment, for example, the upshot is that the quality of research outputs actually determines less of the funding than the published weights would imply.

It is to be hoped that measures will be put in place to rectify this in the forthcoming replacement for the RAE, the Research Excellence Framework. In particular, the much-debated weight of 25% for the proposed new “impact” part of the REF assessment might actually turn out to be appreciably more if we’re not careful (the example of Sociology, see above, should be enough to emphasise this point).

The calculation done here was suggested to me by my friend Bernard Silverman, and indeed he did the same calculation independently himself (for the 2009 funding formula) and got the same results. The opinion expressed above is mine, not necessarily shared by Bernard.

© David Firth, February 2010

To cite this entry:
Firth, D (2010). RAE 2008: How much weight did research outputs actually get? Weblog entry at https://statgeek.wordpress.com/2010/02/07/rae-how/.

RAE 2008: Assessed quality of research in different disciplines


RAE 2008 aggregate quality assessments, by discipline (1295 x 788 pixels)
This graph was drawn with the help of my daughter Kathryn on her “take your daughter to work” day in Year 10 at school. Her skill with spreadsheet programs was invaluable!

The graph shows how different disciplines — that is, different RAE “sub-panels” or “units of assessment” — emerged in terms of their average research quality as assessed in RAE 2008. The main data used to make the graph are the overall (rounded) quality profiles and submission-size data that were published in December 2008. Those published quality profiles were the basis (in March 2009) of ‘QR’ funding allocations made for 2009-10 by HEFCE to universities and other institutions.

Each bar in the graph represents one academic discipline (as defined by the remit of an RAE sub-panel). The blue and pink colouring shows how the sub-panels were organised into 15 RAE “main panels”. A key task of the main panels was to try to ensure comparability between the assessments made for different disciplines. Disciplines within the first seven main panels are the so-called “STEM” (Science, Technology, Engineering and Mathematics) subjects.

The height of each bar is calculated as the average, over all “full-time equivalent” (FTE) researchers whose work was submitted to the RAE, of a “quality score” calculated directly from the published RAE profile of each researcher’s department. The quality score for members of department d is calculated as a weighted sum

\displaystyle w_4 p_{4d} + w_3 p_{3d} + w_2 p_{2d} + w_1 p_{1d} + w_0 p_{0d}\ ,

where the p’s represent the department’s RAE profile and the w’s are suitably defined weights (with w4 ≥ w3 ≥ … ≥ w0). The particular weights used in constructing such a quality score are rather arbitrary; here I have used 7:3:1:0:0, the same weights that were used in HEFCE’s 2009 funding allocation, but it would not make very much difference, for the purpose of drawing this graph to compare whole disciplines, to use something else such as 4:3:2:1:0.

Example: for a department whose RAE profile is

          4*   3*   2*   1*   0*
          0.10 0.25 0.30 0.30 0.05

the quality score assigned to each submitted researcher is

\displaystyle (7 \times 0.10) + (3\times 0.25) + (1 \times 0.30) = 1.75\ .
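The weighted-sum calculation can be sketched in a few lines of Python. This is just an illustrative check of the arithmetic above, using HEFCE’s 2009 weights of 7:3:1:0:0 and the example profile:

```python
# Weighted quality score for the example RAE profile above,
# using HEFCE's 2009 funding weights 7:3:1:0:0.
weights = {"4*": 7, "3*": 3, "2*": 1, "1*": 0, "0*": 0}
profile = {"4*": 0.10, "3*": 0.25, "2*": 0.30, "1*": 0.30, "0*": 0.05}

score = sum(weights[star] * p for star, p in profile.items())
print(round(score, 2))  # 1.75
```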

The average quality score over the whole RAE was about 2.6 (the green line in the graph).
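The discipline-level bars then average these departmental scores, weighting each department by its submitted FTE count. A minimal sketch of that aggregation (the department scores and FTE numbers here are invented purely for illustration):

```python
# FTE-weighted average quality score for one discipline.
# Each entry is (department quality score, FTE researchers submitted);
# these numbers are made up for illustration only.
departments = [(1.75, 20.0), (2.90, 35.5), (2.40, 12.0)]

total_fte = sum(fte for _, fte in departments)
avg = sum(score * fte for score, fte in departments) / total_fte
print(round(avg, 2))  # 2.47
```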

The graph shows a fair amount of variation between disciplines, both within and between main panels of the RAE. The differences may of course reflect, at least to some extent, genuine differences in the perceived quality of research in different disciplines; the top-level RAE assessment criteria were the same for all disciplines, so in principle this kind of comparison between disciplines might be made (although in practice the verbally described criteria would surely be interpreted differently by assessors in different disciplines). However, it does appear that some main panels were appreciably tougher in their assessments than others. Even within main panels it looks as though the assessments of different sub-panels might not really be comparable. (On this last point, main panel F even went so far as to make an explicit comment in the minutes of its final meeting in October 2008 (available in this zip file from the RAE website): noting the discrepancy between assessments for Computer Science and Informatics and for the other three disciplines under its remit, Panel F minuted that “…this discrepancy should not be taken to be an indication of the relative strengths of the subfields in the institutions where comparisons are possible.” I have not checked the minutes of other panels for similar statements.)

Note that, although the HEFCE funding weights were used in constructing the scores that are summarized here, the relative funding rates for different disciplines cannot be read straightforwardly from the above graph. This is because HEFCE took special measures to protect the research funding to disciplines in the “STEM” group. Within the non-STEM group of disciplines, relative heights in the above graph equate to relative funding rates; the same applies also within each main panel among the STEM disciplines. (On this last point, and in relation to the discrepancy minuted by Panel F as mentioned above: Panel F also formally minuted its hope that the discrepancy would not adversely affect the QR funding allocated to Pure Maths, Applied Maths, and Statistics & Operational Research. But, perhaps unsurprisingly, that expression of hope from Panel F was ignored in the actual funding formula!)

HEFCE relies heavily on the notion that assessment panels are able to regulate each other’s behaviour, so as to arrive at assessments which allow disciplines to be compared (for funding purposes at least, and perhaps for other purposes as well). This strikes me as wishful thinking, at best! By allowing the relative funding of different disciplines to follow quality scores so directly, HEFCE has created a simple game in which the clear incentive for the assessors in any given discipline is to make their own scores as high as they can get away with. The most successful assessment panel, certainly in the eyes of their own colleagues, is not the one that does the best job of assessing quality faithfully, but the one with the highest scores at the end! This seems an absurd way to set up such an expensive, and potentially valuable, research assessment exercise. Unfortunately in the current plans for the REF (Research Excellence Framework, the RAE’s successor) there is little or no evidence that HEFCE has a solution to this problem. The REF pilot study seems to have concluded, as perhaps expected, that routinely generated “bibliometric” measures cannot be used at all reliably for such inter-discipline comparisons.

Since I don’t have a better solution to offer, I strongly favour de-coupling the allocation of funds between disciplines from research quality assessment. If the Government or HEFCE wishes or needs to increase its funding for research in some disciplines at the expense of others, it ought to be for a good and clear reason; research assessment panels will inevitably vary in their interpretation of the assessment criteria and in their scrupulousness, and such variation should not be any part of the reason.

© David Firth, November 2009

To cite this entry:
Firth, D (2009). RAE 2008: Assessed quality of research in different disciplines. Weblog entry
at URL https://statgeek.wordpress.com/2009/11/25/rae-2008-assessed/.