Archive for the ‘Research Assessment’ Category

RSS discussion paper on model-based ranking of journals, using citation data

2015-04-07

This paper has been around on arXiv for quite some time.  Now, having survived various rounds of review — and having grown quite a bit as a result of reviewers’ requests! — it will be discussed at an Ordinary Meeting of the Royal Statistical Society on 13 May 2015 (just follow this link to the recent Allstat announcement, for instructions on how to contribute to the RSS discussion either in person or in writing).

Here is the link to the preprint on arXiv.org:

Statistical modelling of citation exchange between statistics journals by Cristiano Varin, Manuela Cattelan and David Firth.

(Note that the more ‘official’ version, made public at the RSS website, is an initial, uncorrected printer’s proof of the paper for JRSS-A.  It contains plenty of typos!  Those obviously will be eliminated before the paper appears in the Journal.)

The paper has associated online supplementary material (zip file, 0.4MB) comprising datasets used in the paper, and full R code to help potential discussants and other readers to replicate and/or experiment with the reported analyses.

Figure 4 from the paper (a ranking of statistics journals based on the Bradley-Terry model)

The paper’s Summary is as follows:

Rankings of scholarly journals based on citation data are often met with skepticism by the scientific community. Part of the skepticism is due to disparity between the common perception of journals’ prestige and their ranking based on citation counts. A more serious concern is the inappropriate use of journal rankings to evaluate the scientific influence of authors. This paper focuses on analysis of the table of cross-citations among a selection of Statistics journals. Data are collected from the Web of Science database published by Thomson Reuters. Our results suggest that modelling the exchange of citations between journals is useful to highlight the most prestigious journals, but also that journal citation data are characterized by considerable heterogeneity, which needs to be properly summarized. Inferential conclusions require care in order to avoid potential over-interpretation of insignificant differences between journal ratings. Comparison with published ratings of institutions from the UK’s Research Assessment Exercise shows strong correlation at aggregate level between assessed research quality and journal citation ‘export scores’ within the discipline of Statistics.
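
For readers who would like a concrete feel for the core idea before looking at the paper and its R code, here is a minimal Python sketch, written for this post only and not taken from the paper’s supplementary material, of a plain Bradley-Terry fit to an invented cross-citation table. A citation from journal i to journal j is treated as a “win” for j; the maximum-likelihood “ability” parameters, computed here by the classical MM (Zermelo) iteration, then give a ranking. The journal labels and counts below are made up, and the paper’s own analysis goes well beyond this simple fit, in particular in how it handles the heterogeneity of citation data and the uncertainty of the resulting ranking.

```python
import numpy as np

# Invented cross-citation table: cites[i, j] = citations FROM journal i TO journal j.
# Under the Bradley-Terry reading used here, a citation from i to j is a "win" for j.
journals = ["Journal W", "Journal X", "Journal Y", "Journal Z"]
cites = np.array([
    [ 0, 30, 10,  5],
    [20,  0, 15, 10],
    [ 5, 25,  0, 10],
    [ 5, 15,  5,  0],
], dtype=float)

wins = cites.T                 # wins[i, j] = citations received by i from j
n = wins + wins.T              # total citations exchanged within each pair
total_wins = wins.sum(axis=1)

# Classical MM (Zermelo) iteration for the Bradley-Terry maximum-likelihood estimates.
p = np.ones(len(journals))
for _ in range(1000):
    denom = n / (p[:, None] + p[None, :])
    np.fill_diagonal(denom, 0.0)
    p_new = total_wins / denom.sum(axis=1)
    p_new /= p_new.sum()       # fix the arbitrary overall scale
    if np.max(np.abs(p_new - p)) < 1e-10:
        p = p_new
        break
    p = p_new

# Rank the journals by their fitted ability parameters
for name, ability in sorted(zip(journals, p), key=lambda t: -t[1]):
    print(f"{name}: {ability:.3f}")
```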

RAE 2008: How much weight did research outputs actually get?

2010-02-07

In the 2008 UK Research Assessment Exercise each subject-area assessment panel specified and published in advance the weight to be given to each of the three parts of the assessment, namely “research outputs”, “research environment” and “esteem”. The quality “sub-profiles” for those three parts were then combined into an overall quality profile for each department assessed, by using the published weights. The overall quality profiles have since been used in HEFCE funding allocations, and in various league tables published by newspapers and others.

For example, RAE Panel F (Pure and Applied Maths, Statistics, Operational Research, Computer Science and Informatics) specified the following weights:

  • Research outputs: 70%
  • Research environment: 20%
  • Esteem: 10%

The weight specified for research outputs varied from 50% (for RAE Panel G, the engineering disciplines) to 80% (for RAE Panel N, the humanities disciplines).

When the RAE sub-profiles were published in spring 2009, it became clear that the assessments for the three parts were often quite different from one another. For example, some of the assessment panels awarded many more 4* (“world leading”) grades for research environment and esteem than for research outputs. These seemingly systematic differences naturally prompt the question: to what extent are the agreed and published weights for the three parts reflected in funding allocations, league tables, etc.?

Let’s leave the consideration of league tables for another time. Here we’ll calculate the actual relative weights of the three parts in terms of their effect on funding outcomes, and compare those with the weights that were published and used by RAE assessment panels.

The formula used by HEFCE in 2009 awarded quality-related research funding to departments in proportion to

\displaystyle 7 p_{4d} + 3 p_{3d} + p_{2d}

where the p’s come from the department’s overall RAE profile (the percentages at quality levels 4*, 3* and 2*). Now, from the published sub-profile for research outputs, we can also calculate, in the obvious way, how much of any department’s allocated funding came from the research outputs component. The actual weight accorded to research outputs in the 2009 funding outcomes by a given RAE Sub-panel is then

\displaystyle \frac{\sum_d(\textrm{funding from RAE research outputs profile for department } d)}{\sum_d(\textrm{funding from overall RAE profile for department } d)}

where the summations are over all of the departments d assessed by the Sub-panel. (In the calculation here I have used the un-rounded overall profiles, not the crudely rounded ones used by HEFCE in their 2009 funding allocation. I’ll write more about that in a later post. Rounded or un-rounded doesn’t really affect the main point here, though.)
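
As a concrete (and entirely hypothetical) illustration of that calculation, the short Python sketch below applies the funding formula to the overall profile of each department, applies the same formula to the Outputs sub-profile scaled by the published Outputs weight (here 70%, as for Panel F), and takes the ratio of the two sums. The department profiles are invented, and any volume or FTE scaling of the funding is ignored; the point is only to show the mechanics, for both the 7:3:1 and the 9:3:1 funding rates discussed below.

```python
# Hypothetical profiles (percentages at 4*, 3*, 2*, 1*, 0*); not real RAE data.
overall_profiles = {
    "Dept A": (15, 40, 35, 10, 0),
    "Dept B": (25, 45, 25,  5, 0),
    "Dept C": (10, 35, 40, 15, 0),
}
outputs_profiles = {
    "Dept A": (10, 40, 40, 10, 0),
    "Dept B": (20, 45, 30,  5, 0),
    "Dept C": ( 5, 35, 45, 15, 0),
}
OUTPUTS_WEIGHT = 0.70   # the published weight for research outputs (Panel F)

def funding_score(profile, rates):
    """HEFCE-style score: funding rates applied to the 4*, 3*, 2*, 1*, 0* percentages."""
    return sum(r * p for r, p in zip(rates, profile))

for rates in [(7, 3, 1, 0, 0), (9, 3, 1, 0, 0)]:   # 2009 and 2010 funding rates
    total = sum(funding_score(p, rates) for p in overall_profiles.values())
    from_outputs = sum(OUTPUTS_WEIGHT * funding_score(p, rates)
                       for p in outputs_profiles.values())
    print(f"rates {rates[:3]}: actual Outputs weight = {100 * from_outputs / total:.1f}% "
          f"(published weight {100 * OUTPUTS_WEIGHT:.0f}%)")
```

With these invented profiles, in which 4* is rarer in the Outputs sub-profile than elsewhere, the actual Outputs weight comes out noticeably below 70%, and falls further when the 9:3:1 rates are used.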

For 2010 it seems that the HEFCE funding rates will be in the ratio 9:3:1 rather than 7:3:1, i.e., proportionately more funds will be given to departments with a high percentage of work assessed at the 4* quality level. The table below lists the discrepancies between the actual and intended weight given to Outputs, by RAE Sub-panel, using the 2009 and 2010 HEFCE funding rates. For example, the RAE Sub-panel J41 (Sociology) decided that 75% of the weight should go to Outputs, but the reality in 2009 was that only 56.6% of the HEFCE “QR” funding to Sociology departments came via their Outputs sub-profiles; the corresponding figure that appears in the table below is 56.6 – 75 = -18.4. An alternative view of the same numbers is that the Sociology Sub-panel intended to give combined weight 25% to “research environment” and “esteem”, but those two parts of the assessment actually accounted for a very much larger 43.4% of the 2009 funding allocation to Sociology departments (and with the new funding rates for 2010 that will increase to 45.4%).

(entries are actual minus intended weight for research outputs, in percentage points)
RAE Panel   RAE Sub-panel name   Discrepancy 2009   Discrepancy 2010
A Cardiovascular Medicine -2.9 -3.4
A Cancer Studies -3.8 -4.3
A Infection and Immunology -7.7 -9.0
A Other Hospital Based Clinical Subjects -13.4 -16.1
A Other Laboratory Based Clinical Subjects -4.5 -4.9
B Epidemiology and Public Health -10.7 -12.5
B Health Services Research -9.2 -10.6
B Primary Care and Other Community Based Clinical Subjects -5.0 -5.4
B Psychiatry, Neuroscience and Clinical Psychology -6.4 -7.2
C Dentistry 0.2 -0.5
C Nursing and Midwifery -2.3 -3.6
C Allied Health Professions and Studies -1.9 -2.5
C Pharmacy -4.1 -5.3
D Biological Sciences -5.5 -6.0
D Pre-clinical and Human Biological Sciences -4.9 -5.6
D Agriculture, Veterinary and Food Science -7.0 -8.2
E Earth Systems and Environmental Sciences -4.9 -5.6
E Chemistry -3.5 -3.9
E Physics -3.2 -4.1
F Pure Mathematics -3.8 -4.2
F Applied Mathematics -6.0 -7.2
F Statistics and Operational Research -3.2 -3.5
F Computer Science and Informatics 1.8 1.6
G Electrical and Electronic Engineering 2.0 1.8
G General Engineering and Mineral & Mining Engineering 3.2 2.7
G Chemical Engineering 7.1 7.8
G Civil Engineering 1.2 0.6
G Mechanical, Aeronautical and Manufacturing Engineering 3.7 3.7
G Metallurgy and Materials 2.6 2.2
H Architecture and the Built Environment -2.8 -3.7
H Town and Country Planning -3.2 -3.3
H Geography and Environmental Studies -3.8 -4.5
H Archaeology -10.9 -12.6
I Economics and Econometrics -1.2 -1.1
I Accounting and Finance -4.0 -4.7
I Business and Management Studies -5.3 -6.2
I Library and Information Management -8.9 -10.0
J Law -13.4 -14.7
J Politics and International Studies -9.6 -10.0
J Social Work and Social Policy & Administration -8.4 -9.7
J Sociology -18.4 -20.4
J Anthropology -18.6 -21.2
J Development Studies -11.5 -12.9
K Psychology -3.6 -4.6
K Education -5.9 -7.5
K Sports-Related Studies -4.5 -5.5
L American Studies and Anglophone Area Studies -13.3 -14.7
L Middle Eastern and African Studies -14.3 -16.0
L Asian Studies -14.1 -15.5
L European Studies -11.1 -13.1
M Russian, Slavonic and East European Languages -11.1 -12.8
M French -6.9 -7.6
M German, Dutch and Scandinavian Languages -4.5 -5.3
M Italian -8.6 -9.7
M Iberian and Latin American Languages -12.2 -13.8
M Celtic Studies -9.8 -11.3
M English Language and Literature -10.9 -12.7
M Linguistics -7.9 -9.1
N Classics, Ancient History, Byzantine and Modern Greek Studies -8.8 -10.1
N Philosophy -11.9 -14.1
N Theology, Divinity and Religious Studies -9.6 -11.4
N History -8.4 -9.5
O Art and Design -13.9 -14.8
O History of Art, Architecture and Design -1.0 -1.0
O Drama, Dance and Performing Arts -2.8 -2.9
O Communication, Cultural and Media Studies 0.7 0.8
O Music -4.5 -4.6

RAE 2008: relation between 2009 and 2010 funding rates

Most of the discrepancies are negative: the actual weight given to research outputs, in terms of funding, is less than was apparently intended by most of the assessment panels. Some of the discrepancies are very large indeed — more than 20 percentage points in the cases of Sociology and Anthropology, under the HEFCE funding rates that will be applied in 2010.

Click on the image for a graphical view of the relationship between the discrepancies for 2009 (funding rates 7:3:1:0:0 for the five RAE quality levels) and 2010 (funding rates 9:3:1:0:0).

\begin{opinion}
In RAE 2008 the agreed and published weights were the result of much discussion and public consultation, most of which centred on the perceived relative importance of the three components (research outputs, research environment, esteem) in different research disciplines. The discrepancies that are evident here arise from the weighted averaging of three separate profiles without (it seems) careful consideration of the differences of distribution between them. In the case of funding, it’s (mainly) differences in the usage of the 4* quality level that matter: if 4* is a relatively rare assessment for research outputs but is much more common for research environment, for example, the upshot is that the quality of research outputs actually determines less of the funding than the published weights would imply.

It is to be hoped that measures will be put in place to rectify this in the forthcoming replacement for the RAE, the Research Excellence Framework. In particular, the much-debated weight of 25% for the proposed new “impact” part of the REF assessment might actually turn out to be appreciably more if we’re not careful (the example of Sociology, see above, should be enough to emphasise this point).
\end{opinion}
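
A hypothetical numerical illustration of the mechanism described in the opinion above (all numbers invented): take a department assessed with Panel F’s 70:20:10 weights, whose Outputs sub-profile is 10% at 4*, 40% at 3*, 40% at 2* and 10% at 1*, while its Environment sub-profile is 40% at 4*, 40% at 3*, 20% at 2* and its Esteem sub-profile is 30% at 4*, 50% at 3*, 20% at 2*. The weighted-average overall profile is then 18% at 4*, 41% at 3*, 34% at 2* and 7% at 1*, so the 7:3:1 funding score is

\displaystyle (7 \times 18) + (3 \times 41) + 34 = 283\ ,

of which the Outputs component contributes only 0.7 × [(7 × 10) + (3 × 40) + 40] = 161, i.e. about 57% of the funding rather than the published 70%; Environment and Esteem, with their much more generous 4* percentages, account for the remaining 43%.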

Acknowledgement
The calculation done here was suggested to me by my friend Bernard Silverman, and indeed he did the same calculation independently himself (for the 2009 funding formula) and got the same results. The opinion expressed above is mine, not necessarily shared by Bernard.

© David Firth, February 2010

To cite this entry:
Firth, D (2010). RAE 2008: How much weight did research outputs actually get? Weblog entry at https://statgeek.wordpress.com/2010/02/07/rae-how/.

RAE 2008: Assessed quality of research in different disciplines

2009-11-25

RAE 2008 aggregate quality assessments, by discipline

This graph was drawn with the help of my daughter Kathryn on her “take your daughter to work” day in Year 10 at school. Her skill with spreadsheet programs was invaluable!

The graph shows how different disciplines — that is, different RAE “sub-panels” or “units of assessment” — emerged in terms of their average research quality as assessed in RAE 2008. The main data used to make the graph are the overall (rounded) quality profiles and submission-size data that were published in December 2008. Those published quality profiles were the basis (in March 2009) of ‘QR’ funding allocations made for 2009-10 by HEFCE to universities and other institutions.

Each bar in the graph represents one academic discipline (as defined by the remit of an RAE sub-panel). The blue and pink colouring shows how the sub-panels were organised into 15 RAE “main panels”. A key task of the main panels was to try to ensure comparability between the assessments made for different disciplines. Disciplines within the first seven main panels are the so-called “STEM” (Science, Technology, Engineering and Mathematics) subjects.

The height of each bar is calculated as the average, over all “full-time equivalent” (FTE) researchers whose work was submitted to the RAE, of a “quality score” calculated directly from the published RAE profile of each researcher’s department. The quality score for members of department d is calculated as a weighted sum

\displaystyle w_4 p_{4d} + w_3 p_{3d} + w_2 p_{2d} + w_1 p_{1d} + w_0 p_{0d}\ ,

where the p’s represent the department’s RAE profile and the w’s are suitably defined weights (with w_4 ≥ w_3 ≥ … ≥ w_0). The particular weights used in constructing such a quality score are rather arbitrary; here I have used 7:3:1:0:0, the same weights that were used in HEFCE’s 2009 funding allocation, but it would not make very much difference, for the purpose of drawing this graph to compare whole disciplines, to use something else such as 4:3:2:1:0.

Example: for a department whose RAE profile is

          4*   3*   2*   1*   0*
          0.10 0.25 0.30 0.30 0.05

the quality score assigned to each submitted researcher is

\displaystyle (7 \times 0.10) + (3\times 0.25) + (1 \times 0.30) = 1.75\ .

The average quality score over the whole RAE was about 2.6 (the green line in the graph).
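
For anyone who wants to reproduce this kind of summary, here is a small Python sketch of the construction just described, using invented departments and FTE counts rather than the real RAE data. Each department gets a quality score from its overall profile via the 7:3:1:0:0 weights, and the discipline’s bar height is the FTE-weighted average of those scores. “Dept A” below has exactly the example profile shown above, so its score is 1.75.

```python
# Invented departments in one discipline: (profile at 4*,3*,2*,1*,0* in %, FTE submitted).
departments = {
    "Dept A": ((10, 25, 30, 30, 5), 18.0),
    "Dept B": ((20, 40, 30, 10, 0), 35.5),
    "Dept C": (( 5, 30, 40, 20, 5), 12.0),
}

WEIGHTS = (7, 3, 1, 0, 0)   # the 2009 HEFCE funding weights used in this post

def quality_score(profile, weights=WEIGHTS):
    """Weighted sum of the profile percentages, divided by 100 so that the
    example profile above scores 1.75 on the same scale as in the text."""
    return sum(w * p for w, p in zip(weights, profile)) / 100

# FTE-weighted average over the discipline: each submitted researcher carries
# his or her department's score.
total_fte = sum(fte for _, fte in departments.values())
average = sum(quality_score(profile) * fte
              for profile, fte in departments.values()) / total_fte
print(f"Discipline average quality score: {average:.2f}")
```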

The graph shows a fair amount of variation between disciplines, both within and between main panels of the RAE. The differences may of course reflect, at least to some extent, genuine differences in the perceived quality of research in different disciplines; the top-level RAE assessment criteria were the same for all disciplines, so in principle this kind of comparison between disciplines might be made (although in practice the verbally described criteria would surely be interpreted differently by assessors in different disciplines). However, it does appear that some main panels were appreciably tougher in their assessments than others. Even within main panels it looks as though the assessments of different sub-panels might not really be comparable. (On this last point, Main Panel F even went so far as to make an explicit comment in the minutes of its final meeting in October 2008, available in this zip file from the RAE website: noting the discrepancy between assessments for Computer Science and Informatics and for the other three disciplines under its remit, Panel F minuted that “…this discrepancy should not be taken to be an indication of the relative strengths of the subfields in the institutions where comparisons are possible.” I have not checked the minutes of other panels for similar statements.)

Note that, although the HEFCE funding weights were used in constructing the scores that are summarized here, the relative funding rates for different disciplines cannot be read straightforwardly from the above graph. This is because HEFCE took special measures to protect the research funding to disciplines in the “STEM” group. Within the non-STEM group of disciplines, relative heights in the above graph equate to relative funding rates; the same applies also within each main panel among the STEM disciplines. (On this last point, and in relation to the discrepancy minuted by Panel F as mentioned above: Panel F also formally minuted its hope that the discrepancy would not adversely affect the QR funding allocated to Pure Maths, Applied Maths, and Statistics & Operational Research. But, perhaps unsurprisingly, that expression of hope from Panel F was ignored in the actual funding formula!)

\begin{opinion}
HEFCE relies heavily on the notion that assessment panels are able to regulate each other’s behaviour, so as to arrive at assessments which allow disciplines to be compared (for funding purposes at least, and perhaps for other purposes as well). This strikes me as wishful thinking, at best! By allowing the relative funding of different disciplines to follow quality scores so directly, HEFCE has created a simple game in which the clear incentive for the assessors in any given discipline is to make their own scores as high as they can get away with. The most successful assessment panel, certainly in the eyes of its own colleagues, is not the one that does the best job of assessing quality faithfully, but the one with the highest scores at the end! This seems an absurd way to set up such an expensive, and potentially valuable, research assessment exercise. Unfortunately, in the current plans for the REF (Research Excellence Framework, the RAE’s successor) there is little or no evidence that HEFCE has a solution to this problem. The REF pilot study seems to have concluded, as perhaps expected, that routinely generated “bibliometric” measures cannot be used at all reliably for such inter-discipline comparisons.

Since I don’t have an alternative solution to offer, I strongly favour the de-coupling of allocation of funds between disciplines from research quality assessment. If the Government or HEFCE wishes or needs to increase its funding for research in some disciplines at the expense of others, it ought to be for a good and clear reason; research assessment panels will inevitably vary in their interpretation of the assessment criteria and in their scrupulousness, and such variation should not be any part of the reason.
\end{opinion}

© David Firth, November 2009

To cite this entry:
Firth, D (2009). RAE 2008: Assessed quality of research in different disciplines. Weblog entry
at URL https://statgeek.wordpress.com/2009/11/25/rae-2008-assessed/.