**Update**, 2019-01-07: I am pleased to say that the online media article that I complained about in Sec 1 below has now been amended by its author(s), to correct the false attributions. I am grateful to Chris Parr for helping to sort this out.

In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s *Office for Students*, about “grade inflation” in English universities. This second post continues on the same topic.

In this *Part 2* I will do two things:

- Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
- Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.

The more substantial stuff will be the second bullet there (and of course I wish I didn’t need to do the first bullet at all). In this post (at section 2 below) I will just *outline* a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers. Later I will follow up (in my intended *Part 3*) with a more detailed description of the suggested better method; that *Part 3* post will be suitable mainly for readers with more specialist background in Statistics.

## 1. For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:

- At https://www.researchprofessional.com/0/rr/he/agencies/ofs/2019/OfS-grade-inflation-analysis-not-fit-for-purpose–says-expert.html, an article entitled “OfS grade inflation analysis not fit for purpose, says expert”
- At https://www.researchresearch.com/news/article/?articleId=1379083, which seems to be a straight copy of the same article (I have not checked in detail).

The first link there is to a paywalled site, I think. The second one appears to be in the public domain. I do not recommend following either of those links, though! If anyone reading this wants to know about what I wrote in *Part 1*, then my advice is just to read Part 1 directly.

Here I want to mention three specific ways in which **that article misrepresents what I wrote in Part 1**. Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):

- The article refers to my blog post as
**“a review commissioned by HE”**. The reality is that a journalist called Chris Parr had emailed me just before Christmas. In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”. I had heard of*Research Fortnight.*And I was indeed interested in taking a look at the methods used in the OfS report. But until the above-mentioned article came to my attention, I had never even heard of a publication named*HE.*Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between*HE*and me, with specified deliverables and perhaps even payment for the work. There was in fact no such “commission” for the work that I did. I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that). And let me repeat: I had never even heard of*HE*(nor of the article’s apparent author, which was not Chris Parr). No payment was offered or demanded. I mention all this here only in case anyone who has read that article got a wrong impression from it. - The article contains this false statement:
**“The data is too complex for a reliable statistical method to be used, he said”**. The “he” there refers to me, David Firth. I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr. Indeed, it is not something I ever*would*say: the phrase “data…too complex for a reliable statistical method” is a nonsense. - The article contains this false statement:
**“He calls the OfS analysis an example of Simpson’s paradox”**. Again, the “he” in that statement refers to me. But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else. (And nor could I have, since I do not have access to the OfS dataset.) What I actually wrote in my blog post was that my own*artificial, specially-constructed example*was an instance of Simpson’s paradox — which is not even close to the same thing!

The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report. I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe). The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.

To be completely clear:

**The article mentioned above is misleading. I do not recommend it to anyone.****All of my posts in this blog are my own work, not commissioned by anyone.**In particular, none of what I’ll continue to write below (and also in*Part 3*of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.

## 2. Towards a better (statistical) measurement model

I have to admit that in Part 1 I ran out of steam at one point, specifically where — in response to my own question about what would be a better way than the method used in the OfS report — I wrote “*I do not have an answer*“. I could have and should have done better than that.

Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in *Part 1*, i.e., the fact that measurement at too high a level of aggregation can give misleading answers. I will demonstrate my suggested new approach through the same, contrived example that I used in *Part 1*. This should be enough to convey the basic idea, I hope. [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write *Part 3*.]

On reflection, I think a lot of the criticism seen by the OfS report since its publication relates to the use of the word “explain” in that report. And indeed, that was a factor also in my own (mentioned above) “*I do not have an answer*” comment. It seems obvious — to me, anyway — that any serious attempt to *explain* apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university. With the data used in the OfS report I think the best that one can hope to do is to *measure* those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time. If we take that attitude — i.e, that **the aim is only to measure things well**, not to explain them — then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.

(I fully recognise that this actually *was* the attitude taken in the OfS work! It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive *explanations* for any apparent changes in degree-class distributions.)

### 2.1 Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1: * Two academic years, two types of university, two types of student. *The data are as follows:

2010-11 University A University B Firsts Other Firsts Other h 1000 0 h 500 500 i 0 1000 i 500 500 2016-17 University A University B Firsts Other Firsts Other h 1800 200 h 0 0 i 0 0 i 500 1500

Our measurement (of change) should reflect the fact that, *for each type of student within each university*, where information is available, *the percentage awarded Firsts actually decreased* (in this example).

Change in percent awarded firsts: University A, student type h: 100% --> 90% University A, student type i: no data University B, student type h: no data University B, student type i: 50% --> 25%

This provides the key to specification of a suitable (statistical) measurement model:

- measure the changes at the lowest level of aggregation possible;
- then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.

In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university. (In the *real* OfS data, there’s a lower level of aggregation that will be more appropriate, since different *degree courses* within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)

In Statistics this kind of analysis is often called a *stratified* analysis. The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified *strata*, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).

In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities). In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:

- in University A, student type
*i*saw their percentage of Firsts reduced by 10%; - in University B, student type
*h*saw their percentage of Firsts reduced by 50%.

That’s all the information in the data, about changes in the rate at which Firsts are awarded. (It was a deliberately small dataset!)

If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow. To some extent this is arbitrary, and the choice of a combination method ought to depend on the *purpose* of such a sector-wide measure and (especially) on the *interpretation desired* for it. I might find time to write more about this later in *Part 3*.

For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method. The result obtained by that method was a sector-wide *increase* of 7.5% in the rate at which Firsts are awarded — which is plainly misleading in the face of data that shows substantial *decreases* in both universities. Whilst I do not much like the OfS Report’s “compare with 2010” approach, it does have the benefit of transparency and in my “toy” example it is easy to apply to the stratified analysis:

2016-17 Expected Firsts Actual based on 2010-11 University A 2000 1800 University B 1000 500 ------------------------------------------ Total 3000 2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account. (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)

As before, I have made the full *R* code available (as an update to my earlier *R Markdown* document). For those who don’t use *R*, I attach here also a PDF copy of that: grade-inflation-example.pdf

### 2.2 Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.

The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same *kind* of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).

This will be developed further in *Part 3*, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been. Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above. With notation

*u*for providers (universities);*u*is either*A*or*B*in the toy example*t*for type of student;*t*is either*h*or*i*in the toy example*y*for years;*y*is either 2010-11 or 2016-17 in the toy example- for the probability of a First in year
*y*, for students of type*t*in university*u*

the logistic regression model corresponding to the analysis above is

.

This is readily generalized to situations involving more strata (more universities *u* and student types *t*, and also degree-courses *within* universities). There were just 4 stratum parameters in the above example, but more strata are easily accommodated.

The model is readily generalized also, in a similar way, to more than 2 years of data.

For comparison, the corresponding logistic regression model as used *in the OfS report* looks like this:

.

So it is superficially very similar. But the all-important term that determines the necessary strata *within* universities is missing from the OfS model.

I will aim to flesh this out a bit in a new *Part 3* post within the next few days, if time permits. For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!) Naming things is not my strong point, unfortunately! But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.

(The key word there is “stratified”.)

© David Firth, January 2019

**To cite this entry:**

Firth, D (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at URL https://statgeek.net/2019/01/07/part-2-further-comments-on-ofs-grade-inflation-report/