## Archive for the ‘universities’ Category

### Robust measurement from a 2-way table

2019-04-26

I work in a university.  My department runs degree courses that allow students a lot of flexibility in their choice of course “modules”.  (A typical student takes 8–10 modules per year, and is assessed separately on each module.)

After the exams are finished each year, we promise our students that we will look carefully at the exam marks for each module, to ensure that students taking a “hard” module are not penalized for doing that, and that students taking an “easy” module are not unduly advantaged.

The challenge in this is to separate module difficulty from student ability: we need to be able to tell the difference between (for example) a hard module and a module that was chosen by weaker-than-average students.  This necessitates analysis of the exam marks for all modules together, rather than separately.

The data to be analysed are each student’s score (expressed as a percentage) in each module they took.  It is convenient to arrange those scores in a 2-way table, whose rows are indexed by student IDs, and whose columns correspond to all the different possible modules that were taken.  The task is then to analyse the (typically incomplete) 2-way table, to determine a numerical “module effect” for each module (a relatively high number for each module that was found relatively “easy”, and lower numbers for modules that were relatively “hard”).

A standard method for doing this robustly (i.e., in such a way that the analysis is not influenced too strongly by the performance of a small number of students) is the clever median polish method due to J W Tukey.  My university department has been using median polish now for several years, to identify any strong “module effects” that ought to be taken into account when assessing each student’s overall performance in their degree course.

Median polish works mostly OK, it seems: it gives answers that broadly make sense.  But there are some well known problems, including that it matters which way round the table is presented (i.e., “rows are students”, versus “rows are modules”) — the answer will depend on that.  So median polish is actually not just one method, but two.

When my university department asked me recently to implement its annual median-polish exercise in R, I could not resist thinking a bit about whether there might be something even better than median polish, for this specific purpose of identifying the column effects (module effects) robustly.  This led me to look at some simple “toy” examples, to help understand the principles.  I’ll just show one such example here, to illustrate how it’s possible to do better than median polish in this particular context.

## Example: 5 modules, 3 students

    > x
           module
    student  A  B  C  D  E
          i NA NA NA 45 60
          j NA NA NA 55 60
          k 10 20 30 NA 50

There were five modules (labelled A,B,C,D,E).  Students i, j and k each took a selection of those modules.  It’s a small dataset, but that is deliberate: we can see easily what’s going on in a table this small.  Module E was easier than the others, for example; and student k looks to be the weakest student (since k was outperformed by the other two students in module E, the only one that they all took).

I will call the above table perfect, as far as the measurement of module effects is concerned.  If we assign module effects (−20, −10, 0, 10, 20) to the five modules A,B,C,D,E respectively, then for every pair of modules the observed within-student differences are centered upon the relevant difference in those module effects.  For example, look at modules D and E: student i scores 15 points more in E, while j scores 5 points more in E, and the median of those two differences is 10 — the same as the difference between the proposed “perfect” module effects for D and E.
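
The “perfect” property can be checked directly in R.  The following is a minimal sketch: the object names `effects`, `pairs` and `ok` are introduced here for illustration only, and the table is typed in by hand from the printout above.

```r
## Toy marks table: rows are students, columns are modules (NA = module not taken)
x <- matrix(c(NA, NA, NA, 45, 60,
              NA, NA, NA, 55, 60,
              10, 20, 30, NA, 50),
            nrow = 3, byrow = TRUE,
            dimnames = list(student = c("i", "j", "k"),
                            module = c("A", "B", "C", "D", "E")))

## Proposed "perfect" module effects (from the text)
effects <- c(A = -20, B = -10, C = 0, D = 10, E = 20)

## For every pair of modules with at least one student in common, compare the
## median within-student difference with the difference in module effects
pairs <- combn(colnames(x), 2)
ok <- logical(0)
for (p in seq_len(ncol(pairs))) {
    a <- pairs[1, p]
    b <- pairs[2, p]
    d <- x[, a] - x[, b]
    if (all(is.na(d))) next    # no student took both modules
    ok[paste(a, b, sep = "-")] <- median(d, na.rm = TRUE) == effects[[a]] - effects[[b]]
}
print(ok)
```

Every module pair that shares at least one student (seven pairs in this table) should pass the check; the pairs A-D, B-D and C-D are skipped because no student took both modules.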

When we perform median polish on this table, we get different answers depending on whether we apply the method to the table directly, or to its transpose:

    > medpolish(x, na.rm = TRUE, maxiter = 20)
    ...
    Median Polish Results (Dataset: "x")

    Overall: 38.75

    Row Effects:
        i     j     k
     0.00  5.00 -8.75

    Column Effects:
         A      B      C      D      E
    -20.00 -10.00   0.00   8.75  20.00

    Residuals:
           module
    student  A  B  C    D     E
          i NA NA NA -2.5  1.25
          j NA NA NA  2.5 -3.75
          k  0  0  0   NA  0.00

    > medpolish(t(x), na.rm = TRUE, maxiter = 20)
    ...
    Median Polish Results (Dataset: "t(x)")

    Overall: 36.25

    Row Effects:
         A      B      C      D      E
    -20.00 -10.00   0.00  11.25  20.00

    Column Effects:
         i      j      k
     0.625  5.625 -6.250

    Residuals:
          student
    module      i      j  k
         A     NA     NA  0
         B     NA     NA  0
         C     NA     NA  0
         D -3.125  1.875 NA
         E  3.125 -1.875  0


Neither of those answers is the same as the “perfect” module-effect measurement that was mentioned above.  The module effect for D as computed by median polish is either 8.75 or 11.25, depending on the orientation of the input table — but not the “perfect 10”.

## A better method: Median difference analysis

I decided to implement, in place of median polish, a simple non-iterative method that targets directly the notion of “perfect” measurement that is mentioned above.

The method is in two stages.

Stage 1 computes within-student differences and takes the median of those, for each possible module pair.  For our toy example:

    > md <- meddiff(x)
       A   B   C  D   E
    A NA -10 -20 NA -40
    B  1  NA -10 NA -30
    C  1   1  NA NA -20
    D  0   0   0 NA -10
    E  1   1   1  2  NA


The result here has all of the available median-difference values above the diagonal.  Below the diagonal is the count of how many differences were used in computing each one of those medians.  So, for example, the median difference between modules  D and E is −10; and that was computed from 2 students’ exam scores.

Stage 2 then fits a linear model to the median-difference values, using weighted least squares.  The linear model finds the vector of module effects that most closely approximates the available median differences (i.e., best approximates the numbers above the diagonal).  The weights are simply the counts from the lower triangle of the above matrix.
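
In symbols, and with notation that is assumed here purely for illustration: write $d_{ab}$ for the median of the within-student differences (mark in module $a$ minus mark in module $b$), $n_{ab}$ for the number of students contributing to that median, and $\theta_a$ for the effect of module $a$.  Stage 2 can then be read as choosing the module effects to minimise a weighted sum of squares,

$\hat\theta = \arg\min_{\theta} \sum_{a<b} n_{ab}\left(d_{ab} - (\theta_a - \theta_b)\right)^2$,

where the sum runs over the module pairs that have at least one student in common.  Only differences between module effects are determined by this criterion, so one constraint (for example, setting the effect of a “reference” module to zero) is needed to pin down the values.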

In this “perfect” example, we achieve the desired perfect answer (which here is presented with E as the “reference” module):

    > fit(md)$coefficients
      A   B   C   D   E
    -40 -30 -20 -10   0

My plan now is to make these simple R functions robust enough to use for our students’ actual exam marks, and to add also inference on the module-effect values (via a suitably designed bootstrap calculation).

For now, here are my prototype functions in case anyone else wants to play with them:

    meddiff <- function(xmat) {
        ## rows are students, columns are modules
        S <- nrow(xmat)
        M <- ncol(xmat)
        result <- matrix(NA, M, M)
        rownames(result) <- colnames(result) <- colnames(xmat)
        for (m in 1:(M-1)) {
            for (mm in (m+1):M) {
                diffs <- xmat[, m] - xmat[, mm]
                ## upper triangle
                result[m, mm] <- median(diffs, na.rm = TRUE)
                ## lower triangle
                result[mm, m] <- sum(!is.na(diffs))
            }
        }
        return(result)
    }

    fit <- function(m) {
        ## matrix m needs to be fully connected above the diagonal
        upper <- upper.tri(m)
        diffs <- m[upper]
        weights <- t(m)[upper]
        rows <- factor(row(m)[upper])
        cols <- factor(col(m)[upper])
        X <- cbind(model.matrix(~ rows - 1), 0) - cbind(0, model.matrix(~ cols - 1))
        colnames(X) <- colnames(m)
        ## label each row by position (indexing by the factor `cols` would use
        ## its integer codes, which are offset by one from the column numbers)
        rownames(X) <- paste0(colnames(m)[row(m)[upper]], "-", colnames(m)[col(m)[upper]])
        result <- lm.wfit(X, diffs, weights)
        result$coefficients[is.na(result$coefficients)] <- 0
        class(result) <- c("meddiff_fit", "list")
        return(result)
    }
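
As a cross-check on the toy example, the two stages can also be sketched in a few self-contained lines.  This is only an illustrative sketch, not a replacement for the prototype functions: the names `stage1`, `X`, `wls` and `module_effects` are introduced here, and module E is dropped from the design matrix so that it acts as the reference module.

```r
## Toy marks table (rows: students, columns: modules)
x <- matrix(c(NA, NA, NA, 45, 60,
              NA, NA, NA, 55, 60,
              10, 20, 30, NA, 50),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("i", "j", "k"), c("A", "B", "C", "D", "E")))

## Stage 1: one row per module pair, with the median difference and the count
pairs <- t(combn(ncol(x), 2))
stage1 <- data.frame(
    a = pairs[, 1],
    b = pairs[, 2],
    d = apply(pairs, 1, function(p) median(x[, p[1]] - x[, p[2]], na.rm = TRUE)),
    n = apply(pairs, 1, function(p) sum(!is.na(x[, p[1]] - x[, p[2]])))
)
stage1 <- stage1[stage1$n > 0, ]    # keep only pairs with a student in common

## Stage 2: design matrix with +1 for module a and -1 for module b
X <- matrix(0, nrow(stage1), ncol(x), dimnames = list(NULL, colnames(x)))
X[cbind(seq_len(nrow(stage1)), stage1$a)] <- 1
X[cbind(seq_len(nrow(stage1)), stage1$b)] <- -1

## Weighted least squares, with the last module (E) as the reference
wls <- lm.wfit(X[, -ncol(x), drop = FALSE], stage1$d, stage1$n)
module_effects <- c(wls$coefficients, E = 0)
print(round(module_effects, 6))
```

On the toy table this should reproduce the module effects reported above (that is, -40, -30, -20, -10 and 0, relative to module E).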


To cite this entry:
Firth, D (2019). Robust measurement from a 2-way table. Weblog entry at

### Part 2, further comments on OfS grade-inflation report

2019-01-07

Update, 2019-01-07: I am pleased to say that the online media article that I complained about in Sec 1 below has now been amended by its author(s), to correct the false attributions.  I am grateful to Chris Parr for helping to sort this out.

In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s Office for Students, about “grade inflation” in English universities.  This second post continues on the same topic.

In this Part 2 I will do two things:

1. Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
2. Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.

The more substantial stuff will be the second item there (and of course I wish I didn’t need to do the first at all).  In this post (at section 2 below) I will just outline a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers.  Later I will follow up (in my intended Part 3) with a more detailed description of the suggested better method; that Part 3 post will be suitable mainly for readers with more specialist background in Statistics.

## 1.  For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:

The first link there is to a paywalled site, I think.  The second one appears to be in the public domain.  I do not recommend following either of those links, though!  If anyone reading this wants to know about what I wrote in Part 1, then my advice is just to read Part 1 directly.

Here I want to mention three specific ways in which that article misrepresents what I wrote in Part 1.  Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):

1. The article refers to my blog post as “a review commissioned by HE”.  The reality is that a journalist called Chris Parr had emailed me just before Christmas.  In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”.  I had heard of Research Fortnight.  And I was indeed interested in taking a look at the methods used in the OfS report.  But until the above-mentioned article came to my attention, I had never even heard of a publication named HE.  Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between HE and me, with specified deliverables and perhaps even payment for the work.  There was in fact no such “commission” for the work that I did.  I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that).  And let me repeat: I had never even heard of HE (nor of the article’s apparent author, which was not Chris Parr).  No payment was offered or demanded.  I mention all this here only in case anyone who has read that article  got a wrong impression from it.
2. The article contains this false statement: “The data is too complex for a reliable statistical method to be used, he said”.  The “he” there refers to me, David Firth.  I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr.  Indeed, it is not something I ever would say: the phrase “data…too complex for a reliable statistical method” is a nonsense.
3. The article contains this false statement: “He calls the OfS analysis an example of Simpson’s paradox”.  Again, the “he” in that statement refers to me.  But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else.  (And nor could I have, since I do not have access to the OfS dataset.)  What I actually wrote in my blog post was that my own artificial, specially-constructed example was an instance of Simpson’s paradox — which is not even close to the same thing!

The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report.  I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe).  The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.

To be completely clear:

• The article mentioned above is misleading.  I do not recommend it to anyone.
• All of my posts in this blog are my own work, not commissioned by anyone.  In particular, none of what I’ll continue to write below (and also in Part 3 of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.

## 2.  Towards a better (statistical) measurement model

I have to admit that in Part 1 I ran out of steam at one point, specifically where (in response to my own question about what would be a better way than the method used in the OfS report) I wrote “I do not have an answer”.  I could have and should have done better than that.

Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in Part 1, i.e., the fact that measurement at too high a level of aggregation can give misleading answers.  I will demonstrate my suggested new approach through the same, contrived example that I used in Part 1.  This should be enough to convey the basic idea, I hope.  [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write Part 3.]

On reflection, I think a lot of the criticism received by the OfS report since its publication relates to the use of the word “explain” in that report.  And indeed, that was a factor also in my own (mentioned above) “I do not have an answer” comment.  It seems obvious (to me, anyway) that any serious attempt to explain apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university.  With the data used in the OfS report I think the best that one can hope to do is to measure those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time.  If we take that attitude (i.e., that the aim is only to measure things well, not to explain them) then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.

(I fully recognise that this actually was the attitude taken in the OfS work!  It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive explanations for any apparent changes in degree-class distributions.)

### 2.1  Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1:  Two academic years, two types of university, two types of student.  The data are as follows:

    2010-11
         University A           University B
        Firsts  Other          Firsts  Other
    h     1000      0      h      500    500
    i        0   1000      i      500    500

    2016-17
         University A           University B
        Firsts  Other          Firsts  Other
    h     1800    200      h        0      0
    i        0      0      i      500   1500


Our measurement (of change) should reflect the fact that, for each type of student within each university, where information is available, the percentage awarded Firsts actually decreased (in this example).

Change in percent awarded firsts:
University A, student type h:  100% --> 90%
University A, student type i:   no data
University B, student type h:   no data
University B, student type i:   50% --> 25%

This provides the key to specification of a suitable (statistical) measurement model:

• measure the changes at the lowest level of aggregation possible;
• then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.

In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university.  (In the real OfS data, there’s a lower level of aggregation that will be more appropriate, since different degree courses within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)

In Statistics this kind of analysis is often called a stratified analysis.  The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified strata, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).

In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities).  In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:

• in University A, student type h saw their percentage of Firsts reduced by 10% (from 100% to 90%);
• in University B, student type i saw their percentage of Firsts reduced by 50% (from 50% to 25%).

That’s all the information in the data, about changes in the rate at which Firsts are awarded.  (It was a deliberately small dataset!)
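
The stratum-level changes can be tabulated with a few lines of R.  This is a sketch only: the data frame `toy` and its column names are introduced here for illustration, with the counts typed in from the tables above.

```r
## Toy data: counts of Firsts and Other by year, university and student type
toy <- data.frame(
    year       = rep(c("2010-11", "2016-17"), each = 4),
    university = rep(c("A", "A", "B", "B"), times = 2),
    type       = rep(c("h", "i"), times = 4),
    firsts     = c(1000, 0, 500, 500,  1800, 0, 0, 500),
    other      = c(0, 1000, 500, 500,   200, 0, 0, 1500)
)
toy$rate <- toy$firsts / (toy$firsts + toy$other)    # NaN where a stratum is empty

## One row per stratum, with the rate of Firsts in each year side by side
wide <- merge(toy[toy$year == "2010-11", c("university", "type", "rate")],
              toy[toy$year == "2016-17", c("university", "type", "rate")],
              by = c("university", "type"), suffixes = c("_2010", "_2016"))
wide$relative_change <- wide$rate_2016 / wide$rate_2010 - 1
print(wide)
```

Strata with no students in 2016-17 give `NaN`, which corresponds to the “no data” entries above; the two remaining strata (type h in University A, type i in University B) show relative changes of -10% and -50% respectively.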

If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow.  To some extent this is arbitrary, and the choice of a combination method ought to depend on the purpose of such a sector-wide measure and (especially) on the interpretation desired for it.  I might find time to write more about this later in Part 3.

For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method.  The result obtained by that method was a sector-wide increase of 7.5 percentage points in the rate at which Firsts are awarded, which is plainly misleading in the face of data that show substantial decreases in both universities.  Whilst I do not much like the OfS Report’s “compare with 2010” approach, it does have the benefit of transparency and in my “toy” example it is easy to apply to the stratified analysis:

    2016-17          Expected Firsts       Actual
                     based on 2010-11
    University A         2000               1800
    University B         1000                500
    ---------------------------------------------
    Total                3000               2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account.  (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)
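
The arithmetic behind that table can be sketched as follows.  The object and column names are introduced here for illustration only; the 2010-11 rates and the 2016-17 counts are those of the toy data above.

```r
## 2010-11 rates of Firsts, and 2016-17 student numbers and Firsts, by stratum
strata <- data.frame(
    university  = c("A", "A", "B", "B"),
    type        = c("h", "i", "h", "i"),
    rate_2010   = c(1.0, 0.0, 0.5, 0.5),
    n_2016      = c(2000, 0, 0, 2000),
    firsts_2016 = c(1800, 0, 0, 500)
)

## Expected 2016-17 Firsts if each stratum had kept its 2010-11 rate
strata$expected <- strata$rate_2010 * strata$n_2016

## Totals by university, and then for the whole sector
by_univ <- aggregate(cbind(expected, firsts_2016) ~ university, data = strata, FUN = sum)
total_expected <- sum(by_univ$expected)
total_actual <- sum(by_univ$firsts_2016)
relative_decrease <- (total_expected - total_actual) / total_expected
print(by_univ)
print(relative_decrease)
```

This gives 3000 expected and 2300 actual Firsts across the two universities, and hence the relative decrease of 700/3000.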

As before, I have made the full R code available (as an update to my earlier R Markdown document).  For those who don’t use R, I attach here also a PDF copy of that: grade-inflation-example.pdf

### 2.2  Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.

The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same kind of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).

This will be developed further in Part 3, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been.  Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above.  With notation

• u for providers (universities); u is either A or B in the toy example
• t for type of student; t is either h or i in the toy example
• y for years; y is either 2010-11 or 2016-17 in the toy example
• $\pi_{uty}$ for the probability of a First in year y, for students of type t in university u

the logistic regression model corresponding to the analysis above is

$\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{ut} + \beta_{uy}$.

This is readily generalized to situations involving more strata (more universities u and student types t, and also degree-courses within universities).  There were just 4 stratum parameters $\alpha_{Ah},\alpha_{Ai}, \alpha_{Bh}, \alpha_{Bi}$ in the above example, but more strata are easily accommodated.

The model is readily generalized also, in a similar way, to more than 2 years of data.

For comparison, the corresponding logistic regression model as used in the OfS report looks like this:

$\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{t} + \beta_{uy}$.

So it is superficially very similar.  But the all-important term $\alpha_{ut}$ that determines the necessary strata within universities is missing from the OfS model.

I will aim to flesh this out a bit in a new Part 3 post within the next few days, if time permits.  For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!)  Naming things is not my strong point, unfortunately!  But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.

(The key word there is “stratified”.)

To cite this entry:
Firth, D (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at

### Office for Students report on “grade inflation”

2019-01-02

Update, 2019-01-07: There’s now also Part 2 of this blog post, for those who are keen to know more!

Chris Parr, a journalist for Research Professional, asked me to look at a recent report, Analysis of degree classifications over time: Changes in graduate attainment.  The report was published by the UK government’s Office for Students (OfS) on 19 December 2018, along with a headline-grabbing press release:

The report uses a statistical method — the widely used method of logistic regression — to devise a yardstick by which each English university (and indeed the English university sector as a whole) is to be measured, in terms of their tendency to award the top degree classes (First Class and Upper Second Class honours degrees).  The OfS report looks specifically at the extent to which apparent “grade inflation” in recent years can be explained by changes in student-attribute data available to OfS (which include grades in pre-university qualifications, and also some other characteristics such as gender and ethnicity).

I write here as an experienced academic, who has worked at the University of Warwick (in England) for the last 15 years.  At the end, below, I will briefly express some opinions based upon that general experience (and it should be noted that everything I write here is my own — definitely not an official view from the University of Warwick!)

My specific expertise, though, is in statistical methods, and this post will focus mainly on that aspect of the OfS report.  (For a more wide-ranging critique, see for example https://wonkhe.com/blogs/policy-watch-ofs-report-on-grade-inflation/)

Parts of what I say below will get a bit technical, but I will aim to write first in a non-technical way about the big issue here, which is just how difficult it is to devise a meaningful measurement of “grade inflation” from available data.  My impression is that, unfortunately, the OfS report has either not recognised the difficulty or has chosen to neglect it.  In my view the methods used in the report are not actually fit for their intended purpose.

## 1.  Analysis of an idealized dataset

In much the same way as when I give a lecture, I will aim here to expose the key issue through a relatively simple, concocted example.  The real data from all universities over several years are of course quite complex; but the essence can be captured in a much smaller set of idealized data, the advantage of which is that it allows a crucial difficulty to be seen quite directly.

### An imagined setup: Two academic years, two types of university, two types of student

Suppose (purely for simplicity) that there are just two identifiable types of university (or, if you prefer, just two distinct universities) — let’s call them A and B.

Suppose also (purely for simplicity) that all of the measurable characteristics of students can be encapsulated in a single binary indicator: every student is known to be either of type h or of type i, say.  (Maybe h for hardworking and i for idle?)

Now let’s imagine the data from two academic years — say the years 2010-11 and 2016-17 as in the OfS report — on the numbers of First Class and Other graduates.

The 2010-11 data looks like this, say:

         University A           University B
        Firsts  Other          Firsts  Other
    h     1000      0      h      500    500
    i        0   1000      i      500    500

The two universities have identical intakes in 2010-11 (equal numbers of type h and type i students).  Students of type h do a lot better at University A than do students of type i; whereas University B awards a First equally often to the two types of student.

Now let’s suppose that, in the years that follow 2010-11,

• students (who all know which type they are) learn to target the “right” university for themselves
• both universities A and B tighten their final degree criteria, so as to make it harder (for both student types h and i) to achieve a First.

As a result of those behavioural changes, the 2016-17 data might look like this:

         University A           University B
        Firsts  Other          Firsts  Other
    h     1800    200      h        0      0
    i        0      0      i      500   1500

Now we can combine the data from the two universities, so as to look at how degree classes across the whole university sector have changed over time:

    Combined data from both universities:

               2010-11                    2016-17
            Firsts  Other              Firsts  Other
    h         1500    500      h         1800    200
    i          500   1500      i          500   1500
            -------------              -------------
    Total     2000   2000      Total     2300   1700
    %           50     50                57.5   42.5

### The conclusion (not!)

The last table shown above would be interpreted, according to the methodology of the OfS report, as showing an unexplained increase of 7.5 percentage points in the awarding of First Class degrees.

(It is 7.5 percentage points because that’s the difference between 50% Firsts in 2010-11 and 57.5% Firsts in 2016-17.  And it is unexplained — in the OfS report’s terminology — because the composition of the student body was unchanged, with 50% of each type h and i in both years.)

But such a conclusion would be completely misleading.  In this constructed example, both universities actually made it harder for every type of student to get a First in 2016-17 than in 2010-11.
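
The aggregate calculation, and the fact that the mix of student types is unchanged, can be reproduced with a short R sketch.  The data frame `toy` and its column names are introduced here purely for illustration, with the counts typed in from the tables above.

```r
## Toy data: counts of Firsts and Other by year, university and student type
toy <- data.frame(
    year       = rep(c("2010-11", "2016-17"), each = 4),
    university = rep(c("A", "A", "B", "B"), times = 2),
    type       = rep(c("h", "i"), times = 4),
    firsts     = c(1000, 0, 500, 500,  1800, 0, 0, 500),
    other      = c(0, 1000, 500, 500,   200, 0, 0, 1500)
)
toy$n <- toy$firsts + toy$other

## Sector-wide percentage of Firsts in each year (universities combined)
agg <- aggregate(cbind(firsts, other) ~ year, data = toy, FUN = sum)
agg$pct_firsts <- 100 * agg$firsts / (agg$firsts + agg$other)
print(agg)

## Number of students of each type in each year (the "composition")
mix <- xtabs(n ~ type + year, data = toy)
print(mix)
```

The combined percentages are 50% and 57.5%, while the number of students of each type is 2000 in both years; that is why the aggregate method labels the whole 7.5-point increase as “unexplained”.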

### The real conclusion

The constructed example used above should be enough to demonstrate that the method developed in the OfS report does not necessarily measure what it intends to.

The constructed example was deliberately made both simple and quite extreme, in order to make the point as clearly as possible.  The real data are of course more complex, and patterns such as shifts in the behaviour of students and/or institutions will usually be less severe (and will always be less obvious) than they were in my constructed example.  The point of the constructed example is merely to demonstrate that any conclusions drawn from this kind of combined analysis of all universities will be unreliable, and such conclusions will often be incorrect (sometimes severely so).

### That false conclusion is just an instance of Simpson’s Paradox, right?

Yes.

The phenomenon of analysing aggregate data to obtain (usually incorrect) conclusions about disaggregated behaviour is often (in Statistics) called ecological inference or the ecological fallacy.  In extreme cases, even the direction of effects can be apparently reversed (as in the example above) — and in such cases the word “paradox” does seem merited.

### Logistic regression

The simple example above was (deliberately) easy enough to understand without any fancy statistical methods.  For more complex settings, especially when there are several “explanatory” variables to take into account, the method of logistic regression is a natural tool to choose (as indeed the authors of the OfS report did).

It might be thought that a relatively sophisticated tool such as logistic regression can solve the problem that was highlighted above.  But that is not the case.  The method of logistic regression, with its results aggregated as described in the OfS report, merely yields the same (incorrect) conclusions in the artificial example above.

For anyone reading this who wants to see the details: here is the full code in R, with some bits of commentary.

## 2.  So, what is a better way?

The above has shown how the application of a statistical method can result in potentially very misleading results.

Unfortunately, it is hard for me (and perhaps just as hard for anyone else?) to come up with a purely statistical remedy — i.e., a better statistical method.

The problem of measuring “grade inflation” is an intrinsically difficult one to solve.  Subject-specific Boards of Examiners — which is where the degree classification decisions are actually made within universities — work very hard (in my experience) to be fair to all students, including those students who have graduated with the same degree title in previous years or decades.  This last point demands attention to the maintenance of standards through time.  Undoubtedly, though, there are other pressures in play — pressures that might still result in “grade inflation” through a gradual lowering of standards, despite the efforts of exam boards to maintain those standards.  (Such pressures could include the publication of %Firsts and similar summaries, in league tables of university courses for example.)   And even if standards are successfully held constant, there could still be apparent grade-inflation wherever actual achievement of graduates is improving over time, due to such things as increased emphasis on high-quality teaching in universities, or improvements in the range of options and the information made available to students (who can then make better choices for their degree courses).

## 3.  A few (more technical) notes

a.  For the artificial example above, I focused on the difficulty caused by aggregating university-level data to draw a conclusion about the whole sector.  But the problem does not go away if instead we want to draw conclusions about individual universities, because each university comprises several subject-specific exam boards (which is where the degree classification decisions are actually made).  Any statistical model that aims to measure successfully an aspect of behaviour (such as grade inflation) would need to consider data at the right level of disaggregation — which in this instance would be the separate Boards of Examiners within each university.

b.  Many (perhaps all?) of the reported standard errors attached to estimates in the OfS report seem, to my eye, unrealistically small.  It is unclear how they were calculated, though, so I cannot judge this reliably.  (A more general point related to this: It would be good if the OfS report’s authors could publish their complete code for the analysis, so that others can check it and understand fully what was done.)

c.  In tables D2 and D3 of the OfS report, the model’s parameterization is not made clear enough to understand it fully.  Specifically, how should the Year estimates be interpreted — do they, for example, relate to one specific university?  (Again, giving access to the analysis code would help with understanding this in full detail.)

d.  In equations E2 and E3 of the OfS report, it seems that some independence assumptions (or, at least, uncorrelatedness)  have been made.  I missed the justification for those; and it is unclear to me whether all of them are indeed justifiable.

e.  The calculation of thresholds for “significance flags” as used in the OfS report is opaque.  It is unclear to me how to interpret such statistical significance, in the present context.

## 4.  Opinion

This topic seems to me to be a really important one for universities to be constantly aware of, both qualitatively and quantitatively.

Unfortunately I am unconvinced that the analysis presented in this OfS report contributes any reliable insights.  This is worrying (to me, and probably to many others in academia) because the Office for Students is an important government body for the university sector.

It is especially troubling that the OfS appears to base aspects of its regulation of universities upon such a flawed approach to measurement.  As someone who has served on many boards of examiners, at various different universities in the UK and abroad (including as an external examiner when called upon), I cannot help feeling that a lot of careful work by such exam boards is in danger of simply being dismissed as “unexplained”, on the basis of some well-intentioned but inadequate statistical analysis.  The written reports of exam boards, and especially of the external examiners who moderate standards across the sector, would surely be a much better guide than that?