This new research paper, by a group led from the University of Stirling, made the national news in the UK last week:
Di Virgilio, T.G., et al., Evidence for Acute Electrophysiological and Cognitive Changes Following Routine Soccer Heading, EBioMedicine (2016), http://dx.doi.org/10.1016/j.ebiom.2016.10.029
My declaration of interest:
I will write some notes below about my reading of the paper. But first I should make clear that I am not a completely disinterested scientist when it comes to this topic. For quite some years now, my son and I have been avid supporters of West Bromwich Albion FC, where calls for better research on the long-term effects of heading footballs have been made following the death of former Albion and England centre forward Jeff Astle in 2002 (aged 59). Jeff Astle was a prolific goalscorer for West Brom, well known for his outstanding ability as a header of the ball. The Coroner’s verdict in 2002 was “death by industrial disease”, and his report on Jeff Astle’s death included the comment that “The trauma caused to the front of his brain [by heading the ball] is likely to have had a considerable effect on the cause of death.” There was even an adjournment debate in the House of Commons on this subject, following Jeff Astle’s death.
The background to my notes below is that I strongly support the case for better research on the long-term effects of heading a football: it seems clear that not enough is known about the health risks, nor about such questions as whether heading the ball is safer now that footballs have become lighter.
Some of the news headlines from last week:
Heading footballs ‘affects memory’ — BBC Scotland, 2016-10-24
Heading footballs affects memory and brain function, study finds — ITV News, 2016-10-24
Study finds heading a football has immediate effect on the brain — The Guardian, 2016-10-24
Heading a Soccer Ball Affects Memory Function — Wall Street Journal video, 2016-10-24
Calls for more research as football headers linked to memory loss — Sky News, 2016-10-24. (This features a short video clip of Dawn Astle, Jeff’s daughter, talking persuasively about the need for thorough, longitudinal research.)
Here are the original press release and an online article by some of the authors of the original research paper:
Heading a football causes instant changes to the brain — NIHR press release, 2016-10-24
How we discovered that heading a football causes impairment of brain function — The Conversation, 2016-10-24
And on the same day, the story was reported also on the public news website of the UK’s National Health Service:
Heading footballs may cause short-term brain changes — NHS Choices, 2016-10-24
My reading of the original research paper
The research reported in the paper is a small, before-and-after experiment. Data are analysed from 19 amateur footballers who took part in a controlled heading-practice session, with various measurements made before and after the session (immediately before, immediately after, and at three later time-points up to 2 weeks after).
The paper’s main findings are based on before-to-after differences in three of the measurements made, these three having apparently achieved statistical significance in the experiment, with reported p-values less than the pre-assigned threshold of 0.05. The three “statistically significant” differences found were:
- The “primary outcome measure cSP” — a measure of corticomotor inhibition — was found to have increased for 14 of the 19 participants when measured immediately after the heading practice session. The reported p-value, for the apparent increase in response time that was seen on average, is 0.049. [Section 3.1 of the paper]
- The “Spatial Working Memory” (SWM) test scores showed an increased error rate on average (on the log scale the change was from 0.79 before to 1.00 after the heading session). The reported p-value for this apparent difference is 0.03. [Section 3.2 of the paper]
- The “Paired Associates Learning” (PAL) test scores also showed an increased error rate on average (on the log scale the change was from 0.38 before to 0.65 after). The reported p-value for this apparent difference is 0.007. [Section 3.2 of the paper]
How to interpret those apparent effects and their p-values?
I was prompted to think a bit about this by the reported p-value of 0.049 for the primary outcome measure: that’s only just less than the pre-assigned threshold of 0.05. So if it’s agreed that p equal to 0.05 is the largest value that can reasonably count as “statistically significant” evidence, the value of p=0.049 found for this apparent increase in cSP time should probably be labelled “almost insignificant”! (This is in agreement with the “14 out of 19” finding mentioned already above, for the number of subjects whose cSP time had shown any increase at all; a simple sign test is enough to tell us that 14 out of 19 is not quite significant at the 0.05 level.)
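The sign-test calculation is easy to reproduce. Here is a minimal sketch in Python (standard library only), computing the exact two-sided sign test for a 14-out-of-19 split; the 14-of-19 count is taken from Section 3.1 of the paper:

```python
from math import comb

def sign_test_two_sided(k: int, n: int) -> float:
    """Exact two-sided sign test: probability, under a fair coin,
    of a split at least as extreme as k successes out of n."""
    m = max(k, n - k)
    tail = sum(comb(n, i) for i in range(m, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_two_sided(14, 19), 4))  # 0.0636: not significant at 0.05
```

The two-sided p-value of about 0.064 is indeed just above the conventional 0.05 threshold.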
But was 0.05 a reasonable threshold to use, anyway? A computed p-value of 0.05, or even 0.03, should really be considered very weak evidence when quite a large number of different measurements are being recorded and tested, as was the case in this study. As Table 2 of the paper shows, there were 8 different measurements taken, each done on four occasions after the heading session: that’s a lot of chances to find some “significant” differences. The much-used threshold of p<0.05, which is designed to limit the chance of a spuriously significant finding to 5% when conducting a single test, is much more likely to throw up spuriously significant results when several hypotheses are being tested. A crude Bonferroni correction based on 8 tested differences, for example, would result in the threshold of 0.05 being reduced to 0.05/8 = 0.006, as a much more stringent criterion to apply in order to be sure that the chance of a spuriously significant finding is still less than 5%.
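To make that arithmetic concrete, here is a short Python sketch; the three p-values are those reported in Sections 3.1 and 3.2 of the paper, and the count of 8 tested measurements comes from Table 2:

```python
alpha = 0.05  # the conventional significance threshold
m = 8         # number of different outcome measures tested (Table 2)

# Crude Bonferroni correction: divide the threshold by the number of tests
bonferroni_threshold = alpha / m  # 0.00625

# Reported p-values for the three "significant" findings
for name, p in [("cSP", 0.049), ("SWM", 0.03), ("PAL", 0.007)]:
    verdict = "significant" if p < bonferroni_threshold else "not significant"
    print(f"{name}: p = {p} is {verdict} after Bonferroni correction")
```

On this crude correction, none of the three reported p-values, not even the PAL value of 0.007, falls below the adjusted threshold of 0.00625.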
Of the paper’s three main findings, then, only the third one — the increased average error rate in the PAL test scores — seems to be an apparent effect that might demand further attention. The paper mentions [in Section 3.2] that an increased error rate in the PAL test is compatible with reduced long-term memory function. (But note that if we do take the route of a Bonferroni correction, to allow for the fact that 8 different measurements were tested — while still neglecting the number of occasions on which post-session measurements were made — the reported p-value of 0.007 still would fail to reach significance at the traditional 5% level.)
Some methodological quibbles and questions
Q1. The big one: Causality
The press release mentioned above, and hence the media coverage of this research paper, reports an apparently causal link found between routine heading of footballs and outcomes such as long-term memory function. Such a causal link does seem rather plausible, a priori. But the research reported in this paper does not (for me, at any rate) firmly establish that cause. The study design leaves open the possibility of an alternative explanation (for an increase in PAL test error scores, for example). The paper’s authors allude to this problem in their Discussion section [Section 4 of the paper], where they write: “Future work should include a control activity such as body movement without head impact”. I do agree; and careful design of the control regime is essential if causality is to be established compellingly.
What sort of alternative explanation(s) might there be? Well, the problem is that heading the football was not the only thing that happened to each of the 19 experimental subjects between the first two sets of measurements. Some other things that might conceivably have produced an effect are:
- the passing of time (e.g., time since breakfast?)
- the order in which measurements were taken (the research paper is not altogether clear on this, actually — there seem to be conflicting statements in Section 2.2 there)
- the thrill of taking repeated shots at a goal (which might have been just as great had the shots been volley-kicks instead of headers?)
I am not suggesting here that any of these possible alternative causes is the reality; only that if we want to establish that heading the ball is a cause of something, then other potential causes must be eliminated as part of the study design or through external scientific knowledge.
Q2. Missing data?
In Section 2.1 of the paper it is mentioned that there were originally 23 study participants recruited. This got reduced to 19 for the analysis, because “Data from one participant could not be analyzed and three more participants withdrew from the study for personal reasons”. It would have been good to have some more detail on this. In particular:
- what was it about one participant’s data that meant they could not be analyzed?
- at what point in the study did each of the other three participants withdraw?
Q3. Size of effect, misleadingly reported?
Update, 2016-11-04: I have now heard back from one of the paper’s authors about this question. Dr Magdalena Ietswaart has kindly told me that the 67% figure “is based on raw scores rather than log transformed values”. In which case, my conjecture below about the origin of that figure was false — and for that reason I have now struck through it, and I humbly apologise to the paper’s authors for having guessed wrongly about this particular point. (The 67% increase still does strike me as startling, though!)
As discussed above, of all the various effect sizes that are considered in the paper it is the increase in the PAL test error rate that might merit further attention, since that is the one effect for which the experimental evidence might be viewed as statistically significant. In the paper [Section 3.2], it is stated that the error score on the PAL task immediately after heading increased by 67%.
But this rather startling statement of the effect size — which appeared also in much of the press coverage, and indeed is the single number that prompted me to read the paper in full — appears to be wrong. If my calculations are correct, the increase found in the PAL test error rate is in fact more like 26% (which is still a fairly substantial increase, of course).

The reason for the discrepancy is that the 67% figure appears to have been calculated directly from Table 2 in the paper, where the increase in the PAL test error rate is measured by averaging the logarithms of the raw error rates. The quantity (0.65 − 0.38)/0.38, calculated from Table 2, is roughly 0.67 to within the effects of rounding error, i.e., it corresponds to a 67% increase in the logarithm of the error rate. But it makes no sense at all to calculate a percentage change in logarithms — the numerical value of such a quantity in the present context is completely meaningless. (This will be obvious to the reader who understands how logarithms work, but probably not otherwise! The key point, mathematically, is that while a percentage increase in an error rate — or in anything else, for that matter — does not depend at all on the units of measurement, the same is not true after logarithmic transformation. A percentage change in logarithms depends in a completely arbitrary way on the original units of measurement used, and so is meaningless.)

How did I get the 26% figure mentioned above as the correct percentage increase? Well, it’s actually something of a guess, because I do not have access to the raw data. (I have asked the paper’s authors if they will share the data with me; but right now I only have Table 2 to work from.) It’s probably not such a bad guess, though. I made the working assumption that the distributions underlying the figures in Table 2 are normal, which seems reasonable given the rationale for logarithmic transformation that is given in Section 2.8 of the paper.
With that assumption, the ratio of means is calculated (by standard properties of the log-normal distribution) as

exp(0.65 + 0.29^2/2) / exp(0.38 + 0.41^2/2) = 1.26.

I should emphasise, though, that this is largely guesswork. In particular, it is possible that I have misunderstood how the 67% figure, quoted in Section 3.2 of the paper, was arrived at.
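For completeness, here is that back-of-envelope calculation as a short Python sketch. The means and standard deviations on the log scale are read from Table 2 of the paper; the log-normal assumption is mine, as explained above:

```python
from math import exp

# PAL error score on the log scale, as reported in Table 2 of the paper:
mean_before, sd_before = 0.38, 0.41  # immediately before the heading session
mean_after,  sd_after  = 0.65, 0.29  # immediately after

# If log(score) ~ Normal(mu, sigma^2), then E[score] = exp(mu + sigma^2 / 2),
# so the ratio of mean raw scores is:
ratio = exp(mean_after + sd_after**2 / 2) / exp(mean_before + sd_before**2 / 2)
print(round(ratio, 2))  # 1.26, i.e. roughly a 26% increase
```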
Let me restate here what I said near the top: I strongly support the case for better research on the long-term effects of heading a football. With that in mind, I am disappointed that the paper that I have read and discussed here does not provide very convincing evidence to help our understanding.
One rather positive aspect of the paper is that it did get a lot of media coverage, which helped to bring the issue (and accompanying memories of Jeff Astle!) to wider public attention, at least for a day or two.
But, as Dawn Astle so eloquently argued in the Sky News interview that’s linked above: there is still a clear need for good research on the matter of “routine”, but potentially long-lasting, brain injuries in football.
© David Firth, November 2016
To cite this entry:
Firth, D (2016). About heading soccer balls, and memory loss. Weblog entry at URL https://statgeek.wordpress.com/2016/11/03/about-heading-soccer-balls-and-memory-loss/.