Popular Science journals and the general public as well, don’t like negative results. Although such results may be valuable for science, people prefer ideas which promise fast and tremendous success. “Drink beetroot juice, and you run faster and longer!”, “Do our special TRX programme, and you gain incredible core strength!  “Follow this diet and..!” Very often such advertisement followed by another cliché – “Scientific studies proved this.”

However, how, actually, scientific research proves something? Well, it uses a powerful weapon — Statistics.
Statistics is indeed a handy tool for science. Maybe because Nature itself is based on probability, Statistics, whose “…inferences are made under the framework of probability theory”  (Wikipedia), is a robust method to describe natural events. However, when using inappropriately, Statistics may support wrong notions and create confusion. 

 

Is statistical significance always significant?

Most of us who studied Sports Science at universities were taught to use statistical significance. In most studies and experiments, scientists explore the probability that their intervention makes a difference. However, they always have to admit that this difference may happen by chance. So, before the experiment, they have to set the acceptable level of probability that a positive result happens accidentally – a. Then they do a research and obtain study’s probability of by-chance positive result — P. This value of P must be below the pre-determined level for acceptance of study as “statistically significant.” Otherwise, an idea should be rejected.

However, you can use some manipulations for changing P-value. For example, it significantly depends on sample size (how many participants you have tested). Martin Buchheit showed how the inclusion of just two people into the group of twelve changed research conclusion on the opposite, even though critical parameters, such as mean and standard deviation, remained the same. So, if you have twelve people then: “scientists proved that… the intervention had no effect”. However, include just two more participants and: “scientists found a significant effect of… on performance”.
A temptation for scientists, who are desperate to find something to publish is too big, and sometimes they cannot resist. Then we have a published article which begins to be cited by other authors and – vu-a-la ! Now we have “widely accepted scientific opinion” which may be followed by coaches and athletes.

Other disadvantages of P-value method are that it is not suitable for testing individual athlete and it does not show the magnitude of changes (though you can do it with additional analysis). Perhaps scientists are most interested in general tendencies, however, coaches more care about the extent of changes in an individual athlete. Did he/she improve? How sure may we be that he/she improved? Is this improvement big enough to influence performance? Until recently, I thought that the answers to these questions might be only based on coach’s experience. However, in works of Martin Buchheit and William Hopkins, I have found a statistical method which can give scientific support for our intuition. It is Magnitude-based Inferences (MBI). I would like to describe it in this post. I am not specialist in statistics, so perhaps my understanding is neither comprehensive nor entirely correct. Additionally, it is necessary to note that not all statisticians agree with benefits of MBI and, of course, P-value methods should not be rejected. MBI is just a useful addition to scientific arsenal of sports practitioners. So it is why I would like to share my understanding of this approach with colleagues. I will try to keep the explanation as simple as possible; nevertheless, some math is unavoidable.

Some necessary stuff

A few things we need  to know before start:
Standard Deviation (SD) is a measure of variability of data. For example, you may have test results: 1;2;3;4;5;6;7;8;9;10 and 4;4;5;5;5;6;6;6;7;7. It is obvious that the first set of data is more dispersed than the second although their means are the same (5.5). This pattern may tell practitioner that first data set was from athletes with completely different abilities or the test itself is hugely unreliable. SDs values reflect this. For the first set, it is 3.03 whereas for the second 1.08. Mathematically, SD is a square root of variance (average of squared distances from the mean). Another useful property of SD is that it allows quantifying the distribution of chances for a particular range of values to occur. For example, values between -1SD and +1SD (3.6 and 8.4 on picture 1) have 68% probability to happen (for normal distribution). There are some good explanations of SD on Internet and, trust me, it is not too complicated.

Coefficient of variation ( CV) is a measure of relative variability: SD/Mean. Very often expressed as SD/Mean x 100%.

 

Picture 1. Normal distribution.

Are you sure about test results?

At first glance, things look very simple. We test an athlete and, if his/her result is better than it was in the previous test, we can conclude that the athlete improved. The second question follows immediately: how sure we are in our conclusion?  The result may be just a consequence of the natural variations in athlete performance and test conditions. That leads to, so-called, Error of Measurement or Typical Error (TE). Thus, keeping in mind TE issue, we need to address the last question: what is the smallest progress in the test that can give us a hope for an improvement in real competitions? This is the Smallest Worthwhile Change (SWC).

SWC

Let’s start with the last question. List of SWCs for different physical tests already exists, thanks to work of mentioned above authors and other scientists. However, it is always good to understand in general the idea behind this guidance.
First of all, SWC may be derived from variations in performance. To that, you need to know CV for chosen event. You can find it by yourself analysing performance of your athlete in different competitions. For example, athlete showed results 10.1; 10.1; 9.9; 10.0; 10.1; 9.9; 9.9 seconds in 100 meters sprint during seven events. His average time is 10.0 sec and SD of data is 0.1 sec. That gives us CV of 0.1/10=0.01 or 1%.
Alternatively, we may take a typical CV value for the similar population of athletes.
Using observation and simulations, William Hopkins already did some work for us. He made a list of CVs for different events and suggested SWC as 0.3 CV. Thus, for our example, improvement of 0.03 sec (0.3% of mean) is SWC. It is necessary to note that this guidance is for solo events (track and field, bike, swimming, etc.). Changes of 0.3; 0.9; and 1.6 of CV Hopkins suggested considering as small, medium and large respectively.

A situation may be a little bit more complicated in team sports where there is no apparent connection between test and performance. For them, a suggestion is to use team data and calculate SWC based on Cohen d coefficient. You need to find a difference between player result in the first  and second tests and divide it by an overall team (between-subject) SD in these tests. So we compare changes for an individual player against all team variation. Improvements equal to 0.2 is SWC, whereas larger than 0.5 and 0.8 are medium and large respectively. Alternatively, SWC can be defined based on common sense and experience. For example, Martin Buchheit suggested considering 1% improvement in 20 m sprint test meaningful for a footballer, because 20 cm is a minimum distance player needs to be ahead of opponent for winning the ball.
So, don’t be afraid to set your own, logically grounded SWC, however, remember that this should be done before the test.

Well, now we define SWC; nevertheless, it is still not enough information to say how sure we are that the test showed changes. We need to deal with TE of the test. First of all, we have to find it.

Typical error.

When testing an athlete, we are interested in finding his/her “true result” in this test. Theoretically it can be done by testing an athlete infinitely many times and computing average of these tests.  It is obvious that such design is unreal and in reality our measurements are not precise . Thus it is necessary to take into account a random, unavoidable and unpredictable error of measurement in the test which is always present. It is why it is called “typical”. TE of the test is a “noise” which spoils the signal (real result); thus it is always better if the noise would be less than SWC. We can reduce noise by testing an athlete repeatedly over the short period and take the average of these tests. In this case variances responsible for TE tend to cancel each other. TE of test  is reduced by a factor √N (N=number of measures). For example, the average of two tests reduced TE by 30% whereas four tests halves TE. Because TE is a measure of reliability of a chosen test, we have to have a clear understanding about the amplitude of TE in our test for making a reasonable guess how close to ideal “true result” may be our data.

Example:

We hypothesised that golf putting is a reliable test to measure person’s hand-eye coordination (this may be stress-resistance or whatever). Note, that I am not talking about the validity of the test (is it really suitable for the declared task) but rather is it reliably reproducible. It means that we are interested to find test TE. Our participants are not golf players, so their technique and experience are equal. We give to five subjects for ten strokes each, and they try to score as many as possible. We have two ways to measure the reliability of the test. First is to give one person a few tests (5-8 ten-strokes series)  and to look how his results vary between these tests. Basically, SD of these tests is TE. It means that test results vary due to random variations in athlete performance. However, this may be not exactly true because variations can happen due to participant learns the test and has rapid improvement at the beginning. That fact may bias TE estimation. Thus Hopkins suggested another way: to test group of subjects two times and to find SD of changes between the tests. As I understand, he assumed that SD of changes is relatively independent of learning because improvements due to learning, over a short period,  are approximately the same for each participant. Most part of the deviation  then happens due to TE.  So let’s see how our participants performed (picture2).

 

Picture 2. Tests results in golf-putting task.

 

We can see that results vary between subjects and between two test for most participants (except Polly).

Variations between subjects combine TE of the test and difference in their abilities and we need to separate these two factors.
Additionally, participants have differences between two tests but we cannot be sure to what extent is it due to learning or TE?
As was discussed above, SD of changes may eliminate this problem. So we have:
Mean (of changes) =0
SD (of changes)= 2.24
We already see that changes are quite spread out.
Now we can calculate TE:

TE=SD/√2

Why we divided SD on √2? It follows from the assumption that variances which are responsible for TE of each test transfer into test’s differences and combined. Thus variance for the test’s differences is equal to the sum of TE-variances of two tests.
Because SD is a square root of variance now we may write following:
Squared SD of differences (A) is a sum of squared SD which is responsible for TE of each test (B).
A²=B²+B²   A²=2B²   B²=A²/2   B= A/√2.
This means that part of the test SD which is responsible for TE is equal to SD of individual changes between two tests divided by 1.4. This conclusion will be useful for us in the future.

TE= SD (of changes) /1.4

In our example: 2.24/1.4=1.6
So, TE in this test, for this type of subjects is 1.6 shots.

From a practical point of view this means that improvement after training of 1-2 good strokes from 10 cannot be considered as reliable. Or, if a difference between two participants is just 1-2 good strokes, we may not be sure that they definitely have different abilities in putting.

Test analysis

So now we have all necessary ingredients for analysing the results of testing an individual athlete statistically. These ingredients are TE of the chosen test, SWC and observed changes. The final piece of math allows us to find out the probability that the test result reflects improvement, trivial changes, or even impairment.

Let’s take an example:
We test a player before pre-season camp and after that and found that he/she improved by 1.5%. We know from the literature or our estimation that SWC for this test is 0.5% and TE is 1%.
Despite the fact that observed change (OC) shows improvement, and it is higher than TE (which is good), the probability of trivial changes (no changes) and even impairment still exists.
The probability that “true” result lies between -0.5 and 0.5 % is the probability of trivial change.
Below -0.5% — harmful changes
Over 0.5% — beneficial changes

So, how exactly probability is distributed in this example?

First, we assume that probability has a normal distribution (bell shape curve and even distribution of chances astride OC.
Next, we need to transfer TE back to SD because we need SD for calculations:
1 x √2 =1×1.4  SD=1.4.
From now I am going to use the word “unit” instead of “percent” while talking about changes because I don’t want to confuse percent of changes with a percent of probability: 1 unit of changes=1% of changes.

Picture 3. Distribution of probability for changes.

Then we can use Z table which shows a distribution of chances for “true result” to occur somewhere within our normal distribution. These chances are distributed relative to the distance (expressed as a fraction of SD) from the mean (in this case OC). Further from OC – fewer chances that true test result is there. Now look at the picture 3.

Point -0.5 is 2 units from point 1.5 (OC) That is a distance of 1.4 SD (2/1.4=1.4)

In accordance to Z-table, beyond a point -1.4 SD chances to meet “true value” are 8 %. So, 8% is the probability that our athlete, actually, decreased his performance.

Point 0.5 is 1 unit from 1.5 That is 0.7 SD (1/1.4=0.7)
Below -0.7 SD are 24 % of chances. Thus between -0.5 and 0.5 are 24-8= 16 % chances of trivial changes.

All changes that are above 0.5 are chances for improvement. This gives us: 100-24=76%

Conclusion: We have 76% confidence that changes are beneficial and athlete improved.

There is another, more simple, way to report the results. Hopkins suggested using “Likely limits” . This means that athlete “true result” most likely lies somewhere inside these limits. The borders of this range are defined as OC±TE. How is likely “true value” inside? Well, if TE is 0.7 SD (1/1.4) then inside OC±0.7 SD are 52% chances ( Z table).

If whole limit is inside, let’s say, “beneficial” area we can call result “clear beneficial”. However if limit spreads between two or three areas then it would be “unclear”.
In our example, limits are 1.5±1. The nearest low border is point 0.5 from where trivial changes begin. Our low limit is 1.5-1=0.5 which is precisely on the border of trivial changes. So we can conclude that likely limits in this test are entirely inside the beneficial area, and the result is “clearly beneficial” (picture 4).

Picture 4. Likely limits

Note that the lower is TE more likely both limits are in the same area.

Testing a team

Basically, the same approach may be applied to a team testing. Let’s say we have a team in pre-season camp and want to know if there is an overall improvement in aerobic conditioning. So we tested them at the beginning and after the training period. Variations in these two sets of data are due to:

1. Training

2. Differences between players

3. Random errors – TE

We may find team SD from the tests and calculate SWC as 0.2 SD. Difference between means of the tests gives us observed changes. Now, if we know TE from literature, we have all necessary to perform MBI analysis as was described above. It is a little bit more complicated if we want to find TE ourself. I use one-way repeated measures ANOVA technique for that purpose. It allows separation of variations in our tests and finding which part of them is due to TE.

Conclusion

Despite all these look a little bit complicated, there is nothing that may prevent thoughtful readers, whether they are athletes, coaches and, especially, sports scientists, to understand the idea. Math is at the level of squares and square roots, and stats is at the basic level as well (probably except ANOVA). For those, who don’t want or have no time to make calculations, there is Excel spreadsheet freely available. However, it is always useful to understand the method, at least in general. Thus, I strongly advise doing so. In my opinion, this is the case when, Statistic instead of being something scaring and vague for practitioners, can provide a useful tool for a scientific approach to performance.

Literature

1. Bartlett, J. W., and C. Frost. “Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables.” Ultrasound in obstetrics & gynecology 31.4 (2008): 466-475.

2. Batterham, A. M., & Hopkins, W. G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50-57.

3. Buchheit, M. (2016). The Numbers Will Love You Back in Return-I Promise. Int J Sports Physiol Perform, 11(4), 551-554.

4. Errors of measurement, theory, and public policy. https://www.ets.org/Media/Research/pdf/PICANG12.pdf

5. Interpreting the result of fitness testing (Pyne, 2018).

6. Hopkins, W. G. (2000). Measures of reliability in sports medicine and science. Sports Medicine, 30(1), 1-15.

7. Magnitudes matter more than beetroot juice https://sportperfsci.com/magnitudes-matter-more-than-beetroot-juice/

8. Want to see my report coach? https://martin-buchheit.net/2017/02/18/want-to-see-my-report-coach/

 

 

Peter

Written by: Peter