The Seven Sins of Experimental Design and Statistics

By Michelle Taylor

Reproducibility is fast becoming the ‘currency’ of credible, robust scientific research. If a result cannot be reproduced with the same dataset or methods, the validity of the original work is called into question. I am the resident Biostatistician in the Life and Health Sciences Faculty, and it is my job to ensure that biological research is reproducible, so that the University of Bristol fulfils its legal obligations in terms of the 3Rs of animal welfare and maintains its credible reputation. Here I describe seven reproducibility ‘sins’ that I commonly encounter, along with some solutions to help researchers improve the reproducibility of their work. They cover the experimental journey from design, through analysis, to final publication.

Experimental Design

1. Insufficient knowledge about the sample population

The first ‘sin’ is an insufficient knowledge of the sample population. When we take a sample of data, we need to ensure that it is suitably random and representative of the population we want to study. This means we need to understand much more about where the experimental samples have come from before they end up in the experiment. For example, zebrafish are cultured under a wide range of environmental parameters, which means that the effects produced in different labs could have been influenced by more than just the experimental treatment. Unknown sources of variation contribute to a lack of reproducibility between studies conducted at various times and places. The solution is to understand as much as possible about where the samples come from so that all relevant sources of variation can be identified and controlled.

2. Lack of randomization and blinding

The second ‘sin’ is inappropriate randomization and/or blinding. Randomly allocating treatments to independent samples, and keeping the researcher unaware of that allocation at the time of data collection, is the primary method of accounting for sources of bias during an experiment. Biases contribute to a lack of reproducibility because they obscure the true treatment effects and weaken the validity of the experiment. The solution is to think carefully about which experimental units should be randomly allocated, and also how, when, and by whom, to avoid biases in procedure, time, and personnel.
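As a rough illustration of the idea, the sketch below randomly allocates experimental units to treatments in equal numbers and hides the allocation behind opaque codes. It is only one possible approach, not a prescribed procedure: the function name allocate_blinded, the tank identifiers, the treatment names, and the seed are all hypothetical.

```python
import random


def allocate_blinded(unit_ids, treatments, seed):
    """Randomly allocate units to treatments (equal group sizes) and
    return two mappings: unit -> blinded code, and code -> treatment."""
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")

    rng = random.Random(seed)  # recorded seed so the allocation can be audited

    # Build a balanced list of treatment labels, then shuffle it.
    per_group = len(unit_ids) // len(treatments)
    assignments = [t for t in treatments for _ in range(per_group)]
    rng.shuffle(assignments)

    # Shuffle the codes too, so their order reveals nothing about treatment.
    codes = [f"S{i:03d}" for i in range(1, len(unit_ids) + 1)]
    rng.shuffle(codes)

    blinded_labels = dict(zip(unit_ids, codes))   # seen by the data collector
    allocation_key = dict(zip(codes, assignments))  # held by someone else
    return blinded_labels, allocation_key


# Hypothetical example: 12 tanks, two treatments.
tank_ids = [f"tank_{i}" for i in range(1, 13)]
blinded_labels, allocation_key = allocate_blinded(
    tank_ids, ["control", "treated"], seed=2024
)
print(blinded_labels["tank_1"])  # an opaque code, not a treatment name
```

The person collecting data would work only from blinded_labels, while allocation_key stays with a colleague until the analysis stage.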

Statistical Analysis

3. Inappropriate analysis

The next ‘sin’ is inappropriate analysis for the data. All datasets have a set of characteristics, such as the type of outcome that is measured, the number of variables that are used to explain the outcome, and other structural features that provide context. There are numerous statistical approaches available, but unfortunately, too often an experiment is reduced to a relatively simple analysis, such as a t-test or chi-squared test. This is not only an incorrect use of statistical tests but also a missed opportunity to explore a biological question in greater context. The solution is for researchers to upgrade and expand their statistical knowledge so that they can apply the analysis that best matches the complexity of the experiment.

4. Pseudoreplication

A relatively common ‘sin’ is pseudoreplication. This is either a lack of adequate replication of independent experimental units or the application of statistical tests to the wrong unit of replication. The most common source of pseudoreplication involves using a test that was intended for independent samples on data derived from clustered samples. A classic example is when ten mice are measured three times each in an experiment, and all 30 datapoints are put into a t-test. Because the three measurements from each mouse are not independent, the standard error of the group mean is calculated as if there were 30 independent mice, and is therefore too small for a sample of ten. This makes it more likely that the t-test gives a spuriously small p value that cannot be replicated in another study using the correct sample size of ten. The solution is to understand the correct unit of replication for the hypothesis of interest. Multiple measurements of the same unit require tests of related datapoints, such as repeated measures ANOVA or longitudinal analyses. Similarly, samples that come from a clear hierarchical structure – littermates, tanks of fish, plots of land within a field – require tests that take account of the nesting of treatment groups within a larger structure.
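The mouse example can be sketched with simulated data. The snippet below is an illustration only: the between-mouse and within-mouse standard deviations are invented values, and averaging to one value per mouse is used here as the simplest way to respect the unit of replication, not as a substitute for the repeated measures or hierarchical analyses mentioned above.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the illustration is repeatable

N_MICE, N_REPEATS = 10, 3
BETWEEN_SD, WITHIN_SD = 2.0, 0.5  # hypothetical values, chosen for illustration

# Each mouse has its own underlying value; repeat measurements scatter around it.
measurements = []
for _ in range(N_MICE):
    mouse_value = rng.gauss(20.0, BETWEEN_SD)
    measurements.append([rng.gauss(mouse_value, WITHIN_SD) for _ in range(N_REPEATS)])


def standard_error(values):
    """Standard error of the mean, assuming the values are independent."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Pseudoreplicated: all 30 datapoints treated as independent (n = 30).
all_points = [x for mouse in measurements for x in mouse]
naive_se = standard_error(all_points)

# Correct unit of replication: one mean per mouse (n = 10).
mouse_means = [statistics.mean(mouse) for mouse in measurements]
per_mouse_se = standard_error(mouse_means)

print(f"SE treating 30 points as independent: {naive_se:.3f}")
print(f"SE using one mean per mouse (n = 10): {per_mouse_se:.3f}")
```

When most of the variation lies between mice, as assumed here, the first standard error comes out noticeably smaller than the second, which is exactly what inflates the test statistic.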

5. Insufficient power and overreliance on p values

The final analysis ‘sin’ is insufficient statistical power and overreliance on p values to determine whether a treatment had an effect. Low statistical power contributes to a lack of reproducibility because the p values from underpowered studies are themselves unreliable. The solution is to understand the relationship between N (total sample size), the effect size, and statistical power. The most important element in any analysis is the effect size – the magnitude of difference between treatment groups. Statistical power is the ability of the statistical test to ‘see’ the effect size. Large sample sizes provide statistical power by reducing the width of the confidence intervals around the group means, making effect size differences more obvious. Given a large enough sample, even a very small effect size difference will break the threshold of significance in a statistical test, which is why a p value on its own says little about the size of an effect. The conventional aim for a study is to have statistical power of 0.8, that is, an 80% chance of detecting the effect the experiment is designed to find if that effect really exists, and there are many tools available to help researchers find the correct sample size to achieve this.
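To make the relationship between sample size, effect size, and power concrete, here is a minimal sketch using the standard normal approximation for comparing two group means: n per group = 2 × (z for 1 − α/2 + z for power)² / d², where d is the standardized effect size (the difference in means divided by the standard deviation). The function name and the example effect sizes are assumptions for illustration, and the approximation gives slightly smaller numbers than an exact t-test calculation, so a dedicated power analysis tool should be used for a real study.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided comparison of
    two means, using the normal approximation (not an exact t-test)."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Smaller standardized effects need far more samples for the same power.
for d in (1.0, 0.8, 0.5):
    print(f"effect size {d}: about {n_per_group(d)} per group")
```

Halving the effect size roughly quadruples the required sample size, which is why a realistic estimate of the expected effect matters so much at the design stage.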

Interpretation / Reporting

6. Overinterpreting the data

The next ‘sin’ is overinterpreting the data. We have an established hierarchy of evidence in biosciences, where observational studies rank lower in their ability to convince the reader than experimental studies, which in turn rank lower than reviews and meta-analyses. When chance findings and highly exploratory work are presented as if they were rigorously designed confirmatory tests of a specific hypothesis, or claims are made using data from a highly specific or restricted set of samples, results are difficult to reproduce. This is because the foundations of the original claims were not as high up the hierarchy as stated. The solution is to be clear about the limitations of the experiment, acknowledge sources of bias, and be realistic about what the data show. Reproducible science does not mean simply finding the same things again, but being transparent and realistic about what was done and the limits of what we can claim.

7. Insufficient detail to correctly interpret the results

Finally, the most important ‘sin’ of all – insufficient detail to correctly interpret the results. Large-scale reviews undertaken by major scientific governing bodies in the UK and US have shown that most failures to reproduce results – that is, to directly repeat the finding of a previous study using the same data – were due to a lack of sufficient information to repeat the experiment or analysis. Key areas that are historically and routinely missing from research papers include randomization and blinding, sample size calculations, statistical details, effect sizes, and details of the housing and husbandry of the study population. The solution is to raise publishing standards to improve communication. To achieve this, there is a growing collection of guidelines, such as ARRIVE, TOP and SAMPL, which can be used to standardize publishing requirements and ensure experiments are faithfully reported and can be reproduced.

~~~~

Tackling reproducibility and demonstrating that scientific methods and research are robust is becoming a key determinant of funding success, ethical approval, and publishing, so it is in all our interests to be aware of these issues and their solutions.

Resources and further information:

Webpages on the staff intranet for Life & Health Sciences: https://uob.sharepoint.com/sites/life-sciences/SitePages/Reproducibility-in-Life-Sciences.aspx

Altman, D.G. 1982. How large a sample? In: Gore, S.M. and Altman, D.G., editors. Statistics in Practice. London, UK: British Medical Association. https://doi.org/10.1136/bmj.281.6251.1336

Serdar, C.C. et al. 2021. Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochem Med, 31(1): 010502. https://doi.org/10.11613/BM.2021.010502

HoC Science, Innovation and Technology Committee. Reproducibility and Research Integrity. 2023. https://publications.parliament.uk/pa/cm5803/cmselect/cmsctech/101/report.html

Hurlbert, S.H. 1984. Pseudoreplication and the Design of Ecological Field Experiments. Ecological Monographs, 54, 187-211. https://doi.org/10.2307/1942661

Lazic, S.E., 2010. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience, 11:5. https://doi.org/10.1186/1471-2202-11-5

Lazic, S.E., et al. 2018. What exactly is ‘N’ in cell culture and animal experiments? PLoS Biology. 16:e2005282. https://doi.org/10.1371/journal.pbio.2005282

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. https://doi.org/10.17226/25303

Taylor, M. (2024). Experimental Design for Reproducible Science. Talk given at Reproducibility by Design symposium, University of Bristol, June 2024. https://osf.io/uc6qv

Author:

Dr Michelle Taylor is a Senior Research Associate in Statistics and Experimental Design based at the University of Bristol. Her research background is in Evolutionary Genetics and Ecology, focusing on sexual selection and reproductive fitness, and she has previously held research positions at the Universities of Exeter and Western Australia. She moved into her current role at the University of Bristol in 2018 to support the Animal Welfare Ethics Review Board and promote a wider focus on reproducibility across the Biosciences.