The Seven Sins of Experimental Design and Statistics

By Michelle Taylor

Reproducibility is fast becoming the ‘currency’ of credible, robust, scientific research. If a result cannot be reproduced with the same dataset or methods, the validity of the original work is called into question. I am the resident Biostatistician in the Life and Health Sciences Faculty, and it is my job to ensure that biological research is reproducible, so that the University of Bristol fulfils its legal obligations in terms of the 3Rs of animal welfare and maintains its reputation for credible research. Here I describe seven reproducibility ‘sins’ that I commonly encounter, along with some solutions to help researchers improve the reproducibility of their work. They cover the experimental journey from design, through analysis, to final publication.


Experimental Design

1. Insufficient knowledge about the sample population

The first ‘sin’ is an insufficient knowledge of the sample population. When we take a sample of data, we need to ensure that it is suitably random and representative of the population we want to study. This means we need to understand much more about where the experimental samples have come from before they end up in the experiment. For example, zebrafish are cultured under a wide range of environmental parameters, which means that the effects produced in different labs could have been influenced by more than just the experimental treatment. Unknown sources of variation contribute to a lack of reproducibility between studies conducted at various times and places. The solution is to understand as much as possible about where the samples come from so that all relevant sources of variation can be identified and controlled.

2. Lack of randomization and blinding

The second ‘sin’ is inappropriate randomization and/or blinding. Randomizing treatments to independent samples, and preventing the researcher from knowing the allocation at the time of data collection, are the primary methods of guarding against bias during an experiment. Biases contribute to a lack of reproducibility because they obscure the true treatment effects and weaken the validity of the experiment. The solution is to think carefully about which experimental units should be randomly allocated, and also how, when, and by whom, to avoid biases in procedure, time, and personnel.
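As a simple illustration (the unit labels and numbers are hypothetical, not from the article), treatments can be allocated in R so that the allocation sequence is generated by the software rather than chosen by the experimenter, with the seed recorded so the allocation can be reproduced:

set.seed(2024)                                    # record the seed so the allocation can be reproduced
units <- paste0("unit_", 1:20)                    # 20 hypothetical experimental units (e.g., cages or tanks)
allocation <- sample(rep(c("control", "treatment"), each = 10))  # shuffled, balanced allocation
data.frame(unit = units, group = allocation)      # allocation list to pass to a colleague who stays blinded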

Statistical Analysis

3. Inappropriate analysis

The next ‘sin’ is an inappropriate analysis for the data. All datasets have a set of characteristics, such as the type of outcome that is measured, the number of variables used to explain the outcome, and other structural features that provide context. There are numerous statistical approaches available, but too often an experiment is reduced to a relatively simple analysis, such as a t-test or chi-squared test. This is often an incorrect use of the test, and it is also a missed opportunity to explore the biological question in its full context. The solution is for researchers to upgrade and expand their statistical knowledge so that they can apply an analysis that matches the complexity of the experiment.
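As a hypothetical sketch (the variables and numbers are invented for illustration), the same treatment comparison can often be expressed as a linear model, which makes it straightforward to account for additional structure such as a covariate:

# simulated example data; 'outcome', 'treatment' and 'bodyweight' are hypothetical
set.seed(1)
dat <- data.frame(treatment  = rep(c("control", "treated"), each = 15),
                  bodyweight = rnorm(30, mean = 25, sd = 3))
dat$outcome <- 2 + 0.5 * (dat$treatment == "treated") + 0.1 * dat$bodyweight + rnorm(30)

t.test(outcome ~ treatment, data = dat)                  # the simple comparison in isolation

fit <- lm(outcome ~ treatment + bodyweight, data = dat)  # the same comparison, adjusted for a covariate
summary(fit)
confint(fit)                                             # effect estimate with a confidence interval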

4. Pseudoreplication

A relatively common ‘sin’ is pseudoreplication. This is either a lack of adequate replication of independent experimental units, or a failure to apply the statistical test at the correct unit of replication. The most common source of pseudoreplication is using a test intended for independent samples on data derived from clustered samples. A classic example is when ten mice are measured three times each in an experiment, and all 30 datapoints are put into a t-test. The repeated measurements of each mouse mean that the standard error for the group mean is too small relative to that of a random, independent sample of 30 mice. This makes it more likely that the t-test gives a spuriously small p value that cannot be replicated in another study using the correct sample size of ten. The solution is to understand the correct unit of replication for the hypothesis of interest. Multiple measurements of the same unit require tests for related datapoints, such as repeated measures ANOVA or longitudinal analyses. Similarly, samples that come from a clear hierarchical structure – littermates, tanks of fish, plots of land within a field – require tests that take account of the nesting of treatment groups within a larger structure.
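As a hypothetical illustration of the mouse example above (simulated data, not from the article), the R sketch below contrasts the naive analysis of all 30 datapoints with an analysis of one summary value per mouse, the true unit of replication:

# simulated data: 10 mice (5 per group), 3 measurements each,
# no true treatment effect, only mouse-to-mouse variation
set.seed(13)
n_mice <- 10
mouse_means <- rnorm(n_mice, mean = 10, sd = 2)          # each mouse has its own baseline
dat <- data.frame(
  mouse = factor(rep(1:n_mice, each = 3)),
  group = rep(rep(c("control", "treated"), each = 5), each = 3),
  value = rep(mouse_means, each = 3) + rnorm(n_mice * 3, sd = 0.5)
)

# Pseudoreplicated: all 30 correlated datapoints treated as independent,
# so the standard error is too small and the p value too optimistic
t.test(value ~ group, data = dat)

# Correct unit of replication: one summary value per mouse (n = 10)
per_mouse <- aggregate(value ~ mouse + group, data = dat, FUN = mean)
t.test(value ~ group, data = per_mouse)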

5. Insufficient power and overreliance on p values

The final analysis ‘sin’ is insufficient statistical power and an overreliance on p values to determine whether a treatment had an effect. Both contribute to a lack of reproducibility: results from underpowered studies are unreliable, and a p value on its own says nothing about the size of an effect. The solution is to understand the relationship between N (total sample size), the effect size, and statistical power. The most essential element in any analysis is the effect size – the magnitude of the difference between treatment groups. Statistical power is the ability of the statistical test to ‘see’ that effect size. Large sample sizes provide statistical power by narrowing the confidence intervals around the group means, making differences between groups easier to detect. With an appropriately calculated sample size, an effect of the magnitude the study is designed to detect will reliably cross the threshold of significance. The conventional aim is a statistical power of 0.8, that is, an 80% chance of detecting the effect the experiment is designed to find if it really exists, and there are many tools available to help researchers find the sample size needed to achieve this.
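As a worked illustration with hypothetical numbers, base R’s power.t.test() shows how sample size, effect size and power are linked:

# How many subjects per group are needed to detect a difference of 1 unit
# between group means, with a standard deviation of 2, at 80% power?
power.t.test(delta = 1, sd = 2, power = 0.8, sig.level = 0.05)

# Conversely, what power does a fixed design of 10 per group achieve?
power.t.test(n = 10, delta = 1, sd = 2, sig.level = 0.05)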

Interpretation / Reporting

6. Overinterpreting the data

The next ‘sin’ is overinterpreting the data. We have an established hierarchy of evidence in the biosciences, where observational studies rank lower in their ability to convince the reader than experimental studies, which in turn rank lower than reviews and meta-analyses. When chance findings and highly exploratory work are presented as if they were rigorously designed confirmatory tests of a specific hypothesis, or claims are made from data that were highly specific or restricted in the samples used, results are difficult to reproduce, because the foundations of the original claims were not as high up the hierarchy as stated. The solution is to be clear about the limitations of the experiment, acknowledge sources of bias, and be realistic about what the data show. Reproducible science does not simply mean finding the same things again; it means being transparent and realistic about what was done and the limits of what we can claim.

7. Insufficient detail to correctly interpret the results

Finally, the most important ‘sin’ of all – insufficient detail to correctly interpret the results. Large-scale reviews undertaken by major scientific governing bodies in the UK and US have shown that most failures to reproduce results – that is, to directly repeat the finding of a previous study using the same data – were due to a lack of sufficient information to repeat the experiment or analysis. Key areas that are historically and routinely missing from research papers include randomization and blinding, sample size calculations, statistical details, effect sizes, and details of the housing and husbandry of the study population. The solution is to raise publishing standards to improve communication. To achieve this, there is a growing collection of guidelines, such as ARRIVE, TOP and SAMPL, which can be used to standardize publishing requirements and ensure experiments are faithfully reported and can be reproduced.

~~~~

Tackling reproducibility and demonstrating that scientific methods and research are robust is becoming a key determinant of funding success, ethical approval, and publishing, so it is in all our interests to be aware of these issues and their solutions.

Resources and further information:

Webpages on the staff intranet for Life & Health Sciences: https://uob.sharepoint.com/sites/life-sciences/SitePages/Reproducibility-in-Life-Sciences.aspx

Altman DG. 1982. How large a sample? In: Gore SM, Altman DG, editors. Statistics in Practice. London, UK: British Medical Association. https://doi.org/10.1136/bmj.281.6251.1336

Serdar, C.C. et al. 2021. Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochem Med, 31(1): 010502. https://doi.org/10.11613/BM.2021.010502

HoC Science, Innovation and Technology Committee. Reproducibility and Research Integrity. 2023. https://publications.parliament.uk/pa/cm5803/cmselect/cmsctech/101/report.html

Hurlbert, S.H. 1984. Pseudoreplication and the Design of Ecological Field Experiments. Ecological Monographs, 54, 187-211. https://doi.org/10.2307/1942661

Lazic, S.E., 2010. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience, 11:5. https://doi.org/10.1186/1471-2202-11-5

Lazic, S.E., et al. 2018. What exactly is ‘N’ in cell culture and animal experiments? PLoS Biology. 16:e2005282. https://doi.org/10.1371/journal.pbio.2005282

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. https://doi.org/10.17226/25303

Taylor, M. (2024). Experimental Design for Reproducible Science. Talk given at Reproducibility by Design symposium, University of Bristol, June 2024. https://doi.org/10.17605/OSF.IO/NCWE6

Author:

Dr Michelle Taylor is a Senior Research Associate in Statistics and Experimental Design based at the University of Bristol. Her research background is in Evolutionary Genetics and Ecology, focusing on sexual selection and reproductive fitness, and she has previously held research positions at the Universities of Exeter and Western Australia. She moved into her current role at the University of Bristol in 2018 to support the Animal Welfare Ethics Review Board and promote a wider focus on reproducibility across the Biosciences.

Using Synthetic Datasets to Promote Research Reproducibility and Transparency

By Dan Major-Smith

Scientific best practice is moving towards increased openness, reproducibility and transparency, with data and analysis code increasingly being publicly available alongside published research. In conjunction with other changes to the traditional academic system – including pre-registration/Registered Reports, better research methods and statistics training, and altering the current academic incentive structures – these shifts are intended to improve trust, reproducibility and rigour in science.

Making data and code openly available can improve trust and transparency in research by allowing others to replicate and interrogate published results. This means that published results can be independently verified, and it can even help spot errors in analyses, as has happened in several high-profile examples. In those cases, because the data and code were open, the errors could be spotted and the scientific record corrected. It is impossible to know how many papers without publicly available data and/or code suffer from similar issues. Because of this, journals are increasingly mandating both data and code sharing, the BMJ being a recent example. As a further bonus, if data and code are available, readers can try out new analysis methods on them, improving statistical literacy.

Despite these benefits and the continued push towards data sharing, many researchers still do not openly share their data. While this varies by discipline, with predominantly experimental fields such as Psychology having higher rates of data sharing, there is plenty of room for improvement. In the Medical and Health Sciences, for instance, a recent meta-analysis estimated that only 8% of research was declared as ‘publicly available’, with only 2% actually being publicly available. The rate of code sharing was even more dire, with less than 0.5% of papers publicly sharing analysis scripts.

~~~~

Although data sharing should be encouraged wherever possible, there are some circumstances in which the raw data simply cannot be made publicly available (although usually the analysis code can). For instance, many studies – in particular longitudinal population-based studies, which collect large amounts of data on large numbers of people over long periods of time – prohibit open data release in order to preserve participant anonymity and confidentiality, protect sensitive data, and ensure that only legitimate researchers are able to access the data.

ALSPAC (the Avon Longitudinal Study of Parents and Children; https://www.bristol.ac.uk/alspac/), a longitudinal Bristol-based birth cohort housed within the University of Bristol, is one such example. As ALSPAC has data on approximately 15,000 mothers, their partners and their offspring, with over 100,000 variables in total (excluding genomics and other ‘-omics’ data), it has a policy of not allowing data to be released alongside published articles.

These are valid reasons for restricting data sharing, but nonetheless are difficult to square with open science best practices of data sharing. So, if we want to share these kinds of data, what can we do?

~~~~

One potential solution, which we have recently embedded within ALSPAC, is to release synthetic data rather than the actual raw data. Synthetic data are modelled on the observed data, maintaining both the original distributions (e.g., means, standard deviations, cell counts) and the relationships between variables (e.g., correlations). Importantly, while the key features of the original data are preserved, the observations are generated from statistical models, meaning that they do not correspond to real-life individuals, hence preserving participant anonymity.

These synthetic datasets can then be made publicly available alongside the published paper in lieu of the original data, allowing researchers to:

  • Explore the raw (synthetic) data
  • Understand the analyses better
  • Reproduce analyses themselves

A description of the way in which we generated synthetic data for our work is included in the ‘In depth’ section at the end of this blog post.

While the synthetic data will not be exactly the same as the observed data, making synthetic data openly available adds a further level of openness, accountability and transparency where previously no data would have been available. Further, synthetic datasets provide a reasonable compromise between the competing demands of promoting data sharing and open-science practices and maintaining control over access to potentially sensitive data.

Given these features, working with the ALSPAC team, we developed a checklist for generating synthetic ALSPAC data. We hope that users of ALSPAC data – and researchers using other datasets which currently prohibit data sharing – make use of this synthetic data approach to help improve research reproducibility and transparency.

So, in short: Share your data! (but if you can’t, share synthetic data). 

~~~~

Reference:

Major-Smith et al. (2024). Releasing synthetic data from the Avon Longitudinal Study of Parents and Children (ALSPAC): Guidelines and applied examples. Wellcome Open Research, 9, 57. DOI: 10.12688/wellcomeopenres.20530.1 – Further details (including references therein) on this approach, specifically applied to releasing synthetic ALSPAC data.

Other resources:

The University Library Services’ guide to data sharing.

The ALSPAC guide to publishing research data, including the ALSPAC synthetic data checklist.

The FAIR data principles – there is a wider trend in funder, publisher and institutional policies towards FAIR data, which may or may not be fully open but which are nevertheless accessible even where circumstances may prevent fully open publication.

Author:

Dan Major-Smith is a somewhat-lapsed Evolutionary Anthropologist who now spends most of his time working as an Epidemiologist. He is currently a Senior Research Associate in Population Health Sciences at the University of Bristol, and works on various topics, including selection bias, life course epidemiology and relations between religion, health and behaviour. He is also interested in meta-science/open scholarship more broadly, including the use of pre-registration/Registered Reports, synthetic data and ethical publishing. Outside of academia, he fosters cats and potters around his garden pleading with his vegetables to grow.

~~~~

In depth

In our recent paper, we demonstrate how synthetic data generation methods can be applied using the excellent ‘synthpop’ package in the R programming language. Our example is based on an openly available subset of the ALSPAC data, so that researchers can fully replicate these analyses (with scripts available on a GitHub page).

There are four main steps when synthesising data, which we demonstrate below, along with example R code (for full details see the paper and associated scripts):

1. After preparing the dataset, create a synthetic dataset, using a seed so that results are reproducible (here we are just using the default ‘classification and regression tree’ method; see the ‘synthpop’ package and documentation for more information) 

library(synthpop)  # the steps below use the 'synthpop' package, so load it first

dat_syn <- syn(dat, seed = 13327)  # 'dat' is the prepared observed dataset

2. To minimise the potential disclosure risk, when synthesising ALSPAC data we recommend removing individuals who are uniquely-identified in both the observed and synthetic datasets (in this example, only 4 of the 3,727 observations were removed [0.11%]) 

dat_syn <- sdc(dat_syn, dat, rm.replicated.uniques = TRUE)  # statistical disclosure control: remove replicated unique records

3. Compare the variable distributions between the observed and synthetic data to ensure these are similar (see image below) 

compare(dat_syn, dat, stat = "count")  # observed vs synthetic distributions for each variable

4. Compare the relationships between variables in the observed and synthetic data to check similarity, here using a multivariable logistic regression model to explore whether maternal postnatal depressive symptoms are associated with offspring depression in adolescence (see image below) 

model.syn <- glm.synds(depression_17 ~ mat_dep + matage + ethnic + gender + mated + housing,
                       family = "binomial", data = dat_syn)  # fit the model to the synthetic data
compare(model.syn, dat)  # compare estimates with the same model fitted to the observed data

As can be seen, although there are some minor differences between the observed and synthetic data, overall the correspondence is quite high.
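Once these checks look reasonable, the synthetic dataset can be extracted and exported so that it can be shared alongside the paper. The sketch below is a possible final step rather than part of the published workflow; the file name is hypothetical, and the available export options are described in the ‘synthpop’ documentation:

head(dat_syn$syn)  # the synthetic data frame itself is stored in the 'syn' element of the object

write.syn(dat_syn, filename = "synthetic_data_example", filetype = "csv")  # export for sharing (hypothetical file name)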