Using Synthetic Datasets to Promote Research Reproducibility and Transparency

By Dan Major-Smith

Scientific best practice is moving towards increased openness, reproducibility and transparency, with data and analysis code increasingly being publicly available alongside published research. In conjunction with other changes to the traditional academic system – including pre-registration/Registered Reports, better research methods and statistics training, and altering the current academic incentive structures – these shifts are intended to improve trust, reproducibility and rigour in science.

Making data and code openly available can improve trust and transparency in research by allowing others to replicate and interrogate published results. This means that published results can be independently verified, and potential errors in analyses can be spotted, as in this and this high-profile example. In those cases, because the data and code were open, the errors could be identified and the scientific record corrected. It is impossible to know how many papers without publicly available data and/or code suffer from similar issues. Because of this, journals are increasingly mandating both data and code sharing, with the BMJ being a recent example. As an added bonus, when data and code are available, readers can use them to try out new analysis methods, improving statistical literacy.

Despite these benefits and the continued push towards data sharing, many researchers still do not openly share their data. While this varies by discipline, with predominantly experimental fields such as Psychology having higher rates of data sharing, there is plenty of room for improvement. In the Medical and Health Sciences, for instance, a recent meta-analysis estimated that only 8% of research was declared as ‘publicly available’, with only 2% actually being publicly available. The rate of code sharing was even more dire, with less than 0.5% of papers publicly sharing analysis scripts.

~~~~

Although data sharing should be encouraged wherever possible, there are some circumstances where the raw data simply cannot be made publicly available (although usually the analysis code can). For instance, many studies – and in particular longitudinal population-based studies which collect large amounts of data on large numbers of people for long periods of time – prohibit data sharing for reasons of preserving participant anonymity and confidentiality, data sensitivity, and to ensure that only legitimate researchers are able to access the data.

ALSPAC (the Avon Longitudinal Study of Parents and Children; https://www.bristol.ac.uk/alspac/), a longitudinal Bristol-based birth cohort housed within the University of Bristol, is one such example. As ALSPAC has data on approximately 15,000 mothers, their partners and their offspring, with over 100,000 variables in total (excluding genomics and other ‘-omics’ data), it has a policy of not allowing data to be released alongside published articles.

These are valid reasons for restricting access, but they are nonetheless difficult to square with the open science best practice of data sharing. So, if we want to share these kinds of data, what can we do?

~~~~

One potential solution, which we have recently embedded within ALSPAC, is to release synthetic data rather than the actual raw data. Synthetic data are modelled on the observed data, maintaining both the original distributions of each variable (e.g., means, standard deviations, cell counts) and the relationships between variables (e.g., correlations). Importantly, while these key features of the original data are preserved, the synthetic data are generated from statistical models, meaning that observations do not correspond to real-life individuals, hence preserving participant anonymity.

These synthetic datasets can then be made publicly available alongside the published paper in lieu of the original data, allowing researchers to:

  • Explore the raw (synthetic) data
  • Understand the analyses better
  • Reproduce analyses themselves

A description of the way in which we generated synthetic data for our work is included in the ‘In depth’ section at the end of this blog post.

While the synthetic data will not be exactly the same as the observed data, making them openly available adds a further level of openness, accountability and transparency where previously no data would have been available at all. Synthetic datasets therefore offer a reasonable compromise between the competing demands of promoting data sharing and open-science practices on the one hand, and maintaining control over access to potentially sensitive data on the other.

Given these features, working with the ALSPAC team, we developed a checklist for generating synthetic ALSPAC data. We hope that users of ALSPAC data – and researchers using other datasets which currently prohibit data sharing – make use of this synthetic data approach to help improve research reproducibility and transparency.

So, in short: Share your data! (but if you can’t, share synthetic data). 

~~~~

Reference:

Major-Smith et al. (2024). Releasing synthetic data from the Avon Longitudinal Study of Parents and Children (ALSPAC): Guidelines and applied examples. Wellcome Open Research, 9, 57. DOI: 10.12688/wellcomeopenres.20530.1 – Further details (including references therein) on this approach, specifically applied to releasing synthetic ALSPAC data.

Other resources:

The University Library Services’ guide to data sharing.

The ALSPAC guide to publishing research data, including the ALSPAC synthetic data checklist.

The FAIR data principles – there is a wider trend in funder, publisher and institutional policies towards FAIR data, which may or may not be fully open but which are nevertheless accessible even where circumstances may prevent fully open publication.

Author:

Dan Major-Smith is a somewhat-lapsed Evolutionary Anthropologist who now spends most of his time working as an Epidemiologist. He is currently a Senior Research Associate in Population Health Sciences at the University of Bristol, and works on various topics, including selection bias, life course epidemiology and relations between religion, health and behaviour. He is also interested in meta-science/open scholarship more broadly, including the use of pre-registration/Registered Reports, synthetic data and ethical publishing. Outside of academia, he fosters cats and potters around his garden pleading with his vegetables to grow.

~~~~

In depth

In our recent paper, we demonstrate how synthetic data generation methods can be applied using the excellent ‘synthpop’ package in the R programming language. Our example is based on an openly available subset of the ALSPAC data, so that researchers can fully replicate these analyses (with scripts available on a GitHub page).

There are four main steps when synthesising data, which we demonstrate below, along with example R code (for full details see the paper and associated scripts):

1. After preparing the dataset, create a synthetic dataset, using a seed so that results are reproducible (here we are just using the default ‘classification and regression tree’ method; see the ‘synthpop’ package and documentation for more information) 

library(synthpop)
dat_syn <- syn(dat, seed = 13327)
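
The object returned by syn() is a 'synds' object rather than a plain data frame; as a brief sketch (using the object's $method and $syn components described in the synthpop documentation), the synthesis methods and the synthetic data themselves can be inspected with:

dat_syn$method      # synthesis method used for each variable ("cart" by default)
head(dat_syn$syn)   # the synthetic data frame itself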

2. To minimise the potential disclosure risk, when synthesising ALSPAC data we recommend removing individuals who are uniquely identified in both the observed and synthetic datasets (in this example, only 4 of the 3,727 observations were removed [0.11%]) 

dat_syn <- sdc(dat_syn, dat, rm.replicated.uniques = TRUE) 
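
As a quick sanity check (a sketch, not one of the paper's steps), the number of synthetic observations remaining after this disclosure control step can be inspected directly via the $syn element:

nrow(dat_syn$syn)   # should now be 3,723 (3,727 minus the 4 replicated uniques)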

3. Compare the variable distributions between the observed and synthetic data to ensure these are similar (see image below) 

compare(dat_syn, dat, stat = "count") 
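
Beyond these plots, simple side-by-side summaries of individual variables can also be useful; as a sketch, taking the maternal depressive symptoms variable (mat_dep) used in the model below as an example:

summary(dat$mat_dep)          # observed data
summary(dat_syn$syn$mat_dep)  # synthetic data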

4. Compare the relationships between variables in the observed and synthetic data to check similarity, here using a multivariable logistic regression model to explore whether maternal postnatal depressive symptoms are associated with offspring depression in adolescence (see image below) 

model.syn <- glm.synds(depression_17 ~ mat_dep + matage + ethnic + gender + mated + housing, family = "binomial", data = dat_syn) 
compare(model.syn, dat) 
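
Here compare() contrasts the estimates from the synthetic data with those obtained by fitting the same model to the observed data. For reference, this comparison can also be done manually (a sketch):

# Fit the equivalent model to the observed data and inspect both sets of estimates
model.obs <- glm(depression_17 ~ mat_dep + matage + ethnic + gender + mated + housing, family = "binomial", data = dat)
summary(model.obs)$coefficients
summary(model.syn)   # estimates from the model fitted to the synthetic data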

As can be seen, although there are some minor differences between the observed and synthetic data, overall the correspondence is quite high.
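
Finally, once these checks look satisfactory, the synthetic dataset can be exported for sharing alongside the paper. As a sketch (with a hypothetical file name; this step is not part of the four above), synthpop's write.syn() helper can be used, or the synthetic data frame can simply be extracted and saved with base R:

# Export the synthetic data for sharing (hypothetical file name)
write.syn(dat_syn, filename = "synthetic_data_example", filetype = "csv")

# Or extract the synthetic data frame and save it directly
write.csv(dat_syn$syn, "synthetic_data_example.csv", row.names = FALSE)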