Synthetic Sample Generation of the 4S Study Placebo Population Using a Stochastic Sampling Technique


We have successfully generated a synthetic sample of people whose group-level characteristics, such as mean age and gender split, match those of a real-life sample of people who took part in a clinical trial.

This synthetic sample of people could be used to populate clinical or economic models, allowing for preliminary assessments of interventions with minimal costs.

The population was generated from a base case of synthetic individuals whose characteristics match those of the general population of England. This means that while the mean group characteristics (age, gender, BMI, blood pressure, cholesterol levels, smoking behaviour, and alcohol consumption) were generated to match the chosen clinical trial population, the correlations and distributions of such characteristics behave similarly to those of a real-world population.


Supplementary Information:

Our aim was to generate a synthetic sample of individuals with similar average values to the risk factor variables in the 4S study sample, and similar distributions and correlation structure compared with the general population. Using this new approach, we were able to successfully generate synthetic samples that were comparable to the originals in aggregate.

We previously developed a technique to generate a synthetic population that matches the distributions and correlation structure observed in the Health Survey for England [1]. Using this synthetic population as a base case, we applied a stochastic resampling technique, implemented in R, to generate a semi-random sample of people whose characteristics match those of the control group in the Scandinavian Simvastatin Survival Study (4S) [2].
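As an illustration only, a stochastic resampling step of this kind could look something like the Python sketch below. The original analysis was carried out in R and its exact algorithm is not described here, so the greedy random-swap search, the function names `mean_gap` and `resample_to_targets`, the toy base population and the target means are all assumptions made for the example.

```python
import random

def mean_gap(sample, targets):
    """Sum of relative absolute gaps between sample means and target means."""
    n = len(sample)
    return sum(
        abs(sum(person[key] for person in sample) / n - target) / abs(target)
        for key, target in targets.items()
    )

def resample_to_targets(base, targets, size, iterations=5000, seed=0):
    """Greedy random-swap search: keep a swap only if it narrows the gap."""
    rng = random.Random(seed)
    pool = list(base)
    rng.shuffle(pool)
    sample, reserve = pool[:size], pool[size:]
    start_gap = best_gap = mean_gap(sample, targets)
    for _ in range(iterations):
        i, j = rng.randrange(size), rng.randrange(len(reserve))
        sample[i], reserve[j] = reserve[j], sample[i]      # propose a swap
        gap = mean_gap(sample, targets)
        if gap < best_gap:
            best_gap = gap                                 # keep the swap
        else:
            sample[i], reserve[j] = reserve[j], sample[i]  # undo the swap
    return sample, start_gap, best_gap

# Toy base population (invented numbers, not Health Survey for England data).
rng = random.Random(1)
base_population = []
for _ in range(2000):
    age = rng.gauss(50, 15)
    base_population.append({
        "age": age,
        "sbp": 100 + 0.5 * age + rng.gauss(0, 12),  # blood pressure rises with age
        "male": 1 if rng.random() < 0.5 else 0,
    })

# Illustrative target means (placeholders, not the values reported for 4S).
targets = {"age": 58.0, "sbp": 135.0, "male": 0.8}
sample, start_gap, final_gap = resample_to_targets(
    base_population, targets, size=200
)
print(f"gap before: {start_gap:.3f}, gap after: {final_gap:.3f}")
```

Because individuals are only ever swapped in whole, each selected person keeps a realistic combination of risk factors from the base population. Only the means are steered towards the targets. Relative gaps are used so that variables on different scales contribute comparably, which requires non-zero targets.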

The sample was matched on binary variables (gender, smoking status, diabetes) and continuous variables (age, BMI, systolic blood pressure, total cholesterol to HDL cholesterol ratio, cigarettes smoked per day and units of alcohol per week). The mean values for the risk factors matched those of the target sample to an accuracy of 1 decimal place. Table 1 presents the descriptive statistics of the synthetic sample alongside those reported in the 4S study, and Table 2 presents the statistically significant correlations in the synthetic sample data. As expected, there is a high correlation between systolic blood pressure and BMI. There is also a high correlation between BMI and total cholesterol to HDL ratio.
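The two checks described above (means agreeing to 1 decimal place, and a correlation table with non-significant cells left blank) can be sketched in Python as follows. This is a hedged illustration rather than the method used for Tables 1 and 2: the Pearson coefficient, the Fisher z-test at roughly the 5% level, the helper names and the toy data are all assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def is_significant(r, n, z_crit=1.96):
    """Two-sided Fisher z-test of 'no correlation' at roughly the 5% level."""
    if abs(r) >= 1.0:
        return True  # atanh is undefined at +/-1; treat as significant
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit

def matches_to_one_decimal(sample_mean, target_mean):
    """True when both means agree after rounding to 1 decimal place."""
    return round(sample_mean, 1) == round(target_mean, 1)

def correlation_table(columns):
    """Map each variable pair to its rounded r, or None if not significant."""
    names = list(columns)
    n = len(columns[names[0]])
    table = {}
    for a in names:
        for b in names:
            if a != b:
                r = pearson(columns[a], columns[b])
                table[(a, b)] = round(r, 2) if is_significant(r, n) else None
    return table

# Toy data (invented): systolic blood pressure loosely increasing with BMI.
rng = random.Random(0)
bmi = [rng.gauss(27, 4) for _ in range(500)]
sbp = [110 + 1.2 * b + rng.gauss(0, 10) for b in bmi]
table = correlation_table({"bmi": bmi, "sbp": sbp})
```

A `None` entry plays the role of a blank cell in a table such as Table 2.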

Table 1: Mean values and Error Rate


Table 2: Synthetic sample correlations (blank cells indicate a statistically non-significant correlation)


Figures 1 and 2 compare the density distributions of systolic blood pressure and total cholesterol to HDL ratio, respectively, for the synthetic sample and for the overall population data from the Health Survey for England. As expected, the synthetic-sample distributions of both variables were shifted towards higher values, reflecting the higher average values in the high-risk 4S sample.

Figure 1: Systolic Blood Pressure

Figure 2: Total Cholesterol to HDL ratio
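A density comparison of the kind shown in Figures 1 and 2 can be approximated with a simple Gaussian kernel density estimate. The Python sketch below is illustrative only: the `gaussian_kde` helper, the normal-reference bandwidth rule and the two toy samples are assumptions, and the numbers are invented rather than taken from the study or the Health Survey for England.

```python
import math
import random

def gaussian_kde(values, grid, bandwidth=None):
    """Gaussian kernel density estimate of `values`, evaluated on `grid`."""
    n = len(values)
    if bandwidth is None:
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)  # normal-reference rule of thumb
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

# Toy systolic blood pressure values in mmHg (invented for illustration).
rng = random.Random(0)
general = [rng.gauss(125, 15) for _ in range(1000)]    # stand-in for the base population
synthetic = [rng.gauss(140, 15) for _ in range(1000)]  # stand-in for a high-risk sample

step = 2
grid = list(range(80, 201, step))
density_general = gaussian_kde(general, grid)
density_synthetic = gaussian_kde(synthetic, grid)

# Location of each density's peak on the grid.
peak_general = grid[density_general.index(max(density_general))]
peak_synthetic = grid[density_synthetic.index(max(density_synthetic))]
print(peak_general, peak_synthetic)
```

Plotting the two density lists against `grid` gives overlaid curves, with the high-risk sample's curve sitting to the right of the general population's.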

A limitation of this process was the limited data available on baseline patient characteristics: only the mean values were reported in the study publications. With additional data, the process could be improved and the accuracy of the synthetic sample increased.

Given a sufficient amount of data, this approach can be used to model the likely impact of new therapies or predict mortality for various sub-groups. This has been demonstrated by members of our team and will be presented as a poster at ISPOR Barcelona 2018 [3]. This could therefore be a useful tool in the planning and preparation of clinical trials, or in the estimation of variables for model predictions.



[1] Martin, C., & Springate, C.E. (2018). “Synthetic Sample Generation Representing the English Population Using Spearman Rank Correlation and Cholesky Decomposition.” Presented at ISPOR 2018.
[2] Scandinavian Simvastatin Survival Study Group (1994). “Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S).” Lancet 344(8934): 1383-1389.
[3] Hines, J.E., Springate, C.E., & Martin, C. (2018). “Modelling Likely Cardiovascular Disease Mortality with PCSK9 Inhibitors using a Synthetic Population.” Presented at ISPOR Europe 2018.