We have successfully generated a synthetic sample of people whose group characteristics, such as mean age and gender, matched those of a real-life sample of people who took part in a clinical trial.
This synthetic sample of people could be used to populate clinical or economic models, allowing for preliminary assessments of interventions with minimal costs.
The population was generated from a base case of synthetic individuals whose characteristics match those of the general population of England. This means that while the mean group characteristics (age, gender, BMI, blood pressure, cholesterol levels, smoking behaviour, and alcohol consumption) were generated to match the chosen clinical trial population, the correlations and distributions of such characteristics behave similarly to those of a real-world population.
Our aim was to generate a synthetic sample of individuals whose average risk factor values were similar to those of the 4S study sample, while retaining distributions and a correlation structure similar to those of the general population. Using this new approach, we were able to generate synthetic samples that were comparable to the originals in aggregate.
We previously developed a technique to generate a synthetic population that matched the distributions and correlation structure observed in the Health Survey for England. Using this synthetic population as a base case, we applied a stochastic resampling technique, implemented in R, to generate a semi-random sample of people with characteristics that match those of the control group in the Scandinavian Simvastatin Survival Study (4S) [2].
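The resampling idea can be illustrated with a minimal sketch. This is not the R implementation used in the project, whose details are not given here; it is one simple way such a technique could work, assuming a greedy random-swap scheme. The function name `resample_to_target`, the toy base population, and the target means are all hypothetical and invented for illustration (they are not Health Survey for England or 4S values).

```python
import random

def resample_to_target(base, targets, n, iters=20000, seed=0):
    """Greedy random-swap resampling (illustrative assumption, not the
    project's method): draw n rows from `base` so that the sample means
    approach `targets`, a dict mapping variable name -> target mean."""
    rng = random.Random(seed)
    keys = list(targets)

    # Scale each variable by its spread in the base population so that
    # no single variable dominates the distance to the targets.
    scale = {}
    for k in keys:
        vals = [row[k] for row in base]
        mean = sum(vals) / len(vals)
        sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        scale[k] = sd or 1.0

    # Start from a purely random sample (drawn with replacement).
    idx = [rng.randrange(len(base)) for _ in range(n)]
    sums = {k: sum(base[i][k] for i in idx) for k in keys}

    def loss(s):
        # Squared, scaled distance between sample means and target means.
        return sum(((s[k] / n - targets[k]) / scale[k]) ** 2 for k in keys)

    current = loss(sums)
    for _ in range(iters):
        pos = rng.randrange(n)           # sampled individual to replace
        cand = rng.randrange(len(base))  # candidate from the base population
        old, new = base[idx[pos]], base[cand]
        trial = {k: sums[k] - old[k] + new[k] for k in keys}
        trial_loss = loss(trial)
        if trial_loss < current:         # keep the swap only if it helps
            idx[pos], sums, current = cand, trial, trial_loss
    return [base[i] for i in idx]

# Toy base population with invented parameters (illustration only).
gen = random.Random(42)
base_population = []
for _ in range(5000):
    age = gen.gauss(50, 15)
    base_population.append({
        "age": age,
        "sbp": 110 + 0.3 * age + gen.gauss(0, 14),  # correlated with age
        "male": 1 if gen.random() < 0.5 else 0,
    })

# Hypothetical target means for an older, higher-risk, mostly male group.
targets = {"age": 58.0, "sbp": 139.0, "male": 0.8}
sample = resample_to_target(base_population, targets, n=500)
sample_means = {k: sum(r[k] for r in sample) / len(sample) for k in targets}
print(sample_means)
```

Because every individual in the sample is still a complete row taken from the base population, relationships between variables within each individual are carried over rather than generated independently, although selecting towards target means can still change correlations at the sample level.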
The sample was matched on binary variables (gender, smoking status, diabetes) and continuous variables (age, BMI, systolic blood pressure, total cholesterol to HDL cholesterol ratio, cigarettes smoked per day, and units of alcohol per week). The mean values for the risk factors matched those of the target sample to one decimal place. Table 1 presents the descriptive statistics for the synthetic sample alongside those reported in the 4S study, and Table 2 presents the statistically significant correlations in the synthetic sample data. As expected, there is a high correlation between systolic blood pressure and BMI. There is also a high correlation between BMI and total cholesterol to HDL cholesterol ratio.
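The two checks described above, agreement of means to one decimal place and correlation between risk factors, could be sketched as follows. The helper names `matches_to_one_decimal` and `pearson` are hypothetical, and the BMI and blood pressure values are invented toy numbers, not data from the synthetic sample or the 4S study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def matches_to_one_decimal(values, target):
    """True if the sample mean equals the target after rounding to 1 d.p."""
    return round(sum(values) / len(values), 1) == round(target, 1)

# Invented toy values, for illustration only.
bmi = [24.1, 27.3, 25.8, 30.2, 26.6]
sbp = [128.0, 141.0, 135.0, 152.0, 139.0]

bmi_ok = matches_to_one_decimal(bmi, 26.8)  # hypothetical target mean
r_bmi_sbp = pearson(bmi, sbp)
print(bmi_ok, round(r_bmi_sbp, 3))
```

This sketch reports only the correlation coefficient; judging statistical significance, as in Table 2, would additionally require a test on that coefficient.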
Figures 1 and 2 compare the density distributions of total cholesterol to HDL cholesterol ratio and systolic blood pressure in the generated synthetic sample with those in the overall population data from the Health Survey for England. As expected, the distributions of total cholesterol to HDL cholesterol ratio and systolic blood pressure were skewed towards higher values, reflecting the higher average values in the high-risk 4S sample.