Virtual cohorts: a method to overcome challenges in developing evidence in rare and complex diseases?
Eleanor Butler, Consultant at Lightning Health, reviews the use of synthetic data and virtual cohorts to mitigate the challenges of developing evidence in rare and complex diseases.
The rising cost and duration of clinical development programmes for new medicines have been cited as barriers to the development and launch of new therapies. On average, it takes a pharmaceutical company 10 years to complete the R&D process for a new medicine from initial discovery to launch, and for oncology products the median R&D cost is currently ~$2.77bn. Furthermore, owing to ongoing rapid advances in scientific knowledge, manufacturers are focussing on research areas that are increasingly complex, with a higher risk of failure and associated market access challenges.
Traditionally, HTA bodies and payers require “gold-standard” randomised controlled trials (RCTs) against an active clinical comparator, or against placebo/best supportive care (BSC) where an active comparator is not appropriate, in order to inform positive pricing, reimbursement and access decisions. However, companies developing new therapies for rare diseases increasingly face a multitude of challenges with traditional clinical trial methods. These include difficulty recruiting sufficient trial participants in a timely manner, and ethical issues when patients receive a placebo that could expose them to unnecessary burden and risk. These challenges can be key drivers of increased costs and extended development timelines, often delaying patient access to new treatments in areas of high unmet clinical need. The use of synthetic data and virtual cohorts may be one way to mitigate some of these challenges.

What is synthetic data?
Synthetic data, in the context of clinical development, is intended to reduce the need for, or replace, the control arm of a clinical trial. This is achieved by applying a sampling technique to natural history data obtained from electronic databases to create a synthetic control arm. Data are typically taken from patients who have previously received BSC for a specific condition and whose outcomes have been tracked through electronic medical records (EMRs) or wearable medical devices. Sources of patient data can include completed clinical trials, observational studies, and registries.
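To make the idea of sampling a synthetic control arm more concrete, the short Python sketch below filters a pool of historical records down to patients who received BSC and meet a trial's eligibility criteria, then draws a random sample of the required size. It is an illustration only: the record fields, the eligibility criteria, the `build_synthetic_control` function and the toy data are all hypothetical assumptions, and real applications would typically use more sophisticated matching or weighting methods than simple random sampling.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalPatient:
    """One historical record (all fields are hypothetical)."""
    patient_id: str
    age: int
    disease_stage: int
    received_bsc: bool      # True if the patient was managed with best supportive care
    outcome_months: float   # e.g. time to progression, as recorded in an EMR or registry


def build_synthetic_control(records, min_age, max_age, stages, arm_size, seed=0):
    """Sample a synthetic control arm from eligible historical BSC patients."""
    eligible = [
        r for r in records
        if r.received_bsc
        and min_age <= r.age <= max_age
        and r.disease_stage in stages
    ]
    if len(eligible) < arm_size:
        raise ValueError(
            f"only {len(eligible)} eligible records, need {arm_size}"
        )
    rng = random.Random(seed)             # fixed seed so the sample is reproducible
    return rng.sample(eligible, arm_size)  # sampling without replacement


# Toy historical data set, generated purely for illustration.
records = [
    HistoricalPatient(
        patient_id=f"H{i:03d}",
        age=30 + (i * 7) % 50,
        disease_stage=1 + i % 4,
        received_bsc=(i % 3 != 0),
        outcome_months=6.0 + (i % 10),
    )
    for i in range(200)
]

control_arm = build_synthetic_control(
    records, min_age=40, max_age=70, stages={2, 3}, arm_size=20, seed=42
)
print(len(control_arm), "historical patients selected as synthetic controls")
```

The explicit size check reflects a practical constraint: if too few historical patients meet the criteria, a synthetic control arm of the required size cannot be drawn from that data source.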
Currently, most patient-level data sets cannot be easily accessed: individual agreements between institutions are required for research purposes, and researchers face associated challenges in generating subsets or blending data sets. However, the real-world data collected through EMRs or medical devices can be used to create simulation scenarios in which models and processes interact to create a new data set that is not taken directly from the real world. This process creates virtual cohorts that are informed by real-world patient-level cohort data, without sharing any data from real-world subjects.
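As a rough illustration of how a virtual cohort can be informed by real-world data without copying any individual record, the Python sketch below fits a very simple statistical model to a small set of patient values (a normal distribution for age, and a straight-line relationship between age and outcome with random noise) and then simulates new virtual patients from the fitted parameters alone. Everything here is an assumption made for illustration: the variables, the toy values, the choice of model and the function names `fit_simple_model` and `simulate_virtual_cohort` are hypothetical. Real simulation approaches are considerably richer, and passing on only fitted parameters is not in itself a privacy guarantee.

```python
import random
import statistics


def fit_simple_model(ages, outcomes):
    """Summarise real-world data as a handful of model parameters."""
    mean_age = statistics.fmean(ages)
    mean_out = statistics.fmean(outcomes)
    # Ordinary least-squares fit of outcome on age.
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (o - mean_out) for a, o in zip(ages, outcomes))
    slope = sxy / sxx
    intercept = mean_out - slope * mean_age
    residuals = [o - (intercept + slope * a) for a, o in zip(ages, outcomes)]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "slope": slope,
        "intercept": intercept,
        "sd_resid": statistics.stdev(residuals),  # spread of outcomes around the line
    }


def simulate_virtual_cohort(model, n, seed=0):
    """Draw n virtual patients from the fitted model, not from real records."""
    rng = random.Random(seed)
    cohort = []
    for i in range(n):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        noise = rng.gauss(0.0, model["sd_resid"])
        outcome = model["intercept"] + model["slope"] * age + noise
        cohort.append(
            {"virtual_id": f"V{i:04d}", "age": age, "outcome_months": outcome}
        )
    return cohort


# Toy "real-world" values, invented purely for illustration.
real_ages = [34, 41, 47, 52, 58, 63, 69, 74]
real_outcomes = [14.2, 13.1, 12.5, 11.0, 10.4, 9.1, 8.7, 7.2]

model = fit_simple_model(real_ages, real_outcomes)
virtual_cohort = simulate_virtual_cohort(model, n=500, seed=1)
print(len(virtual_cohort), "virtual patients simulated")
```

In this sketch only the `model` dictionary would need to leave the institution holding the source data. The simulated cohort reproduces the broad pattern in the source data (here, shorter outcomes at older ages) rather than any one patient's values.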
Potential benefits and limitations of synthetic data
The key potential benefit of synthetic data for manufacturers is a substantial reduction in the number of patients who need to be recruited, because enrolment into the control arm is minimised, saving both time and resources. A synthetic control arm may contain more heterogeneity than the control arm of a traditional comparative RCT. Even so, there are situations in which the benefits of a virtual cohort may outweigh the risks: rare and complex diseases where recruitment difficulties are prohibitive to clinical development, areas where control group performance is historically well characterised, and situations where endpoints have historically been measured in a standardised way with consistent results between trials.
The use of a virtual cohort may also mitigate some of the ethical issues with placebo control arms, which can be directly linked to difficulties in patient recruitment. This offers potential value to both manufacturers and patients: it minimises unnecessary burden and risk for patients who would otherwise receive placebo (especially when treatment is time-sensitive), and it may increase trial enrolment because patients know that the only treatment they can be assigned is the investigational drug. Theoretically, this may also have value within a regulatory and payer setting, owing to the potentially larger number of patients receiving the treatment. However, the acceptability of synthetic data within these settings is unknown, and there is currently no formal guidance on the use of virtual cohorts.
Despite increasing industry interest in virtual cohorts, most manufacturers are likely to be limited by the availability of data sets for this purpose. Currently, the majority of pharmaceutical companies may only have access to internal historical data sets generated from past clinical trials, observational studies, or registries. These are likely to be too small, and too limited in how far they can be stratified, to support a validated analysis. Broad-ranging industry collaboration will be required to pool historical patient data (and real-world evidence from EMRs) into large, robust data sets that are suitable for generating virtual cohorts. However, the infrastructure to enable such data and knowledge exchange is currently largely undeveloped, and without strong inter-industry and regulatory leadership, transformation will be challenging. Furthermore, the measurement of clinical outcomes needs to be standardised for each disease state to ensure consistency and transferability of data.
Actions for the future
The current R&D paradigm increasingly represents a challenge for drug developers and in some situations is slowing down patient access to the newest scientific advances. Given the rising costs of clinical development and the challenges of generating comparative data for new treatments in complex and/or rare diseases, the use of synthetic data and virtual cohorts may be a feasible way to optimise clinical development in these therapy areas, where HTA systems and payers are often already more flexible in their assessment methods.
The future success of machine learning, AI and big data in transforming the R&D landscape will depend on the availability of mature healthcare data for suitable patient cohorts, with globally consistent standards for defining patient characteristics and measuring clinical outcomes. This is unlikely to be achieved until key industry players can collaborate with regulators, payers, clinical societies, and patient organisations to develop the infrastructure and methodology to collect standardised data across a range of indications and localities.
Article published 15 April 2021.