Issue: November/December 2022
MACHINE LEARNING - Applying Machine Learning Techniques: Gaining Meaningful Life Sciences Insights From Genomics Data
INTRODUCTION
Through advancements in high-throughput technologies and data management systems, we have access to a vast and varied collection of datasets in the biomedicine space, including genomics data. In generating massive amounts of genomics data, there is a tremendous opportunity to gain meaningful insights to personalize medicine to a patient’s particular genomic makeup. However, genomics data is complex, and the data alone is not going to advance therapeutic development toward personalized medicine. For example, to pinpoint the right disease target, we need to know about the entire suite of biological processes. Personalizing therapies requires accurately classifying disease sub-types and the investigational compound’s sensitivity to various genomics profiles. Machine Learning (ML) offers a useful set of tools to glean these valuable insights from genomics data.
We organize our perspective on this complex space by how genomics data interacts with other types of data, including compounds, proteins, electronic health records (EHRs), cellular images, and text, and by the stage of the R&D lifecycle in which experts are applying genomics ML. Given the scale and variability of the data, experts working manually may miss novel patterns that ML can help pinpoint, and those patterns can improve prediction capabilities on varying tasks, such as drug response prediction. Uncovering unique and useful patterns can also lead to the discovery of novel biological insights. Because therapy discovery often consists of large, resource-intensive experiments with a limited scope, many potential therapies may be missed. The predictive capabilities offered by ML solutions can help pharmaceutical and biotech companies shift focus onto additional experiments, providing an opportunity to catch or generate potential options.
As companies look for ways to unlock the potential of assets they may consider developing through genomics data, understanding how ML can help enable the process is critical.
USEFUL ML APPLICATIONS
By helping to identify patterns across varying interactions and extract insights from genomics data, ML applications can support faster and more effective drug development, and they can be applied throughout the entire therapeutic lifecycle. We share a few key applications below, based on a systematic literature review published in the October 2021 issue of the data science journal Patterns.
Target Discovery
Pharmaceutical and biotech companies focus on discovering novel disease targets to determine how an investigational treatment might home in on a molecule to produce a therapeutic effect, such as inhibition, and ultimately block the disease process altogether. Relying heavily on the basics of human biology, these companies use target discovery to identify target biomarkers, which aid in designing therapeutics that can potentially stop the disease pathway and provide treatment to patients.
Druggable Biomarker Identification
Because diseases are driven by complex biological processes, biomarkers play a key role in helping researchers better understand and navigate these processes. From these deeper insights, pharmaceutical and biotech companies can design therapeutics to stop the disease process and potentially cure it.
By mining large-scale biomedical data, ML can help identify these biomarkers and accurately predict genotype-phenotype associations. Probing models trained on complex patient data can uncover potential biomarkers and identify patterns related to disease mechanisms that may not be feasible to find through manual processing and analysis. Some key tasks related to biomarker identification via ML applications include the following:
- Variant calling, the first step before relating genotypes to diseases, is used to specify which genetic variants are present in each individual’s genome from sequencing.
- Prioritizing pathogenic variants within an entire variant set, which can include at least 1 million variants per person, can potentially lead to disease targets. ML approaches can help either by predicting pathogenicity from a set of features for a single variant or by using each genome profile as a data point to predict disease risks from that profile.
- An ML-based model can be trained for rare disease detection if sufficient data from patients with a rare disease and suitable controls exist. By formulating rare disease detection as a classification task, ML can help identify whether a patient has a rare disease from their genomic sequence and other insights, including EHR data.
- As many diseases are driven by a specific set of genes forming pathways, it is useful to perform pathway analysis to identify these gene sets to have a more complete understanding of disease mechanisms.
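One common way to approach the pathway analysis task above is an over-representation test: given a list of genes flagged by a model or experiment, ask whether a known pathway's genes appear in that list more often than chance would predict. The sketch below uses a one-sided hypergeometric test and is only an illustration, not the method used in the cited review. The gene identifiers, set sizes, and function names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def pathway_enrichment(background, pathway, selected):
    """Return (overlap size, p-value) for one pathway gene set."""
    background = set(background)
    pathway = set(pathway) & background      # ignore genes outside the background
    selected = set(selected) & background
    overlap = len(pathway & selected)
    p_value = enrichment_p_value(
        len(background), len(pathway), len(selected), overlap
    )
    return overlap, p_value


# Hypothetical toy data: 20 background genes, a 5-gene pathway,
# and 4 genes flagged as disease-associated.
background = {f"GENE{i}" for i in range(1, 21)}
pathway = {"GENE1", "GENE2", "GENE3", "GENE4", "GENE5"}
selected = {"GENE1", "GENE2", "GENE3", "GENE12"}

overlap, p_value = pathway_enrichment(background, pathway, selected)
print(f"overlap={overlap}, p-value={p_value:.4f}")
```

In practice the test would be repeated across many pathways, which calls for a multiple-testing correction that this sketch omits.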
Therapeutic Discovery
After identifying a treatment’s target, therapeutic discovery is the next step. In this stage, the goal is to design a therapy (a small molecule, antibody, gene therapy, or another modality) that controls the disease target and blocks its pathway. This can include numerous phases and layers of tasks to help ensure the treatment is safe and effective.
There are several ways the use of an accurate ML model can help sponsors in this stage of drug development, including the following:
- Identify new molecules faster, reducing development time by de-risking the research aspects of finding new therapies.
- Better predict a therapy’s response across a variety of cell lines in silico, or in virtual cohorts for testing purposes, to guide smarter decision making.
- Greatly narrow the drug screening space and reduce experimental costs and operational resources.
- Improve the probability of clinical trial success, as uncovered therapy response insights inform better protocol design.
- Enable the design of various gene therapies.
Additionally, combination therapies can modulate multiple targets to provide a novel mechanism of action in cancer treatments, and because each therapy is given at a reduced dosage, they may also reduce adverse effects for the patient. However, screening the entire space of possible drug combinations is not feasible experimentally. ML models that can predict the response to a drug combination given the genomic profile of a cell line can therefore be valuable.
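A quick count shows why exhaustive combination screening becomes infeasible. The sketch below multiplies the number of drug combinations by the number of dose settings and cell lines. The library size, number of cell lines, and dose levels are hypothetical figures chosen only for illustration.

```python
from math import comb


def screening_experiments(n_drugs, combo_size, n_cell_lines, n_dose_levels=1):
    """Count the experiments needed to test every combination exhaustively."""
    n_combinations = comb(n_drugs, combo_size)   # unordered drug combinations
    dose_settings = n_dose_levels ** combo_size  # one dose level per drug
    return n_combinations * dose_settings * n_cell_lines


# Hypothetical library of 1,000 compounds screened on 100 cell lines.
pairs = screening_experiments(n_drugs=1000, combo_size=2, n_cell_lines=100)
pairs_with_doses = screening_experiments(1000, 2, 100, n_dose_levels=5)
triples = screening_experiments(n_drugs=1000, combo_size=3, n_cell_lines=100)

print(f"pairs: {pairs:,}")
print(f"pairs at 5 dose levels each: {pairs_with_doses:,}")
print(f"triples: {triples:,}")
```

Even under these modest assumptions, pairwise screening alone runs to tens of millions of experiments, which is the gap a predictive model is meant to narrow.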
Clinical Studies
As previously noted, the comprehensive literature review conducted by several life sciences professionals and published in the journal Patterns identifies three areas of focus for genomics in clinical trials:
- Animal-to-human translation
- Cohort curation
- Causal effects
ML research regularly addresses domain adaptation challenges such as animal-to-human translation. Using ML to learn how phenotypic responses in mice translate to responses in humans can greatly reduce the failure rate of early-phase clinical trials.
Through ML, study teams can distinguish important factors for the primary endpoints and quickly identify them in patients by predicting patient profiles that will respond to treatment. ML-based approaches have been tackling this problem at a cellular and molecular level. Also, to address the problem of identifying the right patients for the right trials, automated patient-trial matching using ML models is worth considering for sponsors seeking to improve enrollment, because it takes heterogeneous patient data and trial eligibility criteria into account.
Mendelian randomization is a method that uses measured variation in genes of known function to evaluate the causal effect of a modifiable exposure on a disease. If a gene is associated with the exposure, and with the outcome only through the exposure, it can serve as an instrumental variable to simulate randomization. This method can help sponsors bypass clinical trials altogether, add support for trials, and/or validate drug targets. ML methods are showing promise over more traditional regression approaches.
More advanced ML applications and causal inference methods, however, are challenging. For example, genes can associate with the outcome through another pathway beyond exposure, which then requires customized probabilistic models and a larger sample size for statistically significant estimation.
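To make the instrumental-variable idea behind Mendelian randomization concrete, the sketch below computes the simplest such estimate, the Wald ratio: the gene-outcome covariance divided by the gene-exposure covariance. It runs on simulated data in which an unmeasured confounder biases an ordinary regression. All effect sizes, the allele frequency, and the function names are hypothetical. The simulation also assumes the gene affects the outcome only through the exposure, which is exactly the assumption the paragraph above warns can fail.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def wald_ratio(genotype, exposure, outcome):
    """Instrumental-variable estimate of the exposure's effect on the outcome."""
    return covariance(genotype, outcome) / covariance(genotype, exposure)


def naive_slope(exposure, outcome):
    """Ordinary regression slope of outcome on exposure, ignoring confounding."""
    return covariance(exposure, outcome) / covariance(exposure, exposure)


def simulate_cohort(n, true_effect, seed=0):
    """Simulated cohort (hypothetical numbers) with an unmeasured confounder."""
    rng = random.Random(seed)
    genotype, exposure, outcome = [], [], []
    for _ in range(n):
        g = int(rng.random() < 0.3) + int(rng.random() < 0.3)  # 0, 1, or 2 allele copies
        u = rng.gauss(0, 1)                        # confounder of exposure and outcome
        x = 1.0 * g + u + rng.gauss(0, 1)          # exposure depends on genotype
        y = true_effect * x + u + rng.gauss(0, 1)  # genotype reaches y only through x
        genotype.append(g)
        exposure.append(x)
        outcome.append(y)
    return genotype, exposure, outcome


genotype, exposure, outcome = simulate_cohort(n=20000, true_effect=0.5)
mr_estimate = wald_ratio(genotype, exposure, outcome)
naive_estimate = naive_slope(exposure, outcome)
print(f"Wald ratio: {mr_estimate:.2f}, naive regression: {naive_estimate:.2f}")
```

In this simulation the confounder should pull the naive slope well above the true effect of 0.5, while the Wald ratio should land close to it. With real data, the estimate is only as trustworthy as the instrument assumptions.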
Post-Market Study
Once a treatment is approved for marketing and commercialization, pharmaceutical and biotech companies continue to monitor its efficacy and safety in clinical practice through numerous studies.
Because these studies hold important information about the treatment that was not evident before regulatory approval, ML models can help mine large collections of text and pinpoint useful signals for post-market surveillance. This includes documentation generated in EHRs, insurance billing systems, and more, which is considered real-world data. Structured EHR data alone may not give the full picture of the patient experience, including how patients with varying characteristics respond to the treatment. Clinical notes from patient visits also need to be considered but can be difficult for study teams to review manually. ML can help automate clinical note data extraction to secure critical patient insights, allowing study teams and sponsors to delve deeper into treatment use and the patient experience.
OUTSTANDING CHALLENGES TO AI FOR GENOMICS
When used appropriately, ML can transform the use of genomics data in drug development. However, as with any solution, there are several key factors that sponsors need to consider and address to ensure success in use.
For one, ML can leverage genomics data successfully when the training and deployment data follow the same data distribution. However, data distribution shifts have long been a challenge in ML use. For example, when predicting human response from animal response, we must deliberately teach the algorithms what information translates from one domain to another. Also, there are typically only a few drug response data points for new treatments, so technologists and clinical teams have to determine how to make an ML model learn when only a few examples are available.
In terms of racial bias in training data, it has been shown that ML models do not always translate well across all subpopulations. Models that perform well on the discovery patient population generally have much lower accuracy and are not adequate predictors in other populations. Because most discovery is performed with European-ancestry cohorts, predictive models may exacerbate health disparities, as they will be unavailable to, or have lower value for, African and Hispanic ancestry populations. This level of imbalance for minority patient populations requires specialized ML techniques. As one solution, fairness in ML is defined as making the prediction independent of protected variables, such as race, gender, and sexual orientation, and recent works have proposed ways to ensure this standard in the clinical ML domain.
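One simple way to check whether predictions are independent of a variable such as race is to compare the rate of positive predictions across groups, a criterion often called demographic parity. The sketch below computes the largest gap between groups. The predictions and group labels are hypothetical, and this is only one of several fairness criteria discussed in the literature.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for prediction, group in zip(predictions, groups):
        counts[group][0] += int(prediction == 1)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs (1 = predicted responder) for two ancestry groups.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = positive_rate_by_group(predictions, groups)
gap = demographic_parity_gap(predictions, groups)
print(rates, gap)
```

A gap near zero suggests the predictions are roughly independent of group membership. A large gap flags a model that warrants closer review before clinical use.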
Lastly, given the amount of genomics data generated on a daily basis, ML models can help with data aggregation and annotation. However, sponsors need to keep data privacy compliance in mind, as these data contain sensitive patient information and cannot be shared directly. Techniques to anonymize and de-identify these data using differential privacy can potentially enable sharing of genomics data. Recent advances in federated learning techniques allow ML models to be trained across data held at multiple sites without sharing the underlying data.
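As a small illustration of differential privacy, the sketch below releases a cohort-level count, such as the number of patients carrying a given variant, after adding Laplace noise scaled to a privacy budget epsilon. The count, the epsilon value, and the function names are hypothetical. Protecting full genomic records is considerably harder than protecting a single summary statistic, so this should be read as a sketch of the mechanism rather than a complete solution.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    sensitivity=1.0 assumes adding or removing one patient changes the
    count by at most one.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(1)
true_carriers = 42  # hypothetical number of variant carriers in a cohort
released = private_count(true_carriers, epsilon=0.5, rng=rng)
print(f"released count: {released:.1f}")
```

A smaller epsilon adds more noise and gives stronger privacy. Choosing it is a policy decision as much as a technical one.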
WHERE ML CAN TAKE THERAPEUTIC DEVELOPMENT
Though ML applications to aid in therapeutics development will have their challenges, technical and otherwise, we are able to see how ML models using genomics data can help us better understand therapeutic tasks. Through a variety of ML applications, sponsors and study teams can dive deeper into tangible ways to personalize medicines. As such, the variety of ML models using genomics data will only grow and diversify, leading to more breakthroughs in drug discovery and development and further personalizing medicine for patients in need.
REFERENCE
- Huang K, Xiao C, Glass L, et al. Machine learning applications for therapeutic tasks with genomics data. Patterns. October 2021. https://www.sciencedirect.com/science/article/pii/S2666389921001768
Lucas Glass is the Vice President of the Analytics Center of Excellence (ACOE) at IQVIA. The ACOE is a team of over 200 data scientists, engineers, and product managers that research, develop, and operationalize Machine Learning and data science solutions within the R&D space. He has launched more than a dozen Machine Learning offerings within R&D, such as site recommender systems, trial matching solutions, enrollment rate algorithms, drug target interactions, drug repurposing, and molecular optimization. His Machine Learning research, which is dedicated to R&D, has been published by AAAI, WWW, NIPS, ICML, JAMIA, KDD, and many others.