MACHINE LEARNING - Applying Machine Learning Techniques: Gaining Meaningful Life Sciences Insights From Genomics Data


Advances in high-throughput technologies and data management systems have given us access to a vast and varied collection of biomedical datasets, including genomics data. The massive amounts of genomics data now being generated create a tremendous opportunity to gain meaningful insights and to personalize medicine to a patient’s particular genomic makeup. However, genomics data is complex, and the data alone will not advance therapeutic development toward personalized medicine. For example, to pinpoint the right disease target, we need to understand the entire suite of biological processes involved. Personalizing therapies requires accurately classifying disease subtypes and characterizing how sensitive different genomic profiles are to an investigational compound. Machine Learning (ML) offers a useful set of tools to glean these valuable insights from genomics data.

We organize our perspective on this complex space in two ways: by how genomics data interacts with other types of data, including compounds, proteins, electronic health records (EHRs), cellular images, and text, and by the stage of the R&D lifecycle in which experts apply ML to genomics data. Given the scale and variability of the data, experts working manually may miss novel patterns that ML can help pinpoint. Finding those patterns can improve prediction on a variety of tasks, such as drug response prediction, and can also lead to the discovery of novel biological insights. Because therapy discovery often consists of large, resource-intensive experiments with a limited scope, many potential therapies may be missed. The predictive capabilities offered by ML solutions can help pharmaceutical and biotech companies shift focus onto additional experiments, providing an opportunity to catch or generate potential options.

As companies look for ways to unlock the potential of assets they may consider developing through genomics data, understanding how ML can help enable the process is critical.


By helping to identify patterns across these varying interactions and to extract insights from genomics data, ML applications can support faster and more effective drug development, and they can be applied throughout the entire therapeutic lifecycle. We share a few key applications below, based on a systematic literature review published in the October 2021 issue of the data science journal Patterns.

Target Discovery
Pharmaceutical and biotech companies focus on discovering novel disease targets to determine how an investigational treatment might home in on a molecule and produce a therapeutic effect, such as inhibition, that ultimately blocks the disease process altogether. Relying heavily on the basics of human biology, these companies use target discovery to identify target biomarkers, which aid in designing therapeutics that could stop the disease pathway and provide treatment to patients.

Druggable Biomarker Identification
Because diseases are driven by complex biological processes, biomarkers play a key role in helping researchers better understand and navigate these processes. It is from these deeper insights that pharmaceutical and biotech companies can design therapeutics to stop the disease process and potentially cure it.

By mining large-scale biomedical data, ML can help identify these biomarkers and accurately predict genotype-phenotype associations. Probing models trained on complex patient data can uncover potential biomarkers and identify patterns related to disease mechanisms that may not be feasible to find through manual processing and analysis. Some key tasks related to biomarker identification via ML applications include the following:

  • Variant calling, the first step before relating genotypes to diseases, specifies which genetic variants are present in each individual’s genome based on sequencing data.
  • Prioritization of pathogenic variants from an entire variant set, which can include at least 1 million variants per person, can potentially lead to disease targets. ML approaches can help either by predicting pathogenicity from a set of features for a single variant or by treating each genome profile as a data point and predicting disease risk from that profile.
  • An ML-based model can be trained for rare disease detection if sufficient data exist from patients with a rare disease and from suitable controls. By formulating rare disease detection as a classification task, ML can help identify whether a patient has a rare disease from their genomic sequence and other information, including EHR data.
  • Because many diseases are driven by a specific set of genes forming pathways, it is useful to perform pathway analysis to identify these gene sets and gain a more complete understanding of disease mechanisms.
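
To make the pathway analysis task in the list above more concrete, the following is a minimal sketch of one common, non-ML baseline: an over-representation test that asks whether the genes flagged by an upstream analysis overlap a known pathway more than chance would predict, using the hypergeometric distribution. The gene identifiers, set sizes, and the function name enrichment_p_value are hypothetical and chosen only for illustration; they do not come from the review discussed in this article.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_hits, n_overlap):
    """Probability of seeing at least n_overlap pathway genes among the hits
    if the hits were drawn at random from the background (hypergeometric tail).

    n_background: total number of genes measured
    n_pathway:    background genes annotated to the pathway
    n_hits:       genes flagged by the upstream analysis
    n_overlap:    flagged genes that belong to the pathway
    """
    total = comb(n_background, n_hits)
    upper = min(n_pathway, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical gene identifiers, for illustration only.
background = {f"GENE{i}" for i in range(1, 101)}   # 100 measured genes
pathway = {f"GENE{i}" for i in range(1, 11)}       # 10 genes in one pathway
hits = {f"GENE{i}" for i in range(1, 6)} | {"GENE50", "GENE60", "GENE70"}  # 8 flagged genes

overlap = hits & pathway
p = enrichment_p_value(len(background), len(pathway), len(hits), len(overlap))
print(f"overlap={len(overlap)} genes, p={p:.2e}")
```

A small p-value suggests the flagged genes concentrate in the pathway. In practice, many pathways are tested at once, so a multiple-testing correction would normally follow.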

Therapeutic Discovery
Once a treatment’s target has been identified, therapeutic discovery is the next step. The key task in this stage is to design a therapy (a small molecule, antibody, gene therapy, or another modality) that controls the disease target and blocks its pathway. This can include numerous phases and layers of tasks to help ensure the treatment is safe and effective.

There are several ways the use of an accurate ML model can help sponsors in this stage of drug development, including the following:

  • Identify new molecules faster, reducing development time by de-risking the research involved in finding new therapies.
  • Better predict a therapy’s response in a variety of cell lines in silico, or in virtual cohorts for testing purposes, to guide smarter decision making.
  • Greatly narrow down the drug screening space and reduce experimental costs and operational resources.
  • Improve the probability of clinical trial success, as therapy response insights are uncovered, by giving greater visibility into better protocol design.
  • Enable the design of various gene therapies.

Additionally, combination therapies can modulate multiple targets to provide a novel mechanism of action in cancer treatment, and because each therapy is given at a reduced dosage, they can also reduce adverse effects for the patient. Screening the entire space of possible drug combinations is not feasible experimentally, so ML models that predict the response to a drug combination given the genomic profile of a cell line can be valuable.
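
A quick back-of-the-envelope calculation shows why exhaustive combination screening is impractical. The sketch below counts the wells needed to test every drug pair on every cell line over a full dose-by-dose grid, without replicates. The library size, number of cell lines, and number of doses are assumed values chosen for illustration, not figures from the review.

```python
from math import comb

def pairwise_screen_size(n_drugs, n_cell_lines, n_doses_per_drug):
    """Wells needed to test every unordered drug pair on every cell line,
    crossing each dose of the first drug with each dose of the second."""
    n_pairs = comb(n_drugs, 2)
    return n_pairs * n_cell_lines * n_doses_per_drug ** 2

# Illustrative numbers only (assumptions, not taken from the article).
wells = pairwise_screen_size(n_drugs=1000, n_cell_lines=100, n_doses_per_drug=5)
print(f"{wells:,} wells")
```

Even under these modest assumptions the count exceeds one billion measurements, which is why a predictive model that ranks a short list of promising pairs for experimental follow-up can be attractive.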

Machine learning applications for therapeutic tasks with genomics data.


Clinical Studies
In the comprehensive literature review noted above, which was conducted by several life sciences professionals and published in the journal Patterns, there are three areas of focus for genomics in clinical trials:

  • Animal-to-human translation
  • Cohort curation
  • Causal effects

Domain adaptation challenges, such as animal-to-human translation, are an active area of ML research. Using ML to learn how phenotypic responses in mice translate to responses in humans could greatly reduce the failure rate of early-phase clinical trials.

Through ML, study teams can identify the factors that matter most for the primary endpoints and quickly find them in patients by predicting which patient profiles will respond to treatment. ML-based approaches have been tackling this problem at the cellular and molecular level. To address the problem of identifying the right patients for the right trials, sponsors should also consider automated patient-trial matching, in which ML models take heterogeneous patient data and trial eligibility criteria into account to improve enrollment.

Mendelian randomization is a method that uses measured variation in genes of known function to evaluate the causal effect of a modifiable exposure on a disease. If a gene is associated with the exposure and influences the outcome only through that exposure, the gene can serve as an instrumental variable that simulates randomization. This method can help sponsors bypass clinical trials altogether, add support for trials, and/or validate drug targets. ML methods are showing promise over more traditional regression approaches.

More advanced ML applications and causal inference methods, however, are challenging. For example, genes can be associated with the outcome through a pathway other than the exposure, which then requires customized probabilistic models and a larger sample size for statistically significant estimation.
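
The instrumental-variable idea behind Mendelian randomization can be illustrated with a small simulation. The sketch below uses the classical Wald ratio estimator (the gene-outcome association divided by the gene-exposure association), which is one of the traditional approaches rather than one of the ML methods mentioned above. All effect sizes, the allele frequency, and the sample size are assumptions made for the simulation, and the simulated gene affects the outcome only through the exposure.

```python
import random

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def simulate_cohort(n, true_effect, seed=0):
    """Simulate genotype, exposure, and outcome with an unmeasured confounder."""
    rng = random.Random(seed)
    genotype, exposure, outcome = [], [], []
    for _ in range(n):
        g = int(rng.random() < 0.3) + int(rng.random() < 0.3)  # 0, 1, or 2 allele copies
        u = rng.gauss(0, 1)                        # confounder of exposure and outcome
        x = 1.0 * g + u + rng.gauss(0, 1)          # exposure depends on gene and confounder
        y = true_effect * x + u + rng.gauss(0, 1)  # gene reaches the outcome only through x
        genotype.append(g)
        exposure.append(x)
        outcome.append(y)
    return genotype, exposure, outcome

TRUE_EFFECT = 0.3  # assumed causal effect of exposure on outcome
g, x, y = simulate_cohort(n=20000, true_effect=TRUE_EFFECT)

naive = covariance(x, y) / covariance(x, x)  # regression slope, biased by the confounder
wald = covariance(g, y) / covariance(g, x)   # gene-outcome effect over gene-exposure effect

print(f"true={TRUE_EFFECT}, naive={naive:.2f}, wald_ratio={wald:.2f}")
```

In this setup, the naive regression slope overstates the effect because the confounder drives both exposure and outcome, while the Wald ratio lands close to the assumed true value. If the gene also influenced the outcome through another pathway, the ratio would no longer be a valid estimate, which is the difficulty described above.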

Post-Market Study
Once a treatment is approved for marketing and commercialization, pharmaceutical and biotech companies continue to monitor its efficacy and safety in clinical practice through numerous studies.

Because these studies hold important information about the treatment that was not evident before regulatory approval, ML models can help mine large collections of text and pinpoint useful signals for post-market surveillance. The sources include documentation generated in EHRs, insurance billing systems, and more, which together are considered real-world data. Analyzing EHR data alone to extract key insights about the treatment, including responses from patients with varying characteristics, may not capture the full picture of the patient experience. Clinical notes from patient visits also need to be considered but can be difficult for study teams to sift through manually. ML can help automate the extraction of data from clinical notes to capture critical patient insights, allowing study teams and sponsors to delve deeper into treatment use and the patient experience.



When used appropriately, ML can transform the use of genomics data in drug development. However, as with any solution, there are several key factors that sponsors need to consider and address to use it successfully.

First, ML can leverage genomics data successfully when the training and deployment data follow the same distribution, but shifts in data distribution have long been a challenge in ML. For example, when predicting human response from animal response, we must deliberately teach the algorithms which information translates from one domain to the other. In addition, there are typically only a few drug response data points for new treatments, so technologists and clinical teams have to determine how to make an ML model learn when only a few examples are available.

In terms of racial bias in training data, it has been shown that ML models do not always translate well across all subpopulations. Models that perform well on the discovery patient population generally have much lower accuracy and are not adequate predictors in other populations. And because most discovery is performed with European-ancestry cohorts, predictive models may exacerbate health disparities, as they will not be available for, or will have lower value for, African- and Hispanic-ancestry populations. This level of imbalance for minority patient populations requires specialized ML techniques. One approach is to define fairness in ML as making predictions independent of variables such as race, gender, and sexual orientation, and recent work has proposed methods to enforce this standard in the clinical ML domain.
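
One simple way to surface the kind of performance gap described above is to report accuracy separately for each subpopulation rather than as a single overall number. The sketch below does this with toy labels, predictions, and group names invented for illustration. It is a diagnostic check only and does not implement the specialized fairness techniques referred to above.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Toy data, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"gap={gap:.2f}")
```

In this toy example the overall accuracy hides the fact that the smaller group is served much worse than the larger one, which mirrors the imbalance concern raised for under-represented ancestry groups.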

Lastly, given the amount of genomics data generated on a daily basis, ML models can clearly help with data aggregation and annotation. However, sponsors need to keep data privacy compliance in mind, as these data contain sensitive patient information and cannot be shared directly. Techniques that anonymize and de-identify these data using differential privacy can potentially enable sharing of genomics data. Recent advances in federated learning allow an ML model to be trained across data held at multiple sites without the underlying data being shared.
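
The federated learning idea can be sketched in a few lines. In the simplified example below, which follows the general pattern of federated averaging, each site improves a shared model on its own private data and sends back only the updated model weight, and a coordinator averages those weights in proportion to each site's sample count. The one-parameter linear model, the two sites, and their data are all hypothetical. A real deployment would also need secure communication and, typically, additional privacy protections.

```python
def local_update(weight, xs, ys, learning_rate=0.05, steps=5):
    """Run a few gradient-descent steps for y = weight * x on one site's data."""
    for _ in range(steps):
        gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        weight -= learning_rate * gradient
    return weight

def federated_average(sites, rounds=20):
    """Train a shared weight; only weights and sample counts leave each site."""
    global_weight = 0.0
    total_samples = sum(len(xs) for xs, _ in sites)
    for _ in range(rounds):
        updates = [(local_update(global_weight, xs, ys), len(xs)) for xs, ys in sites]
        global_weight = sum(w * n for w, n in updates) / total_samples
    return global_weight

# Hypothetical private datasets at two sites, both generated from y = 2 * x.
site_a = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
site_b = ([0.5, 1.5], [1.0, 3.0])

learned_weight = federated_average([site_a, site_b])
print(f"learned weight: {learned_weight:.4f}")
```

The coordinator never sees the individual records at either site, yet the shared weight converges toward the value that generated both datasets.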


Though ML applications that aid therapeutics development will have their challenges, technical and otherwise, we can see how ML models using genomics data can help us better understand therapeutic tasks. Through a variety of ML applications, sponsors and study teams can dive deeper into tangible ways to personalize medicines. As such, the variety of ML models using genomics data will only grow and diversify, leading to more breakthroughs in drug discovery and development and further personalizing medicine for patients in need.


  1. Huang K, Xiao C, Glass L, et al. Machine learning applications for therapeutic tasks with genomics data. Patterns. October 2021.

Lucas Glass is the Vice President of the Analytics Center of Excellence (ACOE) at IQVIA. The ACOE is a team of over 200 data scientists, engineers, and product managers that research, develop, and operationalize Machine Learning and data science solutions within the R&D space. He has launched more than a dozen Machine Learning offerings within R&D, such as site recommender systems, trial matching solutions, enrollment rate algorithms, drug target interactions, drug repurposing, and molecular optimization. His Machine Learning research, which is dedicated to R&D, has been published by AAAI, WWW, NIPS, ICML, JAMIA, KDD, and many others.