Advanced analytics: Powering pharma’s most critical data mission

How analytics-driven insights can help drugmakers navigate the leap from trial to market—and add clinical and financial value at every step

Drug-development efforts are always fraught with promise and peril. Today, more than half of all investigational drugs in Phase III clinical trials fail [1], due to factors such as flawed design, inappropriate endpoints, safety issues, lack of funding, failure to demonstrate clinical efficacy and under-enrollment. Delays create economic consequences for drugmakers in terms of sunk costs and foregone earnings, and clinical consequences for physicians and patients who would have potentially benefitted from the novel therapy.

Meanwhile, clinical study costs continue to skyrocket. A recent study [2] that examined 138 clinical trials that supported the FDA approval of 59 new therapies found that the median estimated cost per trial was $19 million. Digging deeper, those studies that had no control arm incurred a mean estimated cost of $13.5 million per trial, while for those that included a placebo-controlled or comparator-drug study arm, the mean cost per trial was $35.1 million.

Against the backdrop of high failure rates and untenable clinical trial costs, drug developers are sharpening their focus on the promise of advanced data-analytics methodologies. At the same time, there is growing willingness among regulators and payers to evaluate a strong data-driven value case when considering regulatory submissions—for new medications and label expansions—and negotiating drug pricing and reimbursement contracts and formulary status (for recent examples from pharma and biotech, see sidebar further down page).

Improving the odds of clinical and commercial success

“From an accessibility point of view, there is no single place where drug developers can get all the real-world data (RWD) they would potentially need,” says Fernando Schwartz, PhD, AVP, global data science head, commercial, Merck & Co. Instead, drug developers must buy RWD from many disparate sources.

In this era of pervasively computerized healthcare, RWD that reflects both patient-level and population-level insights can be found in:

  • Electronic health records (EHR) systems
  • Billing and claims records related to healthcare and pharmacy activities
  • Disease- and drug-specific patient registries
  • Medical-imaging and lab-testing databases
  • Wearable medical-monitoring and consumer health-tracker devices
  • Patient-reported outcomes (PROs)
  • Genomics data
  • Information collected during telemedicine encounters
  • Clinical findings from prior trials

Historically, such repositories of RWD were thought to be too vast or too complex to be used effectively. But today, advanced data-analytics and modeling capabilities are able to explore and exploit the insights that are buried there, and enable relevant analyses, simulations and predictions to be carried out, yielding actionable, evidence-based findings, or real-world evidence (RWE).

“Collectively, efforts to build and exploit predictive algorithms (that can benefit from a longitudinal view through all aspects of a patient’s treatment and disease progression) offer a lot of promise, and early successes suggest a strong and growing future role for them. But there have also been a lot of strong claims out there with regard to what all of these in silico efforts can produce, and much of the full potential has yet to be seen,” says Schwartz. “Nonetheless, forward-looking drug developers are setting themselves up to explore these data-analytics technologies to fully understand the molecule and its potential clinical and financial impact.”

“Having data is not enough—the trick is to derive insights from the data, and that is where the bulk of the innovation is happening today,” adds Roman Casciano, senior vice president and general manager, Certara Evidence and Access. This helps drug developers to better understand the impact that factors such as the distribution of disease severity have on the measurable differences in treatment effect between the study arms, and develop predictions related to outcomes and probability of success as a function of trial design, he notes.

The modeling toolbox

Modeling algorithms are first established and trained by humans, and then validated by advanced data-science methodologies, including artificial intelligence (AI), machine learning (ML), natural language processing (NLP) and others. Such algorithms are able to analyze massive data sets, find patterns, evaluate competing trial-enrollment criteria and endpoints, and identify safety signals and other anomalies earlier in the drug-development process.

“It is truly the upside and downside of a free-market society: competition leads to much more innovation when it comes to commercially available sources of valuable RWD,” says Ed Ikeguchi, CEO of AiCure. “But it creates persistent interoperability issues that must be reconciled if we are to ever make full use of the rich RWD sources available.”

“With so much fragmentation in the realm of data generators and data aggregators today, you literally cannot lay the data side by side,” adds Schwartz. “It’s like trying to assemble a jigsaw puzzle where some pieces are built from cardboard while others are built from wood or metal. This complicates the way you can build models.”

Today, considerable time and effort are being expended to clean and harmonize RWD aggregated from disparate sources, and each data set requires labor-intensive custom onboarding. “Such data must be linked through a common token while maintaining HIPAA compliance, and then standardized, normalized and integrated into a data set that can drive business analytics that are actionable,” explains Theresa Greco, chief commercial officer for Prognos Health, whose platform allows clients to securely analyze billions of HIPAA-compliant laboratory-testing and health records related to more than 325 million de-identified patients. “Such efforts can index and process billions of records in seconds, making it possible to provide the answer to key healthcare questions in minutes instead of months,” she says.
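The idea of linking records "through a common token" can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the keyed-hash approach, the placeholder secret key and the toy records are hypothetical, and the sketch does not describe how Prognos Health or any other vendor actually builds tokens. A keyed hash on its own is also not sufficient to establish HIPAA compliance.

```python
import hashlib
import hmac

# Hypothetical shared secret. A real system would manage this key securely.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical direct identifiers used to derive the linkage token.
IDENTIFIER_FIELDS = ("first_name", "last_name", "birth_date")


def make_token(record: dict) -> str:
    """Derive a linkage token from normalized identifiers using a keyed hash."""
    normalized = "|".join(str(record[f]).strip().lower() for f in IDENTIFIER_FIELDS)
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records: list) -> list:
    """Replace direct identifiers with a token and keep the remaining fields."""
    return [
        {"token": make_token(r),
         **{k: v for k, v in r.items() if k not in IDENTIFIER_FIELDS}}
        for r in records
    ]


def link(source_a: list, source_b: list) -> list:
    """Merge tokenized records from two sources that share the same token."""
    by_token = {r["token"]: r for r in source_b}
    return [{**r, **by_token[r["token"]]} for r in source_a if r["token"] in by_token]


# Made-up records from two hypothetical sources, with inconsistent formatting.
lab_records = [
    {"first_name": "Ana", "last_name": "Lopez", "birth_date": "1970-01-02",
     "hba1c_percent": 7.9},
]
claims_records = [
    {"first_name": " ana", "last_name": "LOPEZ ", "birth_date": "1970-01-02",
     "diagnosis_code": "E11.9"},
    {"first_name": "Ben", "last_name": "Kim", "birth_date": "1985-06-30",
     "diagnosis_code": "I10"},
]

linked = link(tokenize(lab_records), tokenize(claims_records))
```

Normalizing the identifiers before hashing (trimming whitespace, lowercasing) is what lets the two differently formatted records for the same person resolve to the same token. The second claims record has no counterpart in the lab source and is left out of the linked set.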

According to Aaron Galaznik, MD, Head, RWE Solutions, Acorn AI by Medidata, the company is able to analyze a million data points and identify 4,000 data patterns in less than an hour, “with the combination of smart data engineering, extensible cloud-based solutions and advanced analytic capabilities.”

Meanwhile, patient centricity has emerged as the rallying cry in drug development these days. Examining PROs and endpoints helps drug sponsors understand not just the patients’ unmet medical needs but their preferences and values (related to quality of life), too. By creating clinical outcome assessments (COA) at scale, based on RWD, drug developers can ensure that the patient perspective is pervasive in all drug-development and commercialization efforts. Today the industry is just scratching the surface on which components of the COA should be added to trials.

Addressing clinical trial deficiencies

While no one disputes the inherent value of the randomized controlled trial (RCT) process—which is routinely referred to as the gold standard for drug development—many would argue that the process is deeply flawed by design, and should no longer be the only method by which investigational therapies move from the bench to the bedside.

RCTs routinely use strict inclusion and exclusion criteria (based on age, gender, racial/ethnic status, pregnancy, the presence of comorbidities and genetic status, such as the presence or absence of specific biomarkers that correlate with clinical outcomes) to create an ideal, homogeneous group of trial participants. However, such strict gatekeeping leaves many patient subgroups out in the cold and narrows the generalizability of the clinical findings in real-world healthcare settings.

In addition, a variety of structural barriers related to race/ethnicity, age, gender, educational and socioeconomic status and literacy challenges often limit patient access to healthcare in general, and trial opportunities in particular. Persistent diversity issues during trial enrollment result in serious gaps in the knowledge base, leaving physicians to extrapolate the RCT findings at the point of care. “Over time, these disparities often lead to lower-than-expected clinical efficacy, uptake and adherence, and ultimately worse outcomes in the real world once new drugs and medical products launch,” says Galaznik.

Data-driven studies can help fill in the missing pieces and greatly expand the findings in the trial to make them more relevant—at larger scale—to the heterogeneous patient subgroups in the real world, says Dan Riskin, MD, CEO of Verantos. Two types of RWE studies are particularly useful to address persistent diversity challenges in RCTs:

1. Comparative effectiveness studies. These compare the clinical effectiveness of several therapies, in order to identify which performs better within specific patient subgroups. The findings can help to differentiate a therapy’s performance in a crowded therapeutic space.

2. Subgroup analytics. These involve studies of how a given therapy performs within a targeted patient subgroup, to develop new clinical and safety insights that can inform clinical care and address important knowledge gaps.
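The two study types above can be sketched together in a few lines of Python. The records below are made-up toy data, and the therapy names, subgroup labels and `responded` flag are hypothetical. The sketch computes unadjusted response rates only; a real RWE study would also need to account for confounding and sample size.

```python
from collections import defaultdict

# Made-up toy records, for illustration only.
patients = [
    {"therapy": "A", "subgroup": "age_65_plus", "responded": True},
    {"therapy": "A", "subgroup": "age_65_plus", "responded": True},
    {"therapy": "A", "subgroup": "age_65_plus", "responded": False},
    {"therapy": "B", "subgroup": "age_65_plus", "responded": True},
    {"therapy": "B", "subgroup": "age_65_plus", "responded": False},
    {"therapy": "B", "subgroup": "age_65_plus", "responded": False},
    {"therapy": "A", "subgroup": "under_65", "responded": True},
    {"therapy": "A", "subgroup": "under_65", "responded": False},
    {"therapy": "B", "subgroup": "under_65", "responded": True},
    {"therapy": "B", "subgroup": "under_65", "responded": True},
]


def response_rates(records):
    """Subgroup analytics: response rate for each (subgroup, therapy) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [responders, total]
    for rec in records:
        key = (rec["subgroup"], rec["therapy"])
        counts[key][0] += int(rec["responded"])
        counts[key][1] += 1
    return {key: responders / total for key, (responders, total) in counts.items()}


def best_therapy_by_subgroup(rates):
    """Comparative effectiveness: therapy with the highest rate in each subgroup."""
    best = {}
    for (subgroup, therapy), rate in rates.items():
        if subgroup not in best or rate > best[subgroup][1]:
            best[subgroup] = (therapy, rate)
    return best


rates = response_rates(patients)
best = best_therapy_by_subgroup(rates)
```

In this toy data, therapy A has the higher response rate among patients aged 65 and over, while therapy B has the higher rate among those under 65. A pooled comparison would hide that difference between subgroups.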

“If investigators are able to compellingly show that certain subgroups fare better on their therapy (using high-validity subgroup analytics and comparative effectiveness studies), they can help to improve the standard of care for prescribers, improve clinical outcomes for patients, and help drug companies to differentiate their products in ways that can help to increase market share,” says Riskin.

To that end, it is critical for pharmaceutical companies to think through all of the opportunities—both clinical and financial—as they are designing their trials, and not focus only on the immediate outcome of the study. “Sometimes we see a trial protocol call for blood to be drawn to analyze for the presence or absence of one or more biomarkers—not because such information is required for existing treatment, but rather with the hope of being able to look back and mine that data later to identify unanticipated clinical correlations,” says Ikeguchi.

“Drug developers are increasingly having to generate more evidence, in addition to the RCT. This is particularly notable in oncology, for example, where standards of care are rapidly evolving, or where today’s relapsed patients are different from yesterday’s,” adds Akiko Shimamura, senior director of product, medical analytics and outcomes research, Acorn AI by Medidata.

Structured versus unstructured data

Today, large commercial companies pool and curate data from parallel sources, so that relevant data elements can be mined using appropriate eligibility criteria. In principle, that sounds simple; in practice, it is not.

Adding to the challenge is the fact that RWD typically exists in two formats: structured and unstructured. Structured data, such as the claims and billing data found in a payer or pharmacy database, and clinical data captured in an EHR system, is typically gathered and managed in a highly formatted structure. Such organizational logic makes it relatively easy to access and curate the information using data-mining techniques.

However, due to rigid coding requirements used to impose uniformity during data entry, the information contained in structured claims and billing databases may not adequately express the types of clinical details investigators need. “From a study standpoint for drug developers, the richer clinical details are those found in the form of unstructured physician notes that are housed within the EHR system,” says Schwartz.

While the type of unstructured information found in EHR systems and other sources does not fit neatly into highly formatted rows and columns, it provides a gold mine of clinical information to be prospected. However, the lack of structure, coupled with other confounding factors such as the use of medical jargon, shorthand, abbreviations and misspellings, has traditionally consigned unstructured data to the margins of the big data universe.

Fortunately, in recent years, ongoing advances in NLP, a subset of data analytics, have helped to address this challenge. “NLP techniques try to evaluate the vast amount of clinical information that resides in physician notes related to millions of patients and extract the most relevant clinical details to enable deeper clinical insights,” says Schwartz. NLP is also being used to automate literature reviews and glean clinical insights from notes collected by nursing call centers and other telemedicine encounters, databases of patient-reported outcomes and more.

“Our NLP processes, initially trained on large publicly available datasets and then fine-tuned on our specific dataset, are able to outperform humans,” says Adam Petranovich, chief data scientist, Prognos Health.

For instance, given an arbitrary anatomical pathology or molecular diagnosis report, “this process is able to automatically infer its interpretation, site, subsite and other meta information to be used or discovered by other systems,” he points out.

Synthetic control arms

The use of a placebo-controlled clinical study arm is a necessary yet controversial aspect of RCT design, driving up trial costs while undermining recruitment efforts.

In certain therapeutic categories, particularly oncology, the notion of randomly assigning some of the patient volunteers into a placebo-controlled study arm and thus squandering precious time is considered unethical. In other cases, such as those targeting rare diseases, it can be hard to find enough patients to populate the study and control arms. And while single-arm, open-label trials, where all patients are treated with the therapy being studied, can yield relevant findings, they provide no option for comparison to a placebo or to the standard of care.

To address these challenges, the use of synthetic control arms (a term trademarked by Medidata; the approach is also called an external control arm) is on the rise. Such studies give today’s advanced data-analytics methodologies based on big data a chance to shine.

Synthetic control arms are small, standalone studies that are conducted using de-identified, population-level data mined from various sources of RWD. They enable the use of pooled clinical trial data “that has rich covariates typically not available in the real world, and allow for side-by-side comparisons of the RCT population with the real-world population, to contextualize the trial and demonstrate relevance of the trial findings with the payer population,” explains Shimamura.

These external control arms, based on data analytics, serve as proxies for traditional control arms and can help to elucidate and expand the findings of the current trial and allow for head-to-head comparisons of the investigational drug versus the standard of care, while lowering costs, reducing patient burden and easing ethical concerns.
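One way to picture how an external control arm is assembled is as a matching problem: for each treated trial patient, find a comparable patient in an external data pool. The Python sketch below uses greedy nearest-neighbor matching on standardized covariates as a simple stand-in. It is not a description of Medidata's method, and real studies generally rely on more rigorous statistical approaches. The covariates (`age`, `ecog`), patient IDs and values are made up for illustration.

```python
import math


def covariate_scales(rows, keys):
    """Standard deviation of each covariate, used to put covariates on one scale."""
    scales = {}
    for k in keys:
        values = [r[k] for r in rows]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        scales[k] = math.sqrt(variance) or 1.0  # avoid dividing by zero
    return scales


def distance(a, b, keys, scales):
    """Euclidean distance between two patients on standardized covariates."""
    return math.sqrt(sum(((a[k] - b[k]) / scales[k]) ** 2 for k in keys))


def match_external_controls(trial_patients, external_pool, keys):
    """Greedy 1:1 matching without replacement; returns (trial_id, control_id) pairs."""
    scales = covariate_scales(trial_patients + external_pool, keys)
    available = list(external_pool)
    matches = []
    for patient in trial_patients:
        if not available:
            break
        nearest = min(available, key=lambda c: distance(patient, c, keys, scales))
        available.remove(nearest)
        matches.append((patient["id"], nearest["id"]))
    return matches


# Made-up patients: age in years and a hypothetical performance-status score.
trial_patients = [
    {"id": "T1", "age": 60, "ecog": 1},
    {"id": "T2", "age": 45, "ecog": 0},
]
external_pool = [
    {"id": "E1", "age": 44, "ecog": 0},
    {"id": "E2", "age": 72, "ecog": 2},
    {"id": "E3", "age": 61, "ecog": 1},
]

matches = match_external_controls(trial_patients, external_pool, ["age", "ecog"])
```

Here each trial patient is paired with the external patient closest in age and performance status, and the poorly matched external patient (E2) is left out of the control arm. Outcomes in the matched controls would then be compared with outcomes in the treated arm.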

“The application of synthetic controls is not a new phenomenon, though it is certainly on the rise, especially in the formal regulatory approval context,” says Casciano.

Focusing on the go-to-market strategy

The traditional contracting model between manufacturers and payers, typically defined by rebate payment levels based on volume, “is quickly becoming outdated and unsustainable,” says Galaznik. By contrast, so-called value-based contracts (VBCs), which are sometimes called outcomes-based agreements (OBAs), strive to tie reimbursement rates to demonstrated clinical outcomes.

Such data-driven contracts are designed to compensate drugmakers (with more favorable pricing or reimbursement terms or formulary positioning) when specific agreed-upon patient outcomes or other clinical measures are achieved during real-world use of the therapy. The use of such contracts is expected to grow over time.

Using the latest data-analytics efforts, “drug sponsors can and should use their findings to demonstrate both the clinical and economic value of their new medications using data-driven studies, and communicate those opportunities to payers,” says Ikeguchi.

But the stakes are higher than ever. “We used to be asked to provide ‘regulatory-grade data’ for regulatory submissions. That is now being categorized as ‘research-grade data,’ which supports high-validity evidence to make clinical assertions,” notes Riskin. “The term research-grade implies use by not only regulators, but also by payers and clinicians, to inform their decision-making with regard to drug pricing, reimbursement, formulary placement and care pathways.”

Meanwhile, for VBCs to work from an operational standpoint, drug companies and payers will need to create robust systems that can do two things: provide longitudinal tracking of outcomes over time among the insured patients, and adjudicate the financial terms to allow for reimbursement per the contract terms. Over time, industry experts expect blockchain technology, which can create “one single version of the truth” in terms of tracking clinical factors and other performance metrics with complete traceability, to become the technical backbone that will enable such agreements to flourish at scale.
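The two operational steps, tracking outcomes and then adjudicating the financial terms, can be sketched in Python. The contract structure here is entirely hypothetical: the `OutcomeContract` fields, the rebate formula and the figures are assumptions chosen for illustration, not terms from any actual agreement.

```python
from dataclasses import dataclass


@dataclass
class OutcomeContract:
    """Hypothetical outcomes-based contract terms."""
    target_response_rate: float  # agreed share of patients meeting the outcome
    base_rebate: float           # rebate owed regardless of outcomes (fraction of spend)
    max_extra_rebate: float      # extra rebate owed if no patient meets the outcome


def observed_response_rate(outcomes):
    """Share of tracked patients who met the agreed-upon outcome."""
    return sum(outcomes) / len(outcomes)


def adjudicate(contract, outcomes, total_spend):
    """Compute the rebate owed, scaling the extra rebate with the shortfall."""
    rate = observed_response_rate(outcomes)
    shortfall = max(0.0, contract.target_response_rate - rate)
    # The extra rebate grows linearly with the shortfall relative to the target.
    extra = contract.max_extra_rebate * shortfall / contract.target_response_rate
    rebate_fraction = contract.base_rebate + extra
    return {
        "observed_rate": rate,
        "rebate_fraction": rebate_fraction,
        "rebate_amount": round(total_spend * rebate_fraction, 2),
    }


contract = OutcomeContract(target_response_rate=0.8, base_rebate=0.10,
                           max_extra_rebate=0.20)

# Made-up longitudinal outcomes: 6 of 10 tracked patients met the outcome.
tracked_outcomes = [True] * 6 + [False] * 4
result = adjudicate(contract, tracked_outcomes, total_spend=1_000_000)

# If every patient meets the outcome, only the base rebate applies.
full_success = adjudicate(contract, [True] * 10, total_spend=1_000_000)
```

In this example the observed rate (60%) falls short of the 80% target, so the drugmaker owes an additional rebate on top of the base rebate. When the target is met or exceeded, only the base rebate applies. In practice the hard part is the input, since `tracked_outcomes` depends on reliable longitudinal data that both parties trust.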

“As an industry, we have to do a better job of improving our models, and, importantly, validating them and communicating our successes, and we must continue to advance common standards for data collection, aggregation and sampling methods to assure representativeness and quality of the data underneath our most sophisticated analyses,” adds Casciano. “There has certainly been movement in that direction, and this trend needs to continue.” 



1. Hwang T.J., et al. Failure of investigational drugs in late-stage clinical development and publication of trial results. JAMA Intern. Med. 2016;176:1826–1833.

2. Moore T., et al. Estimated costs of pivotal trials for novel therapeutic agents approved by the US Food and Drug Administration, 2015-2016 [published online September 24, 2018]. JAMA Intern. Med.