PHUSE/FDA Data Science Innovation Challenge – Drug Safety Surveillance Challenge Abstracts

A current challenge of the PHUSE/FDA Data Science Innovation Challenge is 'Drug Safety Surveillance', which looks to define an ideal process for surveillance of multiple data sources that would be useful in the development of a pharmacovigilance tool. Below are the abstracts of the accepted participants who have proposed a solution to this challenge.

Datacise™ Integrated Safety Explorer
Submitted by: Eric Harvey

Pharmacovigilance is entering a new era in data sourcing for surveillance. The ability to capture safety data has grown manyfold, along with the opportunity to benefit from aggregated safety knowledge bases. Nonetheless, most life sciences organisations and regulatory agencies do not fully tap into this ever-growing safety knowledge base. In our opinion, they lack the human resources to answer any questions beyond those most critical to safety.

To begin with, MMS searched for a worldwide unified registry for reporting adverse reactions to drug and biologic products. It was determined that such data is currently fragmented or unavailable and is suspected to be underreported. There are many disconnected tools available which try to answer safety questions from multiple dimensions. MMS has taken a proactive approach to creating integrated interactive data visualizations, which will allow for quick comparisons between data from multiple traditional sources, as well as emerging sources like social media and genomics data.

For this challenge, we chose nine publicly available data sources that are relevant to the final goal. With advances in technologies, it is easy to generate, source and parse safety data with a fair amount of data engineering knowledge. However, it takes a substantial amount of time for safety scientists or FDA reviewers to answer the question “Why do we have a signal?”

Today’s analytics tools let end users generate static safety reports, interact with data via dashboards, or manually explore data in various formats. These tools can lend insight into a certain safety metric at a single point in time, but it is unclear and difficult to ascertain in real time, with certainty, why a safety metric is emerging or changing.

The MMS proprietary Datacise™ platform Integrated Safety Explorer aims to:

  • answer the ‘why’ behind a signal
  • democratise safety data in conjunction with emerging sources such as social media and genomics.


Submitted by: Deb Piaseczynski

PRA’s Social Intelligence & Communities team researches how patients, caregivers, and HCPs talk about a specific disease. This complementary data allows us to truly understand the voice of the patient: how they describe their conditions, what their challenges and frustrations are, and what type of information they seek. It also helps us identify behavioural trends and activities. Working in collaboration with data scientists, therapeutic experts and clinical operations, PRA uses real-world data, study protocol and site/country lists to develop queries within our global social intelligence platform. The team monitors, captures and analyzes large volumes of online conversations and consumer digital behaviour data.

PRAView harnesses the power of our multiple social and medical databases and, using AI-powered social listening and social platform monitoring, targets specific keywords related to public conversations about AEs, side effects, symptoms, and other physical and emotional trends among indication-specific audiences. This information is visualized via a real-time dashboard that provides demographics (age, gender, geography, ethnicity), emotions, sentiment, drivers and conversation volume.

This automated system provides global, multilingual, multicultural monitoring across multiple social platforms 24/7. For each target mention, we will provide a direct link to the actual post, a screenshot, time and report date and author information (if available) within 24 hours of posting. Monthly reporting by channel provides basic analytics while uncovering potential trends.


Submitted by: Azia Tariq

Quickly identifying adverse events for a newly launched drug can be critical in preventing further complications in patients and informing how a drug performs in the market after launch without the controlled conditions of a clinical trial. Typical channels for reporting AEs include FAERS, which supports the FDA's post-marketing safety surveillance, and MedWatch, the FDA’s medical product safety reporting program for health professionals, patients and consumers.

Traditional reports can take time, so there may be a gap between the drug's launch and the point at which the FDA has received enough AE reports to decide on a drug label change. These reports also cannot account for patients who never report an AE at all, which leads to some AEs being underreported. For example, a drug whose reports come from medical providers will not capture AEs experienced by individuals who opt not to see a doctor. They may, however, choose to ask their social circle via social media.

A proof-of-concept tool that uses social media to fill the gap between launch and the point at which existing pharmacovigilance systems catch up (~6 months) has the potential to reduce this lag, giving the FDA real-time information about a drug by scraping and analyzing social media data. Combined with the current reporting systems, this will better enable the FDA to make informed decisions based on a comprehensive AE profile of the drug. The tool may flag potential new AEs that were not part of the clinical submission database/drug label, generating targets for further follow-up using existing systems. Using an R-Shiny dashboard, we can build a database of tweets for each drug and help visualize AE spikes occurring on various social media platforms.

This proof-of-concept system identifies many of the challenges that must be addressed before a fully functional, scalable solution can be deployed on live-stream social media data. While development of the tool has been hindered by a lack of suitably tagged messages/datasets to train algorithms for this bespoke task, we have outlined paths forward with the end objectives in mind. Given the limited data we have access to, our current proof of concept will present unacceptably high false positive rates when accounting for the volume of social media messages and the low proportion of them reporting drug-related adverse events. As a short-run solution, we recommend a manual review of all system-tagged messages until the system can be refined to achieve sufficient specificity. We also strongly encourage extending the project into a cross-industry collaboration, to produce an adequate training/test set with multiple language support.
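The effect of a low proportion of AE-reporting messages on false positives can be illustrated with a simple base-rate calculation. The sketch below is illustrative only: the sensitivity, specificity and prevalence figures are assumed values chosen for the example, not results from the proof of concept, and the function names are hypothetical.

```python
from typing import Tuple


def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Share of flagged messages that truly report an adverse event (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def expected_flags(n_messages: int, sensitivity: float, specificity: float,
                   prevalence: float) -> Tuple[float, float]:
    """Expected (true, false) flag counts in a batch of n_messages."""
    n_ae = n_messages * prevalence
    true_flags = sensitivity * n_ae
    false_flags = (1.0 - specificity) * (n_messages - n_ae)
    return true_flags, false_flags


if __name__ == "__main__":
    # Assumed figures: 1 in 1,000 messages reports an AE, and the
    # classifier is 90% sensitive and 95% specific.
    ppv = positive_predictive_value(0.90, 0.95, 0.001)
    true_flags, false_flags = expected_flags(100_000, 0.90, 0.95, 0.001)
    print(f"PPV: {ppv:.3f}")
    print(f"Expected true flags: {true_flags:.0f}, false flags: {false_flags:.0f}")
```

Under these assumed figures, fewer than 2% of flagged messages would be genuine AE reports, which is why manual review of flagged messages is a reasonable interim step.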


DeepFAERS: AI for Pharmacovigilance by Diving Deep into FAERS
Submitted by: Zhichao Liu, Weida Tong and Xiaowei Xu

Protecting the safety of patients and consumers is the main mission of the FDA. Our solution is aimed at developing deep neural network models for post-market drug safety surveillance. The data from the FDA Adverse Event Reporting System (FAERS) will be collected and analysed for modelling adverse events after using a drug or therapeutic biologic product.

In addition, adverse events will be mined from social media data using deep neural language models, which can discover adverse events in texts posted by patients and consumers. These adverse events will then be detected, assessed and monitored using the deep neural network model to protect public health.


Deep dissection of post market surveillance signal
Submitted by: Dong Wang and Weida Tong

Monitoring and acting on safety signals from post-market surveillance is a critical component of the FDA’s mission to protect patient safety. Databases like the FDA Adverse Event Reporting System (FAERS) have been extensively analysed to mine safety signals to provide alerts for potential regulatory actions.

However, the current methods tend to focus on high-reporting frequencies in general, while subpopulation-focused analysis has not been sufficiently utilised. In this proposal, we will provide new statistical and data mining tools for the deep dissection of post-market surveillance signals so alerts can be given for different sexes, age groups, and other subpopulations. This approach will not only provide a means for more focused alerts but also detect subpopulation-specific signals that can be diluted in a general analysis.
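To illustrate how a subpopulation-specific signal can be diluted, the sketch below computes a proportional reporting ratio (PRR), a standard disproportionality measure, both overall and per stratum. It is not the authors' method, which the abstract does not specify. The `Report` record, the drug and event names, and all counts are hypothetical toy values.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Report(NamedTuple):
    drug: str
    event: str
    sex: str  # stratification variable; an age group would work the same way


def prr(reports: List[Report], drug: str, event: str) -> float:
    """PRR = [a / (a + b)] / [c / (c + d)] from a 2x2 table of report counts."""
    a = b = c = d = 0
    for r in reports:
        if r.drug == drug:
            if r.event == event:
                a += 1  # drug of interest, event of interest
            else:
                b += 1  # drug of interest, other events
        elif r.event == event:
            c += 1      # other drugs, event of interest
        else:
            d += 1      # other drugs, other events
    if a + b == 0 or c == 0:
        return float("nan")  # ratio undefined without exposure or comparator events
    return (a / (a + b)) / (c / (c + d))


def prr_by_stratum(reports: List[Report], drug: str, event: str) -> Dict[str, float]:
    """Compute the PRR separately within each sex stratum."""
    strata = defaultdict(list)
    for r in reports:
        strata[r.sex].append(r)
    return {sex: prr(group, drug, event) for sex, group in strata.items()}


def toy_reports() -> List[Report]:
    """Hypothetical counts in which the signal is concentrated in one stratum."""
    def block(sex, drug, n_event, n_other):
        return [Report(drug, "rash", sex)] * n_event + [Report(drug, "other", sex)] * n_other
    return (block("F", "drug_x", 20, 80) + block("F", "comparator", 10, 190)
            + block("M", "drug_x", 5, 295) + block("M", "comparator", 10, 190))


if __name__ == "__main__":
    data = toy_reports()
    print(f"overall PRR: {prr(data, 'drug_x', 'rash'):.2f}")
    for sex, value in sorted(prr_by_stratum(data, "drug_x", "rash").items()):
        print(f"PRR for {sex}: {value:.2f}")
```

With these toy counts, the pooled PRR is 1.25 while the PRR among female reports is 4.0, so a screen applied only to the pooled data could miss a signal that the stratified analysis surfaces.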


A deep learning phylogenomic approach to advance post market safety surveillance of drugs and biologic products
Submitted by: Pere Puigbo

Artificial Intelligence (AI) systems can be trained to find statistical structures in large and complex datasets. These techniques are currently in high demand in several industries to analyze unstructured datasets from the Internet of Things, such as text, images, and real-time and spatial data. Several AI systems, including machine learning and deep learning algorithms, can be used for post-market safety surveillance of drugs and biologic products. However, real-time systems present an NP problem that should be approached with heuristic methods. Heuristic algorithms perform analyses much faster, but at the cost of reduced precision and accuracy.

Here, I propose a solution to speed up screening time by performing a comprehensive phylogenomic classification of potential targets of biological products. In other words, I will use phylogenetic trees to classify target proteins/genes, and each clade of the tree will have a risk score associated that may change over time. The proposed solution is divided into three work packages (WPs), the specific aims of which are as follows:

WP1. Phylogenomic classification: to classify targets (genes and proteins from the KEGG database) of drugs and biological products using phylogenomic methods (including sequence analysis, phylogenetic trees and multifactorial analyses).

WP2. Microevolutionary analysis: to identify what kind of targets are more prone to safety issues and how these change over time. I will use the phylogenetic trees from WP1 to determine how the safety score changes over time, in a controlled microevolutionary scenario.

WP3. Screening with deep learning: to find safety risks associated with drugs and biological products. The aim of WP3 is to determine how the reduction of data dimensionality (reduction of the number of targets through phylogenetic clustering) facilitates real-time monitoring and surveillance. The goal is to reduce the time of predictions, without losing predictive power.


Counterfeit Drug Detection
Submitted by: Xin Liu

Counterfeit drugs cost the lives of hundreds of thousands of people in the world every year. The traditional counterfeit drug detection solutions are manual, labor-intensive, ineffective, costly and intrusive. The Precise team has developed an innovative counterfeit drug detection solution that is based upon the latest all-spectrum image recognition and on-the-edge TPU-based machine learning technologies.

Our solution has been proven to be accurate, automatic, near real-time, low-cost and non-intrusive. In this challenge, we would like to discuss the concept, hardware design and deep learning model for counterfeit drug detection built on AutoML and the end-to-end process for image augmentation, image processing and ML model training. 


Early Safety Signal Detection from Public Internet Sources
Submitted by: Jason Meyer

Post-marketing surveillance of FDA-regulated products is critical in identifying potential adverse events (AEs) in the real-world population. An infrastructure leveraging near real-time data to facilitate early signal detection may aid the FDA’s mission in continual assessment of products’ risk profiles.

Our goal is to establish an active surveillance system that collects, annotates, standardizes, evaluates and presents data from public open sources using advanced data science technology to enhance the FDA’s ability for early safety signal detection.

We propose using an artificial intelligence (AI) based approach to detect early safety signals from social media sites such as Reddit, Twitter and WebMD as AI techniques excel at extracting meaningful patterns from large volumes of ambiguous data. To augment the AI-based detection system, signals detected from social media data can be evaluated in the context of other data sources, such as the FDA reporting systems FAERS and VAERS.

We will establish an infrastructure to mine social media data for safety signals in near real time via the following steps:

– Select social media sites and collect data where potential AEs are reported.

– Extract language fragments from the sample data using natural language processing techniques.

– Annotate keywords from the sample data into standardized AE terminologies, supplemented with keywords from FAERS and VAERS.

– Build an extensive list of keywords for AEs from the initial list by applying word embedding techniques to a large sample of social media data.

– Build a supervised ML model for determining potential safety signals.

– Aggregate and present the model results in a dynamic interpretable dashboard with geographic and demographic information.
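
The annotation and aggregation steps above can be sketched in a few lines. This is a deliberately simplified stand-in: a plain dictionary lookup replaces the word embedding and supervised ML steps the abstract proposes, and the lexicon, phrases and function names are hypothetical, not extracts from FAERS, VAERS or any official terminology.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical lay-phrase -> standardized-term lexicon (illustrative only).
AE_LEXICON: Dict[str, str] = {
    "threw up": "Vomiting",
    "throwing up": "Vomiting",
    "can't sleep": "Insomnia",
    "cant sleep": "Insomnia",
    "headache": "Headache",
    "dizzy": "Dizziness",
}


def normalise(text: str) -> str:
    """Lowercase, unify curly apostrophes and collapse whitespace."""
    text = text.lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def annotate(post: str, lexicon: Dict[str, str] = AE_LEXICON) -> set:
    """Return the standardized terms whose lay phrases occur in the post as whole words."""
    text = normalise(post)
    found = set()
    for phrase, term in lexicon.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.add(term)
    return found


def aggregate(posts: Iterable[str]) -> Counter:
    """Count, per standardized term, the number of posts that mention it."""
    counts = Counter()
    for post in posts:
        counts.update(annotate(post))
    return counts


if __name__ == "__main__":
    sample_posts = [
        "Started the new med last week and now I can\u2019t sleep. Also a bit dizzy.",
        "Threw up twice after my second dose",
        "No side effects so far!",
        "Headache and throwing up all day, headache won't stop",
    ]
    for term, n in aggregate(sample_posts).most_common():
        print(term, n)
```

Each post contributes at most one count per term, so repeated mentions within a single post do not inflate the totals that would feed a dashboard.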

This infrastructure will likely complement the FDA’s current surveillance networks, enhancing the early detection of safety signals warranting further investigation and systematic examination.


Drug safety surveillance using media data and public regulatory database
Submitted by: Xiuting Mi

We aim to use media data to evaluate the potential contribution of mining social media networks to post-marketing drug safety surveillance. Data are collected mainly, though not exclusively, from two social media platforms: public English-language Twitter, via its open API, and simplified Chinese-language Sina Weibo (the biggest social media platform in mainland China). Posts related to drug adverse events are identified and characterised using natural language processing methods and then analysed. Using these data, we want to evaluate the level of concordance between social media posts mentioning AE-like reactions and spontaneous reports received by a regulatory agency, namely the FDA. If possible, we also want to identify potential new safety signals across ethnicities, countries and other groupings.

In terms of methodology, we plan to apply well-known NLP steps to both the English and Chinese data, through segmentation and tokenisation. Once we have a corpus, we can apply text classification to match posts to terms in the MedDRA dictionary. Further exploratory statistical analyses will include, but are not limited to, counting how often each AE term occurs in the social media posts and in the health authority database, and creating plots (swim-lane plots, for example) to find potential correlations between the two sources.
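
As one possible way to quantify concordance between the two sources, the sketch below counts AE term frequencies in each source and computes a Spearman rank correlation over the union of terms. The choice of Spearman correlation is an assumption made for illustration, not something the abstract specifies, and the term lists and function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def concordance(social_terms: List[str], regulatory_terms: List[str]) -> Tuple[List[str], float]:
    """Align term frequencies from both sources and return (terms, Spearman rho)."""
    social, regulatory = Counter(social_terms), Counter(regulatory_terms)
    terms = sorted(set(social) | set(regulatory))
    x = [social[t] for t in terms]      # a term absent from a source counts as 0
    y = [regulatory[t] for t in terms]
    return terms, spearman(x, y)


if __name__ == "__main__":
    # Hypothetical term occurrences after mapping posts and reports to AE terms.
    social = ["Headache"] * 5 + ["Nausea"] * 3 + ["Insomnia"] * 2 + ["Rash"]
    regulatory = ["Headache"] * 40 + ["Nausea"] * 25 + ["Rash"] * 10 + ["Insomnia"] * 5
    terms, rho = concordance(social, regulatory)
    print(terms)
    print(f"Spearman rho: {rho:.2f}")
```

A rank-based measure is used here because social media and regulatory report volumes differ greatly in scale, so only the relative ordering of AE terms is compared.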

Challenge Stream Chairs:

Patricia Hegarty and Suranjan De

