Precision Medicine & Genomics Data Management Q&A

In March 2017, the Foundation for eHealth Initiative hosted a webinar, “State of the Art in Data Management for Precision Medicine and Genomics.”

Featured speakers included:

  • Jim Buntrock, Vice Chair Information Management and Analytics, Mayo Clinic
  • Josh Peterson, MD, MPH, Vanderbilt University School of Medicine
  • Paul Terry, CEO and CTO, PHEMI Systems

We’re pleased to share the panelists’ answers to the outstanding questions from the webinar.

Some estimate that up to 80% of personal health information is unstructured. How do you deal with the unstructured data? Do you use natural language processing?

Mayo: Mayo uses select NLP methods on a per-project basis, where feature extraction from unstructured content is needed. For this, Mayo uses the open-source framework cTAKES and other internally developed methods.

PHEMI: While some healthcare data certainly fits very nicely in a relational database, it’s true that the vast majority of healthcare data is considerably more complex – either unstructured or semi-structured. Common sources of unstructured data include:

  • Clinical notes, diagnostics reports, discharge summaries, etc.
  • Lab results
  • DICOM images from CT (computed tomography), MRI (magnetic resonance imaging), Ultrasound, X-ray and many more.
  • Genomics
  • Wearables
  • Implanted devices
  • Bedside telemetry

What’s really important with all these sources is the ability to:

  • Extract information from these documents so the information can be analyzed
  • Preserve the original document so you can extract new information in the future
  • Always maintain provenance (a link back to the source of truth) so you can verify the accuracy of the data you extract.
  • Extract the semantics of the data—for example, so you can see if someone has a family history of hypertension, has had hypertension in the past, is at risk of hypertension, or currently is being treated for hypertension.
  • Use the extracted data without risk of exposing Personal Health Information and violating HIPAA privacy guidelines.
  • Tag the extracted data with a confidence score, such as moderate confidence for machine-extracted data; higher confidence for data that a human has over-read; and highest confidence for data validated in a patient exam.
  • Perform this analysis automatically when new data is ingested into the system.

PHEMI supports a variety of NLP tools – open-source and proprietary – to derive data, so customers can extract information, classify, and augment their own metadata.
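The requirements listed above (preserved originals, provenance, semantics, and confidence tiers) can be pictured with a small data model. This is only an illustrative sketch, not PHEMI's or Mayo's implementation: the names `Status`, `Confidence`, `ExtractedFact`, `evidence`, and `currently_treated` are hypothetical, and the note text is made up.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Status(Enum):
    """Semantic context of a finding (the hypertension cases above)."""
    FAMILY_HISTORY = "family_history"
    PAST = "past"
    AT_RISK = "at_risk"
    CURRENT = "current"


class Confidence(IntEnum):
    """Ordered tiers taken from the confidence bullet above."""
    MACHINE_EXTRACTED = 1   # moderate
    HUMAN_REVIEWED = 2      # higher
    EXAM_VALIDATED = 3      # highest


@dataclass(frozen=True)
class ExtractedFact:
    concept: str            # e.g. "hypertension"
    status: Status
    confidence: Confidence
    source_doc_id: str      # provenance: which preserved document
    char_start: int         # provenance: span inside that document
    char_end: int


def evidence(fact, documents):
    """Follow provenance back to the exact source text."""
    return documents[fact.source_doc_id][fact.char_start:fact.char_end]


def currently_treated(facts, concept, minimum=Confidence.HUMAN_REVIEWED):
    """Keep current findings for a concept at or above a confidence tier."""
    return [
        f for f in facts
        if f.concept == concept
        and f.status is Status.CURRENT
        and f.confidence >= minimum
    ]


# The original note is kept unchanged; facts only point into it.
documents = {
    "note-001": "Mother had hypertension. Patient on lisinopril for hypertension."
}
facts = [
    ExtractedFact("hypertension", Status.FAMILY_HISTORY,
                  Confidence.MACHINE_EXTRACTED, "note-001", 0, 24),
    ExtractedFact("hypertension", Status.CURRENT,
                  Confidence.HUMAN_REVIEWED, "note-001", 25, 64),
]
```

Because each fact stores a document ID and character span rather than a copy of the text, the source note stays the single source of truth, and new extractions can be added later without touching it.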

How do you see VNA technology and big data NOSQL databases coexisting in the future?

Mayo: Mayo Clinic views a DNA archive as a long-term infrastructure component, similar to an image archive. We have reviewed technologies, but have not deployed an archive method. Big data and HPC methods are used in processing data. Variant management, filtering, and querying perform well on NoSQL technologies like HBase or MongoDB.

PHEMI: VNA is better suited for cost-effective and reliable long term archival of source records (e.g., “cold” storage of high resolution images), while NoSQL technology is better suited for applications where frequent low-latency access is required (e.g., records actively being used for research purposes.)

With respect to images, there are two key aspects to long-term archiving: workflow and analytics. VNA is very good for integrating with the radiology workflow, universal viewing systems, and existing clinical systems. A big data system is better suited to population-level queries that compare imaging with demographics, survival rates, treatment plans, and gene mutations. The power of this sort of big data query is that it can be used for research purposes to build cohorts, for instance. It’s also well-suited to provide assistance to physicians on treatment options and survival curves for new patients.

For our Masters in Health Informatics Program, we want to prepare our students for the skills that will be in demand not only on graduation day but 5-10 years after. What skills do they need to work with health-related data, including Precision Medicine & Genomics, as the field matures?

PHEMI: Medical research is a very data-driven science today. In the future, the day-to-day practice of medicine needs to and will become more data-driven. We experience this “data-driven” approach whenever we buy something on Amazon and we see that people who bought this, also bought that. This is an example of learning from others to provide physicians with hints and recommendations–e.g., Patients like this responded well to these treatment plans.

The other thing is that physicians are increasingly being inundated with more and more data. We’re going to need ways to quickly identify the most important information so we know when, where and how to intervene. For example: Which patients are at risk of diabetes? How can I best encourage Mr. Smith to change his behavior? Which treatment plan is most effective for Ms. Jones?

It’s also critical to understand key concepts around privacy, governance, policy, consent management, and the absolute necessity of combining data to create value. Students need to understand that data holds transformative value, yet also must be used appropriately.

From an industry perspective, are cloud-based deployment models being considered now?

Mayo: Cloud is being actively explored for both IaaS and PaaS offerings in the support of data processing and management.

PHEMI: There are a few different “flavours” of cloud: Infrastructure as a Service, Platform as a Service, Software as a Service. We’re seeing more uptake on the first two in healthcare. We are certainly seeing a lot more interest in cloud models, mainly to benefit from the agility, flexibility, and economics of the cloud. However, compliance issues always have to be factored in. PHI and PII come with a whole set of real issues. Ensure you select someone who can meet the compliance, security, and performance needs of your organization. Are they operating to HITRUST and HIPAA compliant standards?

The other thing that hits you with the cloud is the pricing; essentially, pricing is based on usage, so putting your data into the cloud is “free” but computing on it, and possibly taking it out, costs money. That can discourage actual use of the data, since operating expenses can shoot up every time a data scientist explores the data.

What analytic tools does PHEMI use for visualizations to the end user?

PHEMI: It all depends on who’s accessing the data. Power users like data scientists and data engineers use the Data Science Tool Kit (DSTK) – a very powerful programmatic and visualization workbench for interactive data preparation, discovery, analytics and visualization. Data analysts, researchers, bio-informaticians and other end users can use their organizations’ analytics tools, such as Tableau, QlikView, R, MicroStrategy, and others, by exporting manufactured datasets.

Also, end users have access to the PHEMI REST API that enables easy integration with custom applications and dashboards, like powerful JavaScript visualization libraries (e.g., D3 and Charts.)

For PHEMI, do you have any new ontology layers built to integrate with genomic data? Or is it more hard translation?

PHEMI: There are multiple standards and competing ontologies to interpret genomic data and the metadata attached to that genomic data. We’re currently working on data dictionaries to handle this. We take an approach where we can represent the genome in any number of ways in the system if the data dictionary entry has been made.

How scalable is the integration of genetic sequence data into Clinical Decision Support (CDS) systems, and how receptive are clinicians to introducing this into current workflows?

Mayo: Most of the EHRs provide a limited mechanism to store discrete variants for use in traditional CDS. There is awareness that larger scale panel testing or WES or WGS will break this model.

PHEMI: Genomics, proteomics and metabolomics data will eventually support CDS systems. We are starting to see early iterations of it in certain areas. There’s a company here in BC that develops next-generation sequencing tests to detect mutations in cancer DNA, and then combines these results with clinical data to provide a treatment recommendation to the treating physician.

From a privacy perspective, could you let us know what we need to consider for genomic data and next generation sequencing?

Mayo: We treat genomic data the same as PHI. For research, the requirements of the authorized research protocol apply. For other uses, we treat it as PHI and generally follow the principle of least-privilege access necessary to perform the role.

PHEMI: An individual’s genomic data is considered PHI; it cannot be truly de-identified.

But, it is still possible to run queries to build a cohort. For example, “Show me all patients with a mutation in BRCA1 or patients with a mutation in a certain region or CNV.” PHEMI calculates the risk of re-identification, so if the risk of re-identification is high because the resulting dataset is too small, then PHEMI will filter out the at-risk data.
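One simple way to picture this kind of filtering is a minimum group size rule, where any result group below a threshold is withheld. The sketch below is an assumption made for illustration: the source does not say how PHEMI calculates re-identification risk, and the threshold `MIN_GROUP_SIZE`, the function names, and the patient records are all hypothetical.

```python
from collections import Counter

# Hypothetical threshold: groups smaller than this are treated as
# too easy to re-identify. Real policies set this per dataset.
MIN_GROUP_SIZE = 5


def cohort_counts(patients, gene, group_by):
    """Count patients with a mutation in `gene`, grouped by one attribute."""
    return Counter(
        p[group_by] for p in patients if gene in p["mutated_genes"]
    )


def suppress_small_groups(counts, min_size=MIN_GROUP_SIZE):
    """Drop groups whose size falls below the threshold."""
    return {group: n for group, n in counts.items() if n >= min_size}


# Made-up records for illustration only.
patients = (
    [{"region": "north", "mutated_genes": {"BRCA1"}} for _ in range(6)]
    + [{"region": "south", "mutated_genes": {"BRCA1"}} for _ in range(2)]
    + [{"region": "south", "mutated_genes": {"TP53"}} for _ in range(9)]
)

raw = cohort_counts(patients, "BRCA1", "region")
safe = suppress_small_groups(raw)
```

Here the two BRCA1 patients in the "south" group fall below the threshold, so only the "north" count is returned to the person running the query.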

I’d say that every organization doing precision medicine work is having discussions on how and where PHI data can be appropriately used and how patient consent needs to be managed. That’s part of a much broader societal discussion around the ethical and legal implications of having this information; for example, how it may relate to hiring practices or insurance coverage for individuals with certain genetic markers.

And while the webinar was primarily concerned with human genomics data, it’s also worth pointing out there’s significant work going on with the sequencing of plants, bacteria, viruses, etc., which don’t have the same issues around privacy and PHI.

As genomic privacy policies and practices are maturing, it is important to ensure that genomic data management platforms can support fine-grained access and privacy controls. For example, PHEMI Precision Medicine Edition gives privacy officers fine-grained control over privacy and access right down to an individual base pair position.
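To make "fine-grained down to a base pair position" concrete, the sketch below checks access per position against a list of interval grants and masks everything else. It is not PHEMI's API or policy model; `Grant`, `can_read`, `redact`, the role names, and the coordinates are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Permission for one role on a closed, 1-based interval of a chromosome."""
    role: str
    chrom: str
    start: int
    end: int


def can_read(role, chrom, pos, grants):
    """True if any grant for this role covers the single position."""
    return any(
        g.role == role and g.chrom == chrom and g.start <= pos <= g.end
        for g in grants
    )


def redact(role, chrom, first_pos, bases, grants, mask="N"):
    """Return the sequence with every base the role may not read masked."""
    return "".join(
        base if can_read(role, chrom, first_pos + i, grants) else mask
        for i, base in enumerate(bases)
    )


# Illustrative coordinates and roles only.
grants = [Grant("oncology_researcher", "chr17", 1000, 1003)]
visible = redact("oncology_researcher", "chr17", 998, "ACGTACGT", grants)
hidden = redact("billing", "chr17", 998, "ACGTACGT", grants)
```

With these grants the researcher sees only positions 1000 to 1003 and the rest is masked, while a role with no grant sees nothing but the mask character.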

Do you think bio-banks are plausible for aiding your research?

Mayo: Yes. We have invested in a 50K control bio-bank that is used for feasibility and further case-control designs. If set up properly, they typically need good clinical phenotype characterization and patient consent and authorization to be eligible for multiple studies.

PHEMI: Bio-banks are often used as a long-term storage mechanism; some argue that they are the cheapest way to store DNA. At the BC Cancer Center, they’ve been keeping prostate cancer tissues since the seventies. Because the cost of sequencing has dropped so dramatically, they can re-sequence the tissues when needed for new analysis and research.

Do you see by having this kind of precision medicine the practice of medicine changing? And is it integrated/measured against outcomes?

Mayo: Yes. This represents another disruptive technology that will help physicians better diagnose, treat, or prognose patients. Adoption is slowed by education, reimbursement, and applicability. Most testing is based on clinical utility, but we see more opportunity to measure impact.

PHEMI: Precision medicine research aims to turn today’s broad “trial and error” probabilistic treatment approach into a more focused, individualized, and evidence-based approach for disease prevention and treatment. It’s about looking at my molecular makeup to diagnose me, prevent disease, and treat me in a way that matches my molecular profile. Personalized medicine has the potential to replace one-size-fits-all treatment plans and at the same time address the unsustainable healthcare costs, with patient-specific effective treatment.

I think it’s also important to note that precision medicine is not strictly about genomics; it’s about disease treatment and prevention that factors in individual differences in lifestyle, environment, and genetics.

We’re certainly seeing the prevention angle with individuals demanding more attention to their wellbeing, and organizations incenting healthy behaviour. We’re working with a payer who has a huge wellness program that is very outcomes related – better insurance premiums, personalized rewards, and grocery discounts for those who are purchasing healthy food. This payer tracked lower health care costs of 7% to 14% for those participating in the wellness program.

For Mayo, are there any discussions on possibly making this data available for eligible third parties to do research?

Mayo: Discussions are limited, and there are concerns about meeting HIPAA, HITRUST, or SOC 2 regulations. De-identification is difficult.

Is there an analytic solution that is uniformly available across your databases that is easy to use, does not require a PhD, can be used by a subject matter expert, and is fast--a few seconds response time? What are some of our big states using to handle their big data such as in Medicaid?

Mayo: We provide some self-service tools for variant frequency on Mayo population. Genomics data by its nature requires some level of domain familiarity and self-service has to factor some of this expertise.

PHEMI: PHEMI supports the use of any 3rd party tool that the domain expert is comfortable with (R, Bioconductor, SPSS, Tableau, SAS, etc.) through exporting of manufactured datasets.

What are the panel’s views on the likely business models for genomic analytics services? Will individual provider networks develop these services in house or outsource to 3rd party solution providers?

Mayo: This will vary per organization. There have been and will be new genomic analysis companies that provide a range of services from sequencing to interpretation. Due to Mayo’s reference laboratory business, we will continue to process and interpret results.

PHEMI: I strongly believe that there will be companies whose primary business will be to interpret the data, not only genomic, but also microbiomic, proteomic etc. With the cost of sequencing now in the range of affordability for many individuals, the demand for people who can actually interpret that data is growing. We’ll see “back office” departments or companies who have the specialized skills to interpret the data.

How much of the genome do you sequence most commonly (whole genome, exome, or panel) and how do you handle privacy requirements of genomic data?

Mayo: Clinically, it is panel testing. For research, it is WES or WGS. We treat genomic data the same as clinical data. IRB governs research protocol. Patient care follows both laboratory and clinical use.

What PHRs have you linked to in your work?

Mayo: None.

PHEMI: We all know that EMR/EHRs are not fantastic for storing the unstructured data that we are now seeing in healthcare. We’re working with a client – Molecular You – who is building a PHR that contains a huge variety of data – not only the data stored in a traditional EMR, but also an individual’s genomic, proteomic, metabolomic, and microbiomic data – with the goal of delivering a comprehensive, personalized digital health guide.

Given the importance of social determinants and the nature of how those data are gathered, have you applied text analytics to mining and modeling efforts?

Mayo: Limited, but definitely opportunity.

Does PHEMI use controlled vocabularies/ontologies to integrate/aggregate and search data?

PHEMI: We use metadata standards and data dictionaries to integrate, aggregate, and search data. We are definitely interested in working with customers who want to help us advance these capabilities.

Are any of the panelists documenting ROI of these analytics services? For example, improved outcomes, other metrics? Is that information going to be made public - shared?

Mayo: Outcomes have been mostly useful for understanding the consideration of testing or a change of order based on the result, especially for pharmacogenomics rules.

Does storage of data (large volumes from genetic sequencing) represent a major hurdle for deploying these services?

Mayo: Genomic sequencing takes capital and resources for instrumentation, storage, and processing of data. No easy short-cuts.

PHEMI: The data volumes involved mean that it’s easy to do things wrong. At a recent conference, a speaker remarked that their DIY cloud-based genomics storage + bioinformatics solution ended up costing a prohibitive amount after the first year. The technologies exist to deal with the large volumes, but finding the right solutions that can scale is the hard part.

At PHEMI, we focus on precisely this data management and protection problem – at scale.

Can Blockchain help with building larger more accessible biobanks?

Mayo: Blockchain may offer some opportunity for federated sharing. But technology acceptance may take some time with patient data.

PHEMI: We don’t see Blockchain as having a near-term role in this space. I wouldn’t say “never,” but not right now. As the data gets bigger in Blockchain, it becomes less searchable.

Does PHEMI have a way to scan/import paper-based form data?

PHEMI: We can receive any type of data in any format. Right now, for a customer doing research on premature cardiovascular disease, we ingest the scanned image (unstructured data), decorate it with metadata, run some OCR on the image to extract the information and prepare it for analytics (essentially creating more metadata on the original file), and then combine the prepared data with the other collected data (patient surveys, genealogy info, bloodwork) to create a rich set of curated patient data for analysis.
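The flow described in this answer (ingest the original, attach metadata, add OCR output as further metadata, then combine with other sources) might look roughly like the sketch below. It is a hedged illustration, not PHEMI's pipeline: every function name is hypothetical, the catalog is a plain dictionary, and `placeholder_ocr` merely stands in for a real OCR engine.

```python
import hashlib


def ingest(raw, patient_id, kind, catalog):
    """Store the untouched original plus basic descriptive metadata."""
    doc_id = hashlib.sha256(raw).hexdigest()[:12]
    catalog[doc_id] = {
        "raw": raw,  # original bytes are never modified
        "metadata": {"patient_id": patient_id, "kind": kind},
    }
    return doc_id


def add_ocr_text(doc_id, catalog, ocr):
    """Run OCR and record the result as extra metadata on the original."""
    record = catalog[doc_id]
    record["metadata"]["ocr_text"] = ocr(record["raw"])


def curated_view(patient_id, catalog, other_sources):
    """Combine OCR text with other collected data for one patient."""
    forms = [
        r["metadata"]["ocr_text"]
        for r in catalog.values()
        if r["metadata"]["patient_id"] == patient_id
        and "ocr_text" in r["metadata"]
    ]
    return {
        "patient_id": patient_id,
        "forms": forms,
        **other_sources.get(patient_id, {}),
    }


def placeholder_ocr(raw):
    """Stand-in for a real OCR engine: treats the bytes as ASCII text."""
    return raw.decode("ascii")


# Made-up form content and survey data for illustration only.
catalog = {}
doc_id = ingest(b"Smoker: no", "p-1", "scanned_form", catalog)
add_ocr_text(doc_id, catalog, placeholder_ocr)
view = curated_view("p-1", catalog, {"p-1": {"survey_score": 7}})
```

Keeping the OCR output in the metadata, next to the stored original, means the extraction can be rerun with a better engine later without losing the source image.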

Are you working with the DNA study of one million people at MIT?

Mayo: Not aware of this work.

