
CancerBase.org – The Global HUB for Diagnoses, Genomes, Pathology Images: A Real-time Diagnosis and Therapy Mapping Service for Cancer Patients – Anonymized Medical Records accessible to anyone on Earth
Reporter: Aviva Lev-Ari, PhD, RN
UPDATED on 10/29/2019
Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis
Yu Fu (1), Alexander W Jung (1), Ramon Viñas Torne (1), Santiago Gonzalez (1,2), Harald Vöhringer (1), Mercedes Jimenez-Linan (3), Luiza Moore (3,4) and Moritz Gerstung (1,5)#
# to whom correspondence should be addressed
1) European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK.
2) Current affiliation: Institute for Research in Biomedicine (IRB Barcelona), Parc Científic de Barcelona, Barcelona, Spain.
3) Department of Pathology, Addenbrooke’s Hospital, Cambridge, UK.
4) Wellcome Sanger Institute, Hinxton, UK.
5) European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
Correspondence:
Dr Moritz Gerstung, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SA, UK. Tel: +44 (0) 1223 494636. E-mail: moritz.gerstung@ebi.ac.uk
Abstract
Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis
Here we use deep transfer learning to quantify histopathological patterns across 17,396 H&E-stained histopathology image slides from 28 cancer types and correlate these with underlying genomic and transcriptomic data. Pan-cancer computational histopathology (PC-CHiP) classifies the tissue of origin across organ sites and provides highly accurate, spatially resolved tumor and normal distinction within a given slide. The learned computational histopathological features correlate with a large range of recurrent genetic aberrations, including whole genome duplications (WGDs), arm-level copy number gains and losses, focal amplifications and deletions, as well as driver gene mutations within a range of cancer types. WGDs can be predicted in 25/27 cancer types (mean AUC = 0.79), including those that were not part of model training. Similarly, we observe associations with 25% of mRNA transcript levels, which enables histopathological patterns of molecularly defined cell types to be learned and localised on each slide. Lastly, we find that computational histopathology provides prognostic information augmenting histopathological subtyping and grading in the majority of cancers assessed, pinpointing prognostically relevant areas such as necrosis or infiltrating lymphocytes on each tumour section. Taken together, these findings highlight the large potential of PC-CHiP to discover new molecular and prognostic associations, which can augment diagnostic workflows and lay out a rationale for integrating molecular and histopathological data.
https://www.biorxiv.org/content/10.1101/813543v1
Key points
● Pan-cancer computational histopathology analysis with deep learning extracts histopathological patterns and accurately discriminates 28 cancer and 14 normal tissue types
● Computational histopathology predicts whole genome duplications, focal amplifications and deletions, as well as driver gene mutations
● Widespread correlations with gene expression indicative of immune infiltration and proliferation
● Prognostic information augments conventional grading and histopathology subtyping in the majority of cancers
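To make the pipeline sketched in the abstract and key points concrete, here is a minimal, purely illustrative PyTorch sketch of the two-step transfer-learning idea: fine-tune an ImageNet-pretrained CNN to classify tissue tiles, then reuse its penultimate-layer activations as "computational histopathological features" for predicting a genomic trait such as whole-genome duplication. The network choice (torchvision’s InceptionV3), the 42-class head, and the random stand-in tiles and labels are assumptions; the preprint’s actual architecture, tile preprocessing and training regime are not reproduced here.

```python
# Illustrative sketch only; stand-in data, not the authors' code.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Step 1: fine-tune an ImageNet-pretrained CNN to classify tissue tiles
# into the 42 tumour/normal classes (28 cancer + 14 normal tissue types).
net = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
net.fc = nn.Linear(net.fc.in_features, 42)
# ... standard supervised training on labelled H&E tiles would go here ...

# Step 2: drop the classification head and treat the 2048-d
# penultimate-layer activations as computational histopathological features.
net.fc = nn.Identity()
net.eval()
with torch.no_grad():
    tiles = torch.randn(64, 3, 299, 299)   # stand-in for a batch of tiles
    features = net(tiles).numpy()          # 64 x 2048 feature matrix

# Step 3: fit a simple classifier on the features to predict a genomic
# trait, e.g. whole-genome duplication (WGD), and score it by AUC.
wgd = np.random.randint(0, 2, size=64)     # stand-in WGD labels
clf = LogisticRegression(max_iter=1000).fit(features, wgd)
print("AUC:", roc_auc_score(wgd, clf.predict_proba(features)[:, 1]))
```

In the preprint this last step is evaluated per cancer type with held-out data; the reported mean AUC of 0.79 for WGD refers to that setting, not to a toy run like this.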
Discussion
Here we presented PC-CHiP, a pan-cancer transfer learning approach to extract computational histopathological features across 42 cancer and normal tissue types, and their genomic, molecular and prognostic associations. Histopathological features, originally derived to classify different tissues, contained rich histologic and morphological signals predictive of a range of genomic and transcriptomic changes as well as survival. This shows that computer vision not only reproduces predefined tissue labels with high accuracy, but also quantifies diverse histological patterns predictive of a broad range of genomic and molecular traits that were not part of the original training task. As the predictions are exclusively based on standard H&E-stained tissue sections, our analysis highlights the high potential of computational histopathology to digitally augment existing histopathological workflows.
The strongest genomic associations were found for whole genome duplications, which can in part be explained by nuclear enlargement and increased nuclear intensities, but seemingly also stem from tumour grade and other histomorphological patterns contained in the high-dimensional computational histopathological features. Further, we observed associations with a range of chromosomal gains and losses, focal deletions and amplifications as well as driver gene mutations across a number of cancer types. These data demonstrate that genomic alterations change the morphology of cancer cells, as in the case of WGD, but possibly also that certain aberrations preferentially occur in distinct cell types, reflected by the tumor histology. Whatever the cause or consequence in this equation, these associations lay out a route towards genomically defined histopathology subtypes, which will enhance and refine conventional assessment.
Further, a broad range of transcriptomic correlations was observed, reflecting both immune cell infiltration and cell proliferation that leads to higher tumor densities. These examples illustrate the remarkable property that machine learning not only establishes novel molecular associations from pre-computed histopathological feature sets but also allows these traits to be localised within a larger image. While this exemplifies the power of large-scale data analysis to detect and localise recurrent patterns, it is probably not superior to spatially annotated training data. Yet such data can, by definition, only be generated for associations which are known beforehand. This appears straightforward, albeit laborious, for existing histopathology classifications, but more challenging for molecular readouts. Novel spatial transcriptomic44,45 and sequencing technologies46 bring within reach spatially matched molecular and histopathological data, which would serve as a gold standard for combining imaging and molecular patterns.
Across cancer types, computational histopathological features showed a good level of prognostic relevance, substantially improving prognostic accuracy over conventional grading and histopathological subtyping in the majority of cancers. It is very remarkable that such predictive signals can be learned in a fully automated fashion. Still, at least at the current resolution, the improvement over a full molecular and clinical workup was relatively small. This might be a consequence of the far-ranging relations between histopathology and molecular phenotypes described here, implying that histopathology is a reflection of the underlying molecular alterations rather than an independent trait. Yet it probably also highlights the challenges of unambiguously quantifying histopathological signals in – and combining signals from – individual areas, which requires very large training datasets for each tumour entity.
From a methodological point of view, the prediction of molecular traits can clearly be improved. In this analysis, we adopted – for simplicity and to avoid overfitting – a transfer learning approach in which an existing deep convolutional neural network, developed for the classification of everyday objects, was fine-tuned to predict cancer and normal tissue types. The implicit imaging feature representation was then used to predict molecular traits and outcomes. Instead of employing this two-step procedure, which risks missing patterns irrelevant to the initial classification task, one might train directly on the molecular trait of interest, or ideally employ multi-objective learning. Further improvement may also come from the choice of CNN architecture. Everyday images have no defined scale due to a variable z-dimension; therefore, the algorithms need to be able to detect the same object at different sizes. This is clearly not the case for histopathology slides, in which one pixel corresponds to a defined physical size at a given magnification. Therefore, possibly less complex CNN architectures may be sufficient for quantitative histopathology analyses, and may also show better generalisation.
Here, in our proof-of-concept analysis, we observed a considerable dependence of the feature representation on known and possibly unknown properties of the training data, including the image compression algorithm and its parameters. Some of these issues could be overcome by amending and retraining the network to isolate the effect of confounding factors, and by additional data augmentation. Still, given the flexibility of deep learning algorithms and the associated risk of overfitting, one should generally be cautious about their generalisation properties and critically assess whether a new image is appropriately represented.
Looking forward, our analyses reveal the enormous potential of using computer vision alongside molecular profiling. While the eye of a trained human may still constitute the gold standard for recognising clinically relevant histopathological patterns, computers have the capacity to augment this process by sifting through millions of images to retrieve similar patterns and establish associations with known and novel traits. As our analysis showed, this helps to detect histopathology patterns associated with a range of genomic alterations, transcriptional signatures and prognosis – and to highlight areas indicative of these traits on each given slide. It is therefore not too difficult to foresee how this may be utilised in a computationally augmented histopathology workflow enabling more precise and faster diagnosis and prognosis.
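The "multi-objective learning" alternative mentioned above can be illustrated with a small sketch: one shared backbone trained jointly on the tissue-type task and a molecular trait, rather than in two separate steps. Everything here (the backbone, the heads, the unweighted loss sum, the stand-in data) is an assumption for illustration, not the authors’ implementation.

```python
# Hypothetical multi-objective variant of the two-step procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet18(weights=None)    # any CNN backbone; untrained here
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()                 # expose 512-d features

tissue_head = nn.Linear(feat_dim, 42)       # 42 tissue classes
trait_head = nn.Linear(feat_dim, 1)         # binary trait, e.g. WGD

x = torch.randn(8, 3, 224, 224)             # stand-in tile batch
tissue_y = torch.randint(0, 42, (8,))
trait_y = torch.randint(0, 2, (8,)).float()

feats = backbone(x)
loss = (F.cross_entropy(tissue_head(feats), tissue_y)
        + F.binary_cross_entropy_with_logits(trait_head(feats).squeeze(1),
                                             trait_y))
loss.backward()   # one gradient step optimises both objectives jointly
```

Joint training keeps in the shared representation whatever the molecular trait needs, which the two-step procedure can discard if it is irrelevant to tissue classification.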
Further, the ability to quantify a rich set of histopathology patterns lays out a path to define integrated histopathology and molecular cancer subtypes, as recently demonstrated for colorectal cancers47.
Lastly, our analyses provide a proof-of-concept for these principles, and we expect them to be greatly refined in the future based on larger training corpora and further algorithmic refinements.
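The prognostic claim above, that computational features augment conventional grading, amounts to comparing survival models with and without those features. Below is a hedged sketch using the lifelines library on simulated data; the column names and data layout are assumptions, not the preprint’s actual analysis.

```python
# Simulated-data sketch: does adding histopathological features to
# tumour grade improve a Cox proportional hazards model?
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "time": rng.exponential(36, n),        # follow-up in months (simulated)
    "event": rng.integers(0, 2, n),        # 1 = death observed
    "grade": rng.integers(1, 4, n),        # conventional tumour grade
    "hist_pc1": rng.normal(size=n),        # stand-ins for components of the
    "hist_pc2": rng.normal(size=n),        # computational feature space
})

base = CoxPHFitter().fit(df[["time", "event", "grade"]], "time", "event")
full = CoxPHFitter().fit(df, "time", "event")

# A higher concordance index for the full model would indicate prognostic
# signal beyond grade (on real data one would also validate on a held-out
# set or use a likelihood-ratio test).
print(base.concordance_index_, full.concordance_index_)
```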
https://www.biorxiv.org/content/biorxiv/early/2019/10/25/813543.full.pdf
During his 2016 State of the Union address, President Barack Obama called on Vice President Joe Biden – who had months earlier lost his son Beau to brain cancer – to head a “moonshot” to significantly accelerate research into the disease. The president said he wanted to harness the spirit of American innovation that took us from zero to landing a man on the moon in a decade to similarly find new ways to prevent, diagnose and treat cancer.
One of those intrigued by that call to action was Stanford’s Jan Liphardt, an associate professor of bioengineering who specializes in biophysics, the tumor microenvironment and data analysis. Stanford Engineering talked to Liphardt about how he came to be involved with the moonshot and his approach to using data and the voice of patients to better understand cancer and how it can be treated, and how sharing information can better inform the course of cancer research.
How did you get involved in the National Cancer Moonshot?
In March, after the president’s charge, the vice president challenged scientists, doctors, industry and patients to give their best ideas to the moonshot. The White House also reached out to a few outsiders, myself included. The White House instructions were unusual: “Do something big and different. There is no money and you have 87 days. Go.”
I like a challenge, and this was a chance to serve, even in the face of administrative hurdles. So I looked for advice, teammates and support. Russ Altman, a colleague at Stanford, suggested it was time to give patients a way to volunteer their own health data in order to help find cures. I collaborated with Peter Kuhn, a professor of medicine and engineering at the University of Southern California, who’s known for carefully listening to cancer patients, advocates and their supporters. In short order we had links with advocates like AnneMarie Ciccarella, Sonja Durham, Lori Marx-Rubiner, Jack Whelan and Jack Park. That’s how we got to CancerBase.org.
What’s the idea the team came up with?
We thought for about a week: What would matter to the patients that Stanford and other research institutions serve? What would scale? Well, we’re not going to run a clinical trial, go near protected health information, invent a new drug or write a research proposal. There’s no time for that. Whatever it was, it had to be useful, scalable, legal and different. That pointed to data, the web, patients and decisions.
One thing jumped out: Right now, there’s significant friction in medical data sharing. People all over the world can already effortlessly share other kinds of information – pictures, movies, ideas, stories, tweets. Increasingly, they are using the same tools to share personal medical information. It’s remarkable what cancer patients already share: diagnoses, genomes, pathology images. But that information is not yet widely used to understand where they are with their diseases.
Ideally, everyone, including scientists and doctors, would have as much information as possible at their fingertips. Many patients think when they give data for research, magically scientists all over the world can dig into this information, find patterns and help. The practical reality is that it’s nearly impossible for any one scientist to access the amounts of data they would like.
So that’s the simple idea: a global map, plus giving patients the tools they need to share their data – if they want to. They can donate information for the greater good. In return, we make a simple promise: when you post data, we’ll anonymize them and make them available to anyone on Earth in one second. We plan to display this information like real-time traffic data. HIPAA doesn’t apply to this direct data-sharing. Patients can give us whatever information they want, and they can tell us what they want us to do with it. We’re a conduit. Their data belong to them, not to us.
How does it work?
Today we ask just five basic questions. Over time we will add more. You join, give some information, and we’ll put you on a global map. Right now, some of the things we don’t know about cancer are incredibly simple: Where is everyone on Earth with cancer? How old are they? What is their diagnosis? Did their cancers metastasize? Global, instantaneous data sharing is the story.
In a second phase, we are going to see if we can plot all the information just like Waze does for traffic. Our role is to synthesize the information and plot it in ways that ordinary people can understand. Think of it this way – patients want to be able to chart their treatment path. Who went straight, who went left? People just getting on the highway are curious about what people did who came before them, and what happened to those people. Did they arrive at the destination easily and promptly? We’re a real-time diagnosis and therapy mapping service for cancer.
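As a thought experiment of what one shared record on such a map might look like, here is a hypothetical sketch. The interview names four of the five questions (location, age, diagnosis, metastasis); the field names, the omission of a fifth field, and the coordinate-coarsening rule are all assumptions, not CancerBase’s actual schema.

```python
# Hypothetical anonymized record for a CancerBase-style global map.
from dataclasses import dataclass, asdict
import json

@dataclass
class AnonymizedRecord:
    lat: float         # coarsened, so no home address is recoverable
    lon: float
    age_bracket: str   # e.g. "60-69", never a birth date
    diagnosis: str     # free text or a diagnosis code
    metastatic: bool

def coarsen(lat: float, lon: float, digits: int = 1) -> tuple[float, float]:
    """Round coordinates to roughly city-level resolution."""
    return round(lat, digits), round(lon, digits)

lat, lon = coarsen(37.4275, -122.1697)     # example: Stanford campus, blurred
record = AnonymizedRecord(lat, lon, "60-69", "breast cancer", True)
print(json.dumps(asdict(record)))          # what a public map layer would read
```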
You say that giving patients a way to share their health data is important to help find cures. Why?
Let me give you a specific example. At Stanford, I’m part of a team of cancer biologists and clinicians funded by the Stanford Cancer Institute to think about the next generation of screening for breast cancer in the U.S. Every year, the U.S. uses mammography to screen more than 40 million women for breast cancer. In this project, it quickly became clear that there is currently no central, easy-to-access repository of mammograms for research use.
That’s a major lost opportunity – our nation spends billions on screening, but we don’t store, share and analyze this information in a scalable and simple manner. In the traditional approach, our team would spend several hundred thousand dollars, and about three years, to assemble perhaps 1,000 mammograms. We would then use this tiny dataset to try to find something interesting, but since the dataset is so small, we would be blind to rare features of breast cancer and its predictors. It clearly makes a lot more sense to compare and explore 100 million images.
This sounds completely impossible until you realize that Instagram users upload 58 million images every day. Once you start to think about supposedly intractable research problems from a web or social networking perspective, new possibilities open up. Imagine, for example, if there were a simple way for every single woman on Earth to upload and share her de-identified mammogram. Or, more generally, imagine a world in which patients have the tools to globally share de-identified health data, if they want to. That’s exactly the idea behind CancerBase – let’s just give people those tools and see what happens.
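On the "de-identified mammogram" point, here is a minimal sketch of what stripping identifiers from a DICOM file could look like, using the pydicom library. This is illustrative only: the tag list below is far from a complete de-identification profile (DICOM PS3.15 defines the full one), and the file path is hypothetical.

```python
# Minimal, incomplete de-identification sketch for a DICOM mammogram.
import pydicom

ds = pydicom.dcmread("mammogram.dcm")          # hypothetical input file
for keyword in ("PatientName", "PatientID", "PatientBirthDate",
                "PatientAddress", "InstitutionName",
                "ReferringPhysicianName"):
    if keyword in ds:                          # blank the tag if present
        setattr(ds, keyword, "")
ds.remove_private_tags()                       # vendor-specific extras
ds.save_as("mammogram_deid.dcm")               # safer copy for sharing
```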
How much data and how many people are needed to make this viable?
We think we are going to need several tens of thousands of members. There are approximately 50 million people on Earth with a cancer diagnosed in the last five years, and 200 million more people have an immediate family member with cancer. Almost 2 billion people are active on Twitter and Facebook – a quarter of the world’s population. If just a few percent of those people sign up, we could do something no one on Earth has done before.
Are there hopes to create a “developer community,” people who find ways to use your data that you didn’t even think about or have the time to work on?
Definitely. As much as we think we can predict what these data are useful for, we don’t really know. By making the anonymized data available to everyone within one second, they might start to do things that we never dreamed of. The more eyes look at these data, the better off everyone will be. The dream is to have cancer-relevant medical data flow unimpeded around the world in seconds, so that everyone, wherever they are, can see and use this information.
SOURCE
https://engineering.stanford.edu/news/how-data-can-help-us-understand-cancer-and-its-treatment