Funding, Deals & Partnerships: BIOLOGICS & MEDICAL DEVICES; BioMed e-Series; Medicine and Life Sciences Scientific Journal – http://PharmaceuticalIntelligence.com
The Use of ChatGPT in the World of BioInformatics and Cancer Research and Development of BioGPT by MIT
Curator: Stephen J. Williams, Ph.D.
Chatbots are being used in multiple interdisciplinary areas of research and medicine, so it was a natural progression to incorporate artificial intelligence (AI), natural language processing (NLP), and chatbot technology like ChatGPT into bioinformatic analysis.
“In domains like informatics, management and marketing, media and communication science, languages and philosophy, psychology and sociology, engineering, design, and human-computer interaction, the fast expanding body of chatbot study is clearly interdisciplinary.”
The field of bioinformatics is a natural fit for incorporating this technology. The curated information below shows some example uses of this technology in bioinformatics related to cancer research.
How can ChatGPT be used in bioinformatics research?
ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model that was designed to generate human-like text in a conversational setting. It is not directly related to bioinformatics, which is the field of study that deals with the storage, retrieval, and analysis of biological data, particularly in the context of genetics and genomics. However, ChatGPT could potentially be used in a bioinformatics context as a tool for generating natural language descriptions of genetic or genomic data, or for generating responses to queries about such data.
For example, a researcher could use ChatGPT to generate descriptions of genetic variants or gene expression patterns based on input data. This could be useful for generating summaries of results for reports or papers, or for generating explanations of complex concepts for non-expert audiences. ChatGPT could also be used to generate responses to queries about genetic or genomic data, such as “What is the function of gene X?” or “What are the potential consequences of variant Y?”
It’s worth noting that ChatGPT is just one of many tools and techniques that can be used in bioinformatics, and it is unlikely to be the most suitable or effective option in every situation. There are many specialized tools and resources available for working with biological data, and it is often necessary to use a combination of these tools to fully analyze and understand the data.
Generating descriptions of genetic or genomic data: ChatGPT could be used to generate natural language descriptions of genetic or genomic data based on input data. For example, suppose a researcher has a dataset containing information about gene expression levels in different tissues. The researcher could use ChatGPT to generate a description of the data, such as:
“Gene X is highly expressed in the liver and kidney, with moderate expression in the brain and heart. Gene Y, on the other hand, shows low expression in all tissues except for the lung, where it is highly expressed.”
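A minimal sketch of how such input data might be turned into a prompt for a chat model. The function name `build_expression_prompt`, the level labels, and the dictionary layout are assumptions made for illustration; the article does not prescribe a format, and the resulting string would still need to be sent to a model separately.

```python
def build_expression_prompt(expression):
    """Turn {gene: {tissue: level}} into a plain-text prompt for a chat model."""
    lines = ["Describe the following gene expression data in plain English:"]
    for gene in sorted(expression):
        tissues = expression[gene]
        # One line per gene, tissues in alphabetical order for reproducibility.
        parts = [f"{tissue}={tissues[tissue]}" for tissue in sorted(tissues)]
        lines.append(f"{gene}: " + ", ".join(parts))
    return "\n".join(lines)


# Placeholder data mirroring the Gene X / Gene Y example in the text.
example = {
    "Gene X": {"liver": "high", "kidney": "high", "brain": "moderate", "heart": "moderate"},
    "Gene Y": {"lung": "high", "liver": "low"},
}
prompt = build_expression_prompt(example)
print(prompt)
```

Keeping the data-to-text step in ordinary code means the numbers reaching the model are exactly those in the dataset, so any generated description can be checked against them.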
Thus ChatGPT, at its simplest level, could be asked general questions like “What is the function of gene product X?” and could give a reasonable response without the scientist having to browse through even highly curated databases like GeneCards, UniProt, or GenBank. The same goes for a question like “What are potential interactors of Gene X, validated by yeast two-hybrid?”, which could be asked without going to the curated InterActome databases or using expensive software like Genie.
Summarizing results: ChatGPT could be used to generate summaries of results from genetic or genomic studies. For example, a researcher might use ChatGPT to generate a summary of a study that found an association between a particular genetic variant and a particular disease. The summary might look something like this:
“Our study found that individuals with the variant form of gene X are more likely to develop disease Y. Further analysis revealed that this variant is associated with changes in gene expression that may contribute to the development of the disease.”
It’s worth noting that ChatGPT is just one tool that could potentially be used in these types of applications, and it is likely to be most effective when used in combination with other bioinformatics tools and resources. For example, a researcher might use ChatGPT to generate a summary of results, but would also need to use other tools to analyze the data and confirm the findings.
ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model that is designed for open-domain conversation. It is not specifically designed for generating descriptions of genetic variants or gene expression patterns, but it can potentially be used for this purpose if you provide it with a sufficient amount of relevant training data and fine-tune it appropriately.
To use ChatGPT to generate descriptions of genetic variants or gene expression patterns, you would first need to obtain a large dataset of examples of descriptions of genetic variants or gene expression patterns. You could use this dataset to fine-tune the ChatGPT model on the task of generating descriptions of genetic variants or gene expression patterns.
Here’s an example of how you might use ChatGPT to generate a description of a genetic variant:
First, you would need to pre-process your dataset of descriptions of genetic variants to prepare it for use with ChatGPT. This might involve splitting the descriptions into individual sentences or phrases, and encoding them using a suitable natural language processing (NLP) library or tool.
Next, you would need to fine-tune a model on the task of generating descriptions of genetic variants. Because ChatGPT’s own weights are not publicly released, in practice this could involve using a tool like Hugging Face’s Transformers library to load an openly available GPT-style model and your pre-processed dataset, and then training the model on the task using an appropriate optimization algorithm.
Once the model has been fine-tuned, you can use it to generate descriptions of genetic variants by providing it with a prompt or seed text and asking it to generate a response. For example, you might provide the model with the prompt “Generate a description of a genetic variant associated with increased risk of breast cancer,” and ask it to generate a response. The model should then generate a description of a genetic variant that is associated with increased risk of breast cancer.
It’s worth noting that generating high-quality descriptions of genetic variants or gene expression patterns is a challenging task, and it may be difficult to achieve good results using a language model like ChatGPT without a large amount of relevant training data and careful fine-tuning.
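The pre-processing step described above can be sketched as follows. This is only an illustration under stated assumptions: the sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and the `prompt`/`completion` record layout is a hypothetical choice, not a format required by any particular fine-tuning tool.

```python
import json
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_jsonl(records):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in records
    )


# Placeholder description, not real variant data.
sentences = split_sentences("Variant A alters splicing. It is rare.")
print(sentences)
print(to_jsonl([("Describe variant A", sentences[0])]))
```

A production pipeline would more likely rely on a dedicated NLP library for sentence segmentation; the point here is only the shape of the data that a fine-tuning run consumes.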
To train a language model like ChatGPT to extract information about specific genes or diseases from research papers, you would need to follow these steps:
Gather a large dataset of research papers that contain information about the specific genes or diseases you are interested in. This dataset should be diverse and representative of the types of papers you want the model to be able to extract information from.
Preprocess the text data in the research papers by tokenizing the text and creating a vocabulary. You may also want to consider lemmatizing or stemming the text to reduce the dimensionality of the dataset.
Train the language model on the preprocessed text data. You may want to fine-tune a pre-trained model such as ChatGPT on your specific dataset, or you can train a new model from scratch.
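The tokenizing and vocabulary-building step in the list above might look like the sketch below. It is a simplified word-level illustration; the function names and the `min_count` threshold are assumptions, and current large language models generally use subword tokenizers rather than a word list like this.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, keeping hyphenated names intact."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_vocabulary(documents, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


# Placeholder sentences standing in for research-paper text.
docs = ["TP53 binds DNA", "TP53 mutations"]
print(tokenize("BRCA1 mutations raise breast-cancer risk."))
print(build_vocabulary(docs, min_count=2))
```

Raising `min_count` is one simple way to shrink the vocabulary, which serves the same goal as the lemmatizing or stemming mentioned in the steps.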
ChatGPT could also be useful for sequence analysis
A few examples of sequence analysis tasks for which ChatGPT could be useful include:
Protein structure
Identifying functional regions of a protein
Predicting protein-protein interactions
Identifying protein homologs
Generating Protein alignments
All this could be done without having access to UNIX servers or proprietary software, or knowing GCG coding.
ChatGPT in biomedical research
There are several potential ways that ChatGPT or other natural language processing (NLP) models could be applied in biomedical research:
Text summarization: ChatGPT or other NLP models could be used to summarize large amounts of text, such as research papers or clinical notes, in order to extract key information and insights more quickly.
Data extraction: ChatGPT or other NLP models could be used to extract structured data from unstructured text sources, such as research papers or clinical notes. For example, the model could be trained to extract information about specific genes or diseases from research papers, and then used to create a database of this information for further analysis.
Literature review: ChatGPT or other NLP models could be used to assist with literature review tasks, such as identifying relevant papers, extracting key information from papers, or summarizing the main findings of a group of papers.
Predictive modeling: ChatGPT or other NLP models could be used to build predictive models based on large amounts of text data, such as electronic health records or research papers. For example, the model could be trained to predict the likelihood of a patient developing a particular disease based on their medical history and other factors.
It’s worth noting that while NLP models like ChatGPT have the potential to be useful tools in biomedical research, they are only as good as the data they are trained on, and it is important to carefully evaluate the quality and reliability of any results generated by these models.
ChatGPT in text mining of biomedical data
ChatGPT could potentially be used for text mining in the biomedical field in a number of ways. Here are a few examples:
Extracting information from scientific papers: ChatGPT could be trained on a large dataset of scientific papers in the biomedical field, and then used to extract specific pieces of information from these papers, such as the names of compounds, their structures, and their potential uses.
Generating summaries of scientific papers: ChatGPT could be used to generate concise summaries of scientific papers in the biomedical field, highlighting the main findings and implications of the research.
Identifying trends and patterns in scientific literature: ChatGPT could be used to analyze large datasets of scientific papers in the biomedical field and identify trends and patterns in the data, such as emerging areas of research or common themes among different papers.
Generating questions for further research: ChatGPT could be used to suggest questions for further research in the biomedical field based on existing scientific literature, by identifying gaps in current knowledge or areas where further investigation is needed.
Generating hypotheses for scientific experiments: ChatGPT could be used to generate hypotheses for scientific experiments in the biomedical field based on existing scientific literature and data, by identifying potential relationships or associations that could be tested in future research.
PLEASE WATCH VIDEO
In this video, a bioinformatician describes the ways he uses ChatGPT to increase his productivity in writing bioinformatic code and conducting bioinformatic analyses.
He describes a series of uses of ChatGPT in his day-to-day work as a bioinformatician:
Using ChatGPT as a search engine: He finds more useful and relevant search results than a standard Google or Yahoo search. This saves time as one does not have to pore through multiple pages to find information. However, a caveat is that ChatGPT does NOT return sources, as highlighted in previous postings on this page. This feature of ChatGPT is probably why Microsoft invested heavily in OpenAI in order to incorporate ChatGPT into its Bing search engine, as well as its Office suite programs.
ChatGPT to help with coding projects: Bioinformaticians can spend multiple hours searching for and altering openly available code in order to run certain functions, like determining the G/C content of DNA (although much UNIX-based code has already been established for these purposes). One can use ChatGPT to find such code and then assist in debugging it.
ChatGPT to document and add coding comments: When writing code it is useful to add comments periodically to help other users understand how the code and the program flow work, including returned variables.
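As an illustration of the kind of small, commented utility discussed in this list, here is a minimal sketch of a G/C content function. The name `gc_content` and the decision to ignore ambiguous bases such as N are assumptions made for this example, not code shown in the video.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases among the A/C/G/T bases of a DNA sequence."""
    seq = sequence.upper()
    # Count only unambiguous bases; N, gaps and whitespace are skipped.
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


print(gc_content("ATGC"))
```

Code of this size is easy to verify by hand, which is what makes it a reasonable candidate for generation or debugging with a chatbot.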
One of the comments was interesting and directed one to use BioGPT instead of ChatGPT:
0:54 oh dear. You cannot use chatgpt like that in Bioinformatics as it is rn without double checking the info from it. You should be using biogpt instead for paper summarisation. ChatGPT goes for human-like responses over precise information recal. It is quite good for debugging though and automating boring awkward scripts
The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
The abstract from the paper is the following:
Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.
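For readers unfamiliar with the F1 scores quoted in the abstract, the sketch below shows how precision, recall and F1 could be computed for relation extraction. Treating each relation as an exact-match (head, relation, tail) triple is an assumption made here; the benchmarks' own matching rules are not described in this article, and the triples are placeholders rather than real data.

```python
def relation_f1(predicted, gold):
    """Micro-averaged precision, recall and F1 over sets of relation triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical triples: one prediction matches the gold set, one does not.
gold = {("drug_a", "interacts_with", "drug_b"), ("drug_c", "inhibits", "target_x")}
pred = {("drug_a", "interacts_with", "drug_b"), ("drug_d", "inhibits", "target_y")}
print(relation_f1(pred, gold))
```

Because F1 is the harmonic mean of precision and recall, a model has to both find the relations in the text and avoid inventing spurious ones to score well.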
Tips:
BioGPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
The model can take the past_key_values (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
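The right-padding tip can be illustrated without any machine learning library. In the sketch below, `pad_right` and the `pad_id` value are hypothetical; in practice a tokenizer would normally do this step. The idea is that with absolute position embeddings, padding on the right leaves the real tokens at positions starting from zero.

```python
def pad_right(batch, pad_id=0):
    """Pad token-id lists on the right to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Real tokens keep positions 0..len(ids)-1; padding fills the tail.
        padded.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Arbitrary token ids chosen for the example.
input_ids, attention_mask = pad_right([[5, 6, 7], [8]], pad_id=1)
print(input_ids)
print(attention_mask)
```

Left padding would instead shift the short sequence's real token to a later position index, which is why it is usually discouraged for models of this kind.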
This model was contributed by kamalkraj. The original code can be found here.
BioGPT is MIT-licensed. The license applies to the pre-trained models as well.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
As of right now this does not seem to be open access; a sign-up is required.
We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both through URL download as well as through the Hugging Face 🤗 Hub.
Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu TY. Brief Bioinform. 2022 Nov 19;23(6):bbac409. doi: 10.1093/bib/bbac409. PMID: 36156661
Here they show the advantage of BioGPT over GPT-2 in an example of text generation:
Table 7. Example of generated text
Input: Bicalutamide
GPT-2: Bicalutamide, a synthetic bicalutamide derivative, was found to increase the cellular proliferation of C. elegans cells. This effect was not observed with the other synthetic bicalutamide derivatives. This is in line with our previous work which showed that bicalutamide increases the proliferation of C. elegans cells but not in the presence of synthetic bicalutamide.
BioGPT: Bicalutamide (Casodex) is an androgen receptor (AR) antagonist approved for the treatment of metastatic castration-resistant prostate cancer (mCRPC) in patients who have progressed on or are ineligible for docetaxel chemotherapy, as well as for the treatment of early-stage prostate cancer in men who have not progressed on or are ineligible for docetaxel chemotherapy, as well as for the treatment of metastatic castration-sensitive prostate cancer (mCSPC) in men who have not progressed on or are ineligible for docetaxel chemotherapy.
Input: Janus kinase 3 (JAK-3)
GPT-2: Janus kinase 3 (JAK-3) mediates the conversion of glucose to glycogen in skeletal muscle. The increase in glucose uptake by skeletal muscle is believed to be associated with a reduction in muscle protein breakdown and with the release of ketone bodies, which contribute to the metabolism of glucose.
BioGPT: Janus kinase 3 (JAK-3) is a member of the Janus kinase (JAK) family of non-receptor tyrosine kinases and plays an important role in the regulation of cell proliferation, differentiation, survival, migration and angiogenesis.
Input: Apricitabine
GPT-2: Apricitabine, duloxetine and citalopram) were investigated in patients with a history of treatment-resistant depression (n = 10). We found that treatment with the three drugs significantly decreased depressive symptoms, but not depressive symptoms over the next 12 weeks. We conclude that antidepressant drugs that reduce depressive symptoms and reduce depressive symptoms over time may be beneficial in patients with recurrent depression.
BioGPT: Apricitabine is an oral prodrug of 5-aza-2’-deoxycytidine (5-aza-CdR), a DNA methyltransferase (DNMT) inhibitor, which has been approved by the US Food and Drug Administration (FDA) for the treatment of myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) in combination with low-dose cytarabine (Ara-C) and granulocyte colony-stimulating factor (G-CSF) for patients with intermediate-2 or high-risk MDS or AML.
Input: Xylazine
GPT-2: Xylazine, the ‘active ingredient’ in ‘bath salts’ (see above) has been linked to numerous deaths, including those of former University of California, Berkeley, professor Andrew Anglin, who posted a video of himself having a bath filled with it.
BioGPT: Xylazine is an alpha 2-adrenoceptor agonist which has been used as a sedative and analgesic in veterinary medicine for many years, but its effects on the cardiovascular system have not been extensively studied in the dog, and its effects on the central nervous system (CNS) have not been well characterized in the dog, despite the fact that xylazine has been widely used as a sedative and analgesic in veterinary medicine for more than 30 years.
Input: Psoralen
GPT-2: Psoralen, the city of the spiritualist, and had been given to him by God, who wished to show him the way of salvation, and to draw him back to the true life of Christ, and to bring him into communion with his fellow-man.
BioGPT: Psoralen photochemotherapy (PUVA) is a well-established treatment for psoriasis, but its use is limited by the risk of skin cancer, particularly squamous cell carcinoma (SCC) of the head and neck (H & N), which is the most common site of skin cancer in the United States (US)
Input: CP-673451
GPT-2: CP-673451) and 4-5-6-3-1-2-3-4-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-
BioGPT: CP-673451 is a potent, selective, and orally active inhibitor of human neutrophil elastase (HNE) and human cathepsin G (CatG) with in vitro and in vivo anti-inflammatory activity in a variety of animal models of inflammation and in a model of acute lung injury (ALI) in the rat induced by intratracheal instillation of lipopolysaccharide (LPS) and tumor necrosis factor-alpha (TNF-alpha), a model of acute lung injury (ALI) in which neutrophils play an important role.
Input: BIIB-021
GPT-2: BIIB-021A-2-1.pdf. Wright, C. (2002). ‘The War on Terror: A Legal Theory of Terrorist Attacks,’ in L.E.R. Wright, S.R. Warrick and J.H. Rothman (Eds.), The War on Terror: Essays in Honor of Michael T. Klare (New York: Oxford University Press), 143-57.
BioGPT: BIIB-021 is a novel, orally active, non-peptide bradykinin B2 receptor antagonist with potent and long-lasting anti-inflammatory activity in animal models of acute and chronic inflammation and in a rat model of adjuvant-induced arthritis (AIA), an animal model of rheumatoid arthritis (RA) and in a rat model of collagen-induced arthritis (CIA), an animal model of collagen-induced arthritis (CIA), in which arthritis is induced by immunization with bovine type II collagen (CII).
Huang L, Lin J, Li X, Song L, Zheng Z, Wong KC. Brief Bioinform. 2022 Jan 17;23(1):bbab451. doi: 10.1093/bib/bbab451. PMID: 34791012
The rapid growth in literature accumulates diverse and yet comprehensive biomedical knowledge hidden to be mined such as drug interactions. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT which has been comprehensively pretrained on biomedical corpus. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model BioGPT-2 where the generation sentences are selected based on filtering rules.
Results: We evaluated the classification part on ‘DDIs 2013’ dataset and ‘DTIs’ dataset, achieving the F1 scores of 0.842 and 0.720 respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified with the existing growth truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships.
Jin Q, Yang Y, Chen Q, Lu Z. ArXiv. 2023 May 16:arXiv:2304.09667v3. Preprint. PMID: 37131884. Free PMC article.
While large language models (LLMs) have been successfully applied to various tasks, they still face challenges with hallucinations. Augmenting LLMs with domain-specific tools such as database utilities can facilitate easier and more precise access to specialized knowledge. In this paper, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, largely surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: (1) API demonstrations have good cross-task generalizability and are more useful than documentations for in-context learning; (2) GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; (3) Different types of errors are enriched in different tasks, providing valuable insights for future improvements.
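To make the idea of an LLM calling NCBI Web APIs more concrete, the sketch below builds an E-utilities `esearch` URL and scans generated text for such calls. It is not GeneGPT's implementation: the `[URL]` bracket marker, the function names and the sample text are assumptions for illustration, and no network request is made.

```python
import re
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """Build an NCBI E-utilities esearch URL (nothing is fetched here)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def find_api_calls(generated_text):
    """Return E-utilities URLs that appear in square brackets in model output.

    The bracket convention is a hypothetical marker chosen for this sketch.
    """
    pattern = r"\[(https://eutils\.ncbi\.nlm\.nih\.gov/[^\]\s]+)\]"
    return re.findall(pattern, generated_text)


url = esearch_url("gene", "BRCA1")
print(url)
print(find_api_calls(f"To answer, look up [{url}] and read the result."))
```

An augmented decoding loop would pause generation when it finds such a call, execute the request, and append the response to the context before continuing, so that the answer comes from the database rather than from the model's memory.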
PLEASE WATCH THE FOLLOWING VIDEOS ON BIOGPT
This one entitled
Microsoft’s BioGPT Shows Promise as the Best Biomedical NLP
gives a good general description of this new Microsoft project and its usefulness in scanning 15 million PubMed abstracts while returning ChatGPT-like answers.
Please note one of the comments which is VERY IMPORTANT
bioGPT is difficult for non-developers to use, and Microsoft researchers seem to default that all users are proficient in Python and ML.
Much like Microsoft Azure, it seems BioGPT is meant for developers who have advanced programming skills. It seems odd, then, to be paying programmers multiK salaries when one or two Key Opinion Leaders from the medical field might suffice, but I am sure Microsoft will figure this out.
ALSO VIEW VIDEO
This is a talk from Microsoft on BioGPT
Other Relevant Articles on Natural Language Processing in BioInformatics, Healthcare and ChatGPT for Medicine on this Open Access Scientific Journal Include
Use of Systems Biology for Design of inhibitor of Galectins as Cancer Therapeutic – Strategy and Software
Curator: Stephen J. Williams, Ph.D.
Below is a slide representation of the overall mission 4 to produce a PROTAC to inhibit Galectins 1, 3, and 9.
Using A Priori Knowledge of Galectin Receptor Interaction to Create a BioModel of Galectin 3 Binding
Now, after collecting literature from PubMed on “galectin-3” AND “binding” to identify articles containing kinetic data, we generate a WordCloud from the articles.
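A word cloud is essentially a picture of term frequencies, so the counting step behind it can be sketched as below. The stopword list, the function name and the two sample strings are placeholders for illustration; they are not the curated PubMed articles, and drawing the image itself would require a separate plotting package.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; real pipelines use much longer ones.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "with", "is", "by", "for"}


def word_frequencies(abstracts, top_n=10):
    """Count non-stopword terms across abstracts; a word cloud scales words by these counts."""
    counts = Counter()
    for text in abstracts:
        for word in re.findall(r"[a-z][a-z0-9-]*", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Placeholder text standing in for article abstracts.
sample = ["Galectin-3 binding to the CRD", "Binding of galectin-3"]
print(word_frequencies(sample, top_n=2))
```

Keeping hyphenated tokens together matters here, since splitting "galectin-3" would scatter the very term the search was built around.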
This following file contains the articles needed for BioModels generation.
From the WordCloud we can see that this corpus of articles describes galectin binding to the CRD (carbohydrate recognition domain). Interestingly, there are many articles which describe van der Waals interactions as well as electrostatic interactions. Certain carbohydrate modifications like Lac NAc and Gal 1,4 may be important. Many articles describe the bonding as well as surface interactions. Many studies have been performed with galectin inhibitors like TDGs (thio-digalactosides), such as TAZ TDG (3-deoxy-3-(4-[m-fluorophenyl]-1H-1,2,3-triazol-1-yl)-thio-digalactoside). This led to an interesting article:
Dual thio-digalactoside-binding modes of human galectins as the structural basis for the design of potent and selective inhibitors
Human galectins are promising targets for cancer immunotherapeutic and fibrotic disease-related drugs. We report herein the binding interactions of three thio-digalactosides (TDGs) including TDG itself, TD139 (3,3′-deoxy-3,3′-bis-(4-[m-fluorophenyl]-1H-1,2,3-triazol-1-yl)-thio-digalactoside, recently approved for the treatment of idiopathic pulmonary fibrosis), and TAZTDG (3-deoxy-3-(4-[m-fluorophenyl]-1H-1,2,3-triazol-1-yl)-thio-digalactoside) with human galectins-1, -3 and -7 as assessed by X-ray crystallography, isothermal titration calorimetry and NMR spectroscopy. Five binding subsites (A-E) make up the carbohydrate-recognition domains of these galectins. We identified novel interactions between an arginine within subsite E of the galectins and an arene group in the ligands. In addition to the interactions contributed by the galactosyl sugar residues bound at subsites C and D, the fluorophenyl group of TAZTDG preferentially bound to subsite B in galectin-3, whereas the same group favored binding at subsite E in galectins-1 and -7. The characterised dual binding modes demonstrate how binding potency, reported as decreased Kd values of the TDG inhibitors from μM to nM, is improved and also offer insights to development of selective inhibitors for individual galectins.
Figures
Figure 1. Chemical structures of L3, TDG…
Figure 2. Structural comparison of the carbohydrate…
From High-Throughput Assay to Systems Biology: New Tools for Drug Discovery
Curator: Stephen J. Williams, PhD
Marc W. Kirschner*
Department of Systems Biology Harvard Medical School
Boston, Massachusetts 02115
With the new excitement about systems biology, there is understandable interest in a definition. This has proven somewhat difficult. Scientific fields, like species, arise by descent with modification, so in their earliest forms even the founders of great dynasties are only marginally different than their sister fields and species. It is only in retrospect that we can recognize the significant founding events. Before embarking on a definition of systems biology, it may be worth remembering that confusion and controversy surrounded the introduction of the term “molecular biology,” with claims that it hardly differed from biochemistry. Yet in retrospect molecular biology was new and different. It introduced both new subject matter and new technological approaches, in addition to a new style.
As a point of departure for systems biology, consider the quintessential experiment in the founding of molecular biology, the one gene one enzyme hypothesis of Beadle and Tatum. This experiment first connected the genotype directly to the phenotype on a molecular level, although efforts in that direction can certainly be found in the work of Archibald Garrod, Sewell Wright, and others. Here a protein (in this case an enzyme) is seen to be a product of a single gene, and a single function; the completion of a specific step in amino acid biosynthesis is the direct result. It took the next 30 years to fill in the gaps in this process. Yet the one gene one enzyme hypothesis looks very different to us today. What is the function of tubulin, of PI-3 kinase or of rac? Could we accurately predict the phenotype of a nonlethal mutation in these genes in a multicellular organism? Although we can connect structure to the gene, we can no longer infer its larger purpose in the cell or in the organism. There are too many purposes; what the protein does is defined by context. The context also includes a history, either developmental or physiological. Thus the behavior of the Wnt signaling pathway depends on the previous lineage, the “where and when” questions of embryonic development. Similarly the behavior of the immune system depends on previous experience in a variable environment. All of these features stress how inadequate an explanation for function we can achieve solely by trying to identify genes (by annotating them!) and characterizing their transcriptional control circuits.
That we are at a crossroads in how to explore biology is not at all clear to many. Biology is hardly in its dotage; the process of discovery seems to have been perfected, accelerated, and made universally applicable to all fields of biology. With the completion of the human genome and the genomes of other species, we have a glimpse of many more genes than we ever had before to study. We are like naturalists discovering a new continent, enthralled with the diversity itself. But we have also at the same time glimpsed the finiteness of this list of genes, a disturbingly small list. We have seen that the diversity of genes cannot approximate the diversity of functions within an organism. In response, we have argued that combinatorial use of small numbers of components can generate all the diversity that is needed. This has had its recent incarnation in the simplistic view that the rules of cis-regulatory control on DNA can directly lead to an understanding of organisms and their evolution. Yet this assumes that the gene products can be linked together in arbitrary combinations, something that is not assured in chemistry. It also downplays the significant regulatory features that involve interactions between gene products, their localization, binding, posttranslational modification, degradation, etc. The big question to understand in biology is not regulatory linkage but the nature of biological systems that allows them to be linked together in many nonlethal and even useful combinations. More and more we come to realize that understanding the conserved genes and their conserved circuits will require an understanding of their special properties that allow them to function together to generate different phenotypes in different tissues of metazoan organisms. These circuits may have certain robustness, but more important they have adaptability and versatility. 
The ease of putting conserved processes under regulatory control is an inherent design feature of the processes themselves. Among other things it loads the deck in evolutionary variation and makes it more feasible to generate useful phenotypes upon which selection can act.
Systems biology offers an opportunity to study how the phenotype is generated from the genotype and with it a glimpse of how evolution has crafted the phenotype. One aspect of systems biology is the development of techniques to examine broadly the level of protein, RNA, and DNA on a gene by gene basis and even the posttranslational modification and localization of proteins. In a very short time we have witnessed the development of high-throughput biology, forcing us to consider cellular processes in toto. Even though much of the data is noisy and today partially inconsistent and incomplete, this has been a radical shift in the way we tear apart problems one interaction at a time. When coupled with gene deletions by RNAi and classical methods, and with the use of chemical tools tailored to proteins and protein domains, these high-throughput techniques become still more powerful.
High-throughput biology has opened up another important area of systems biology: it has brought us out into the field again or at least made us aware that there is a world outside our laboratories. Our model systems have been chosen intentionally to be of limited genetic diversity and examined in a highly controlled and reproducible environment. The real world of ecology, evolution, and human disease is a very different place. When genetics separated from the rest of biology in the early part of the 20th century, most geneticists sought to understand heredity and chose to study traits in the organism that could be easily scored and could be used to reveal genetic mechanisms. This was later extended to powerful effect to use genetics to study cell biological and developmental mechanisms. Some geneticists, including a large school in Russia in the early 20th century, continued to study the genetics of natural populations, focusing on traits important for survival. That branch of genetics is coming back strongly with the power of phenotypic assays on the RNA and protein level. As human beings we are most concerned not with using our genetic misfortunes to unravel biology’s complexity (important as that is) but with the role of our genetics in our individual survival. The context for understanding this is still not available, even though the data are now coming in torrents, for many of the genes that will contribute to our survival will have small quantitative effects, partially masked or accentuated by other genetic and environmental conditions. To understand the genetic basis of disease will require not just mapping these genes but an understanding of how the phenotype is created in the first place and the messy interactions between genetic variation and environmental variation.
Extracts and explants are relatively accessible to synthetic manipulation. Next there is the explicit reconstruction of circuits within cells or the deliberate modification of those circuits. This has occurred for a while in biology, but the difference is that now we wish to construct or intervene with the explicit purpose of describing the dynamical features of these synthetic or partially synthetic systems. There are more and more tools to intervene and more and more tools to measure. Although these fall short of total descriptions of cells and organisms, the detailed information will give us a sense of the special life-like processes of circuits, proteins, cells in tissues, and whole organisms in their environment. This meso-scale systems biology will help establish the correspondence between molecules and large-scale physiology.
You are probably running out of patience for some definition of systems biology. In any case, I do not think the explicit definition of systems biology should come from me but should await the words of the first great modern systems biologist. She or he is probably among us now. However, if forced to provide some kind of label for systems biology, I would simply say that systems biology is the study of the behavior of complex biological organization and processes in terms of the molecular constituents. It is built on molecular biology in its special concern for information transfer, on physiology for its special concern with adaptive states of the cell and organism, on developmental biology for the importance of defining a succession of physiological states in that process, and on evolutionary biology and ecology for the appreciation that all aspects of the organism are products of selection, a selection we rarely understand on a molecular level. Systems biology attempts all of this through quantitative measurement, modeling, reconstruction, and theory. Systems biology is not a branch of physics but differs from physics in that the primary task is to understand how biology generates variation. No such imperative to create variation exists in the physical world. It is a new principle that Darwin understood and upon which all of life hinges. That sounds different enough for me to justify a new field and a new name. Furthermore, the success of systems biology is essential if we are to understand life; its success is far from assured—a good field for those seeking risk and adventure.
Biologically active small molecules have a central role in drug development, and as chemical probes and tool compounds to perturb and elucidate biological processes. Small molecules can be rationally designed for a given target, or a library of molecules can be screened against a target or phenotype of interest. Especially in the case of phenotypic screening approaches, a major challenge is to translate the compound-induced phenotype into a well-defined cellular target and mode of action of the hit compound. There is no “one size fits all” approach, and recent years have seen an increase in available target deconvolution strategies, rooted in organic chemistry, proteomics, and genetics. This review provides an overview of advances in target identification and mechanism of action studies, describes the strengths and weaknesses of the different approaches, and illustrates the need for chemical biologists to integrate and expand the existing tools to increase the probability of evolving screen hits to robust chemical probes.
5.1.5. Large-Scale Proteomics
While FITExP is based on protein expression regulation during apoptosis, a study of Ruprecht et al. showed that proteomic changes are induced both by cytotoxic and non-cytotoxic compounds, which can be detected by mass spectrometry to give information on a compound’s mechanism of action. They developed a large-scale proteome-wide mass spectrometry analysis platform for MOA studies, profiling five lung cancer cell lines with over 50 drugs. Aggregation analysis over the different cell lines and the different compounds showed that one-quarter of the drugs changed the abundance of their protein target. This approach allowed target confirmation of molecular degraders such as PROTACs or molecular glues. Finally, this method yielded unexpected off-target mechanisms for the MAP2K1/2 inhibitor PD184352 and the ALK inhibitor ceritinib [97]. While such a mapping approach clearly provides a wealth of information, it might not be easily attainable for groups that are not equipped for high-throughput endeavors.
All-in-all, mass spectrometry methods have gained a lot of traction in recent years and have been successfully applied for target deconvolution and MOA studies of small molecules. As with all high-throughput methods, challenges lie in the accessibility of the instruments (both from a time and cost perspective) and data analysis of complex and extensive data sets.
5.2. Genetic Approaches
Both label-based and mass spectrometry proteomic approaches are based on the physical interaction between a small molecule and a protein target, and focus on the proteome for target deconvolution. It has been long realized that genetics provides an alternative avenue to understand a compound’s action, either through precise modification of protein levels, or by inducing protein mutations. First realized in yeast as a genetically tractable organism over 20 years ago, recent advances in genetic manipulation of mammalian cells have opened up important opportunities for target identification and MOA studies through genetic screening in relevant cell types [98]. Genetic approaches can be roughly divided into two main areas, with the first centering on the identification of mutations that confer compound resistance (Figure 3a), and the second on genome-wide perturbation of gene function and the concomitant changes in sensitivity to the compound (Figure 3b). While both methods can be used to identify or confirm drug targets, the latter category often provides many additional insights in the compound’s mode of action.
Figure 3. Genetic methods for target identification and mode of action studies. Schematic representations of (a) resistance cloning, and (b) chemogenetic interaction screens.
5.2.1. Resistance Cloning
The “gold standard” in drug target confirmation is to identify mutations in the presumed target protein that render it insensitive to drug treatment. Conversely, different groups have sought to use this principle as a target identification method based on the concept that cells grown in the presence of a cytotoxic drug will either die or develop mutations that will make them resistant to the compound. With recent advances in deep sequencing it is now possible to then scan the transcriptome [99] or genome [100] of the cells for resistance-inducing mutations. Genes that are mutated are then hypothesized to encode the protein target. For this approach to be successful, there are two initial requirements: (1) the compound needs to be cytotoxic for resistant clones to arise, and (2) the cell line needs to be genetically unstable for mutations to occur in a reasonable timeframe.
In 2012, the Kapoor group demonstrated in a proof-of-concept study that resistance cloning in mammalian cells, coupled to transcriptome sequencing (RNA-seq), yields the known polo-like kinase 1 (PLK1) target of the small molecule BI 2536. For this, they used the cancer cell line HCT-116, which is deficient in mismatch repair and consequently prone to mutations. They generated and sequenced multiple resistant clones, and clustered the clones based on similarity. PLK1 was the only gene that was mutated in multiple groups. Of note, one of the groups did not contain PLK1 mutations, but rather developed resistance through upregulation of ABCB1, a drug efflux transporter, which is a general and non-specific resistance mechanism [101]. In a follow-up study, they optimized their pipeline “DrugTargetSeqR” by counter-screening for these types of multidrug resistance mechanisms so that these clones were excluded from further analysis (Figure 3a). Furthermore, they used CRISPR/Cas9-mediated gene editing to determine which mutations were sufficient to confer drug resistance, and as independent validation of the biochemical relevance of the obtained hits [102].
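The grouping logic described above (look for genes mutated in several independent clone groups, after setting aside groups with non-specific multidrug resistance) can be sketched in a few lines. This is only an illustration under assumed inputs, not the DrugTargetSeqR pipeline itself: the function name, the input format, the placeholder gene names, and the toy data are all hypothetical.

```python
from collections import Counter

def recurrent_genes(mutations_by_group, excluded_groups=frozenset(), min_groups=2):
    """Return genes mutated in at least `min_groups` independent clone groups.

    mutations_by_group: dict mapping a clone-group label to the set of genes
        found mutated in that group (hypothetical input format).
    excluded_groups: groups flagged by a counter-screen as multidrug resistant.
    """
    counts = Counter()
    for group, genes in mutations_by_group.items():
        if group in excluded_groups:
            continue  # skip groups with a non-specific resistance mechanism
        counts.update(set(genes))  # count each gene at most once per group
    return sorted(gene for gene, n in counts.items() if n >= min_groups)

# Toy data, invented for illustration only.
groups = {
    "group_1": {"PLK1", "GENE_A"},
    "group_2": {"PLK1", "GENE_B"},
    "group_3": {"PLK1"},
    "group_4": {"GENE_C"},  # assumed to be resistant via drug efflux
}
candidates = recurrent_genes(groups, excluded_groups={"group_4"})
print(candidates)
```

Genes returned by such a filter are only hypotheses; as the text notes, sufficiency of each mutation still has to be tested, for example by gene editing.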
While HCT-116 cells are a useful model cell line for resistance cloning because of their genomic instability, they may not always be the cell line of choice, depending on the compound and process that is studied. Povedano et al. used CRISPR/Cas9 to engineer mismatch repair deficiencies in Ewing sarcoma cells and small cell lung cancer cells. They found that deletion of MSH2 results in hypermutations in these normally mutationally silent cells, resulting in the formation of resistant clones in the presence of bortezomib, MLN4924, and CD437, which are all cytotoxic compounds [103]. Recently, Neggers et al. reasoned that CRISPR/Cas9-induced non-homologous end-joining repair could be a viable strategy to create a wide variety of functional mutants of essential genes through in-frame mutations. Using a tiled sgRNA library targeting 75 target genes of investigational neoplastic drugs in HAP1 and K562 cells, they generated several KPT-9274 (an anticancer agent with unknown target)-resistant clones, and subsequent deep sequencing showed that the resistant clones were enriched in NAMPT sgRNAs. Direct target engagement was confirmed by co-crystallizing the compound with NAMPT [104]. In addition to these genetic mutation strategies, an alternative method is to grow the cells in the presence of a mutagenic chemical to induce higher mutagenesis rates [105,106].
When there is already a hypothesis on the pathway involved in compound action, the resistance cloning methodology can be extended to non-cytotoxic compounds. Sekine et al. developed a fluorescent reporter model for the integrated stress response, and used this cell line for target deconvolution of a small molecule inhibitor towards this pathway (ISRIB). Reporter cells were chemically mutagenized, and ISRIB-resistant clones were isolated by flow cytometry, yielding clones with various mutations in the delta subunit of guanine nucleotide exchange factor eIF2B [107].
While there are certainly successful examples of resistance cloning yielding a compound’s direct target as discussed above, resistance could also be caused by mutations or copy number alterations in downstream components of a signaling pathway. This is illustrated by clinical examples of acquired resistance to small molecules, nature’s way of “resistance cloning”. For example, resistance mechanisms in Hedgehog pathway-driven cancers towards the Smoothened inhibitor vismodegib include compound-resistant mutations in Smoothened, but also copy number changes in downstream activators SUFU and GLI2 [108]. It is, therefore, essential to conduct follow-up studies to confirm a direct interaction between a compound and the hit protein, as well as a lack of interaction with the mutated protein.
5.2.3. “Chemogenomics”: Examples of Gene-Drug Interaction Screens
When genetic perturbations are combined with small molecule drugs in a chemogenetic interaction screen, the effect of a gene’s perturbation on compound action is studied. Gene perturbation can render the cells resistant to the compound (suppressor interaction), or conversely, result in hypersensitivity and enhanced compound potency (synergistic interaction) [5,117,121]. Typically, cells are treated with the compound at a sublethal dose to ensure that both types of interactions can be found in the final dataset, and it is often necessary to use a variety of compound doses (e.g., LD20, LD30, LD50) and timepoints to obtain reliable insights (Figure 3b).
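To make the two interaction types concrete, the sketch below labels genes from guide abundance in compound-treated versus control cells using a log2 fold change. It is a minimal illustration, not a published analysis method: it assumes already normalized read counts aggregated per gene, and the threshold, pseudocount, gene names, and counts are arbitrary choices made here. Real screens rely on replicate-aware statistics.

```python
import math

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of treated to control abundance; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

def classify_interactions(counts, threshold=1.0):
    """Label each gene by how its perturbation changes abundance under compound treatment.

    counts: dict mapping gene -> (treated_count, control_count), assumed normalized.
    """
    calls = {}
    for gene, (treated, control) in counts.items():
        lfc = log2_fold_change(treated, control)
        if lfc >= threshold:
            calls[gene] = "suppressor"   # enriched: perturbed cells resist the compound
        elif lfc <= -threshold:
            calls[gene] = "synergistic"  # depleted: perturbed cells are hypersensitive
        else:
            calls[gene] = "neutral"
    return calls

# Toy counts, invented for illustration only.
toy = {"GENE_X": (799, 99), "GENE_Y": (24, 399), "GENE_Z": (100, 110)}
calls = classify_interactions(toy)
print(calls)
```

Because depletion can only be observed if cells survive long enough to be counted, this kind of two-sided call depends on the sublethal dosing described above.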
An early example of successful coupling of a phenotypic screen and downstream genetic screening for target identification is the study of Matheny et al. They identified STF-118804 as a compound with antileukemic properties. Treatment of MV411 cells, stably transduced with a high complexity, genome-wide shRNA library, with STF-118804 (4 rounds of increasing concentration) or DMSO control resulted in a marked depletion of cells containing shRNAs against nicotinamide phosphoribosyl transferase (NAMPT) [122].
The Bassik lab subsequently directly compared the performance of shRNA-mediated knockdown versus CRISPR/Cas9-knockout screens for the target elucidation of the antiviral drug GSK983. The data coming out of both screens were complementary, with the shRNA screen resulting in hits leading to the direct compound target and the CRISPR screen giving information on cellular mechanisms of action of the compound. A reason for this is likely the level of protein depletion that is reached by these methods: shRNAs lead to decreased protein levels, which is advantageous when studying essential genes. However, knockdown may not result in a phenotype for non-essential genes, in which case a full CRISPR-mediated knockout is necessary to observe effects [123].
Another NAMPT inhibitor was identified in a CRISPR/Cas9 “haplo-insufficiency (HIP)”-like approach [124]. Haploinsufficiency profiling is a well-established system in yeast which is performed in a ~50% protein background by heterozygous deletions [125]. As there is no control over CRISPR-mediated loss of alleles, compound treatment was performed at several timepoints after addition of the sgRNA library to HCT116 cells stably expressing Cas9, in the hope that editing would be incomplete at early timepoints, resulting in residual protein levels. Indeed, NAMPT was found to be the target of phenotypic hit LB-60-OF61, especially at earlier timepoints, confirming the hypothesis that some level of protein needs to be present to identify a compound’s direct target [124]. This approach was confirmed in another study, thereby showing that direct target identification through CRISPR-knockout screens is indeed possible [126].
An alternative strategy was employed by the Weissman lab, where they combined genome-wide CRISPR-interference and -activation screens to identify the target of the phase 3 drug rigosertib. They focused on hits that had opposite action in both screens, as in sensitizing in one but protective in the other, which were related to microtubule stability. In a next step, they created chemical-genetic profiles of a variety of microtubule destabilizing agents, rationalizing that compounds with the same target will have similar drug-gene interactions. For this, they made a focused library of sgRNAs, based on the most high-ranking hits in the rigosertib genome-wide CRISPRi screen, and compared the focused screen results of the different compounds. The profile for rigosertib clustered well with that of ABT-751, and rigorous target validation studies confirmed rigosertib binding to the colchicine binding site of tubulin, the same site as occupied by ABT-751 [127].
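The idea that compounds sharing a target show similar drug-gene interaction profiles can be illustrated with a simple correlation comparison. This sketch is not the clustering procedure used in the cited study: the compound labels, the per-gene scores, and the choice of Pearson correlation as the similarity measure are assumptions made here for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def most_similar(query, profiles):
    """Return the reference whose profile correlates best with the query, plus all scores."""
    scores = {
        name: pearson(profiles[query], profile)
        for name, profile in profiles.items()
        if name != query
    }
    return max(scores, key=scores.get), scores

# Toy chemical-genetic profiles: one score per sgRNA target gene, same gene order
# in every list. All values are invented for illustration.
profiles = {
    "query_compound": [2.0, -1.5, 0.5, -2.0],
    "reference_A": [1.8, -1.2, 0.4, -2.2],
    "reference_B": [-0.5, 1.0, 2.0, 0.3],
}
best, scores = most_similar("query_compound", profiles)
print(best)
```

A high correlation only suggests a shared mechanism; in the study described above, the hypothesis was followed by direct binding and structural validation.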
From the above examples, it is clear that genetic screens hold a lot of promise for target identification and MOA studies for small molecules. The CRISPR screening field is rapidly evolving, sgRNA libraries are continuously improving and increasingly commercially available, and new tools for data analysis are being developed [128]. The challenge lies in applying these screens to study compounds that are not cytotoxic, where finding the right dosage regimen will not be trivial.
SYSTEMS BIOLOGY AND CANCER RESEARCH & DRUG DISCOVERY
Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence
The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.
1. Introduction
The development and widespread use of high-throughput technologies founded the era of big data in biology and medicine. In particular, it led to an accumulation of large-scale data sets that opened a vast amount of possible applications for data-driven methodologies. In cancer, these applications range from fundamental research to clinical applications: molecular characteristics of tumors, tumor heterogeneity, drug discovery, and potential treatment strategies. Therefore, data-driven bioinformatics research areas have tailored data mining technologies such as systems biology, machine learning, and deep learning, elaborated in this review paper (see Figure 1 and Figure 2). For example, in systems biology, data-driven approaches are applied to identify vital signaling pathways [1]. This pathway-centric analysis is particularly crucial in cancer research to understand the characteristics and heterogeneity of the tumor and tumor subtypes. Consequently, this high-throughput data-based analysis enables us to explore characteristics of cancers with a systems biology and a systems medicine point of view [2]. Combining high-throughput techniques, especially next-generation sequencing (NGS), with appropriate analytical tools has allowed researchers to gain a deeper systematic understanding of cancer at various biological levels, most importantly genomics, transcriptomics, and epigenetics [3,4]. Furthermore, more sophisticated analysis tools based on computational modeling are introduced to decipher underlying molecular mechanisms in various cancer types. The increasing size and complexity of the data required the adaptation of bioinformatics processing pipelines for higher efficiency and sophisticated data mining methodologies, particularly for large-scale NGS datasets [5].
Nowadays, more and more NGS studies integrate a systems biology approach and combine sequencing data with other types of information, for instance, protein family information, pathways, or protein–protein interaction (PPI) networks, in an integrative analysis. Experimentally validated knowledge in systems biology may enhance analysis models and guide them to uncover novel findings. Such integrated analyses have been useful to extract essential information from high-dimensional NGS data [6,7]. To deal with the increasing size and complexity, the application of machine learning, and specifically deep learning methodologies, has become state-of-the-art in NGS data analysis.
Figure 1. Next-generation sequencing data can originate from various experimental and technological conditions. Depending on the purpose of the experiment, one or more of the depicted omics types (Genomics, Transcriptomics, Epigenomics, or Single-Cell Omics) are analyzed. These approaches led to an accumulation of large-scale NGS datasets to solve various challenges of cancer research, molecular characterization, tumor heterogeneity, and drug target discovery. For instance, The Cancer Genome Atlas (TCGA) dataset contains multi-omics data from tens of thousands of patients. This dataset has facilitated a variety of cancer research for decades. Additionally, there are also independent tumor datasets, and, frequently, they are analyzed and compared with the TCGA dataset. As large-scale omics data accumulated, various machine learning techniques were applied, e.g., graph algorithms and deep neural networks, for dimensionality reduction, clustering, or classification. (Created with BioRender.com.)
Figure 2. (a) A multitude of different types of data is produced by next-generation sequencing, for instance, in the fields of genomics, transcriptomics, and epigenomics. (b) Biological networks for biomarker validation: The in vivo or in vitro experiment results are considered ground truth. Statistical analysis on next-generation sequencing data produces candidate genes. Biological networks can validate these candidate genes and highlight the underlying biological mechanisms (Section 2.1). (c) De novo construction of Biological Networks: Machine learning models that aim to reconstruct biological networks can incorporate prior knowledge from different omics data. Subsequently, the model will predict new unknown interactions based on new omics information (Section 2.2). (d) Network-based machine learning: Machine learning models integrating biological networks as prior knowledge to improve predictive performance when applied to different NGS data (Section 2.3). (Created with BioRender.com).
Therefore, a large number of studies integrate NGS data with machine learning and propose a novel data-driven methodology in systems biology [8]. In particular, many network-based machine learning models have been developed to analyze cancer data and help to understand novel mechanisms in cancer development [9,10]. Moreover, deep neural networks (DNN) applied for large-scale data analysis improved the accuracy of computational models for mutation prediction [11,12], molecular subtyping [13,14], and drug repurposing [15,16].
2. Systems Biology in Cancer Research
Genes and their functions have been classified into gene sets based on experimental data. Our understanding of cancer has been condensed into cancer hallmarks that define the characteristics of a tumor. This collective knowledge is used for the functional analysis of unseen data. Furthermore, the regulatory relationships among genes were investigated, and, based on that, a pathway can be composed. In this manner, the accumulation of public high-throughput sequencing data raised many big-data challenges and opened new opportunities and areas of application for computer science. Two of the most vibrantly evolving areas are systems biology and machine learning, which tackle different tasks such as understanding the cancer pathways [9], finding crucial genes in pathways [22,53], or predicting functions of unidentified or understudied genes [54]. Essentially, those models include prior knowledge to develop an analysis and enhance interpretability for high-dimensional data [2]. In addition to understanding cancer pathways with in silico analysis, pathway activity analysis incorporating two different types of data, pathways and omics data, has been developed to understand heterogeneous characteristics of the tumor and cancer molecular subtyping. Due to their advantage in interpretability, various pathway-oriented methods have been introduced and have become useful tools to understand complex diseases such as cancer [55,56,57].
In this section, we will discuss how two related research fields, namely, systems biology and machine learning, can be integrated with three different approaches (see Figure 2), namely, biological network analysis for biomarker validation, the use of machine learning with systems biology, and network-based models.
2.1. Biological Network Analysis for Biomarker Validation
The detection of potential biomarkers indicative of specific cancer types or subtypes is a frequent goal of NGS data analysis in cancer research. For instance, a variety of bioinformatics tools and machine learning models aim to identify lists of genes that are significantly altered on a genomic, transcriptomic, or epigenomic level in cancer cells. Typically, statistical and machine learning methods are employed to find an optimal set of biomarkers, such as single nucleotide polymorphisms (SNPs), mutations, or differentially expressed genes crucial in cancer progression. Traditionally, resource-intensive in vitro analysis was required to discover or validate those markers. Therefore, systems biology offers in silico solutions to validate such findings using biological pathways or gene ontology information (Figure 2b) [58]. Subsequently, gene set enrichment analysis (GSEA) [50] or gene set analysis (GSA) [59] can be used to evaluate whether these lists of genes are significantly associated with cancer types and their specific characteristics. GSA, for instance, is available via web services like DAVID [60] and g:Profiler [61]. Moreover, other applications use gene ontology directly [62,63]. In addition to gene-set-based analysis, there are other methods that focus on the topology of biological networks. These approaches evaluate various network structure parameters and analyze the connectivity of two genes or the size and interconnection of their neighbors [64,65]. According to the underlying idea, the mutated gene will show dysfunction and can affect its neighboring genes. Thus, the goal is to find abnormalities in a specific set of genes linked with an edge in a biological network. For instance, KeyPathwayMiner can extract informative network modules in various omics data [66]. In summary, these approaches aim at predicting the effect of dysfunctional genes among neighbors according to their connectivity or distances from specific genes such as hubs [67,68].
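As a concrete illustration of testing whether a candidate gene list is associated with a gene set, the sketch below computes a simple over-representation p-value from the hypergeometric distribution. This is one common form of gene set analysis and is swapped in here for illustration; it is not the rank-based GSEA algorithm, nor the exact statistic used by any of the web services named above. The function name and the numbers in the example are assumptions.

```python
from math import comb

def enrichment_p_value(universe_size, set_size, hits_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    universe_size: number of genes that could have been selected
    set_size: number of universe genes annotated to the gene set
    hits_size: number of candidate genes drawn from the universe
    overlap: number of candidate genes that fall in the gene set
    """
    upper = min(set_size, hits_size)
    total = comb(universe_size, hits_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 5 of 50 candidate genes fall in a 100-gene set,
# out of a universe of 20,000 genes.
p = enrichment_p_value(universe_size=20000, set_size=100, hits_size=50, overlap=5)
print(f"p = {p:.3g}")
```

In practice many gene sets are tested at once, so p-values like this one need multiple-testing correction before interpretation.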
During the past few decades, the focus of cancer systems biology extended towards the analysis of cancer-related pathways since those pathways tend to carry more information than a gene set. Such analysis is called Pathway Enrichment Analysis (PEA) [69,70]. The use of PEA incorporates the topology of biological networks. At the same time, however, the limited coverage of pathway data needs to be considered. Because pathway data does not cover all known genes yet, an integrative analysis of omics data can lose a significant number of genes when combined with pathways. Genes that cannot be mapped to any pathway are called ‘pathway orphans.’ Rahmati et al. introduced a possible solution to overcome the ‘pathway orphan’ issue [71]. Ultimately, regardless of whether researchers consider gene-set or pathway-based enrichment analysis, the performance and accuracy of both methods are highly dependent on the quality of the external gene-set and pathway data [72].
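The coverage problem can be quantified before running any pathway-based analysis by checking how many genes in a dataset map to at least one pathway. The sketch below is a minimal illustration with invented pathway and gene names; it does not implement the solution proposed in [71].

```python
def split_by_pathway_coverage(genes, pathways):
    """Split genes into those found in at least one pathway and 'pathway orphans'.

    genes: iterable of gene identifiers from an omics dataset
    pathways: dict mapping pathway name -> set of member genes (hypothetical format)
    """
    covered = set().union(*pathways.values())  # all genes annotated to any pathway
    mapped = sorted(g for g in genes if g in covered)
    orphans = sorted(g for g in genes if g not in covered)
    return mapped, orphans

# Toy data, invented for illustration only.
pathways = {
    "pathway_1": {"GENE_A", "GENE_B"},
    "pathway_2": {"GENE_B", "GENE_C"},
}
dataset_genes = ["GENE_A", "GENE_C", "GENE_D", "GENE_E"]
mapped, orphans = split_by_pathway_coverage(dataset_genes, pathways)
print(f"mapped: {mapped}, orphans: {orphans}")
```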
2.2. De Novo Construction of Biological Networks
While the known fraction of existing biological networks barely scratches the surface of the whole system of mechanisms occurring in each organism, machine learning models can improve on known network structures and can guide potential new findings [73,74]. This area of research is called de novo network construction (Figure 2c), and its predictive models can accelerate experimental validation by lowering time costs [75,76]. This interplay between in silico biological networks building and mining contributes to expanding our knowledge in a biological system. For instance, a gene co-expression network helps discover gene modules having similar functions [77]. Because gene co-expression networks are based on expressional changes under specific conditions, commonly, inferring a co-expression network requires many samples. The WGCNA package implements a representative model using weighted correlation for network construction that leads the development of the network biology field [78]. Due to NGS developments, the analysis of gene co-expression networks subsequently moved from microarray-based to RNA-seq based experimental data [79]. However, integration of these two types of data remains tricky. Ballouz et al. compared microarray and NGS-based co-expression networks and found the existence of a bias originating from batch effects between the two technologies [80]. Nevertheless, such approaches are suited to find disease-specific co-expressional gene modules. Thus, various studies based on the TCGA cancer co-expression network discovered characteristics of prognostic genes in the network [81]. Accordingly, a gene co-expression network is a condition-specific network rather than a general network for an organism. Gene regulatory networks can be inferred from the gene co-expression network when various data from different conditions in the same organism are available. 
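The weighted-correlation idea behind co-expression networks can be sketched as follows: compute pairwise correlations between gene expression vectors and raise their absolute values to a power, so that strong correlations dominate the edge weights. This is a simplified, unsigned illustration written in plain Python, not the WGCNA package itself (which is an R package with many additional steps); the expression values, gene names, and the exponent `beta=6` are choices made here for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def soft_threshold_adjacency(expression, beta=6):
    """Weighted adjacency |cor|^beta for every gene pair.

    expression: dict mapping gene -> list of expression values, one per sample,
        with samples in the same order for every gene.
    """
    genes = sorted(expression)
    adjacency = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            adjacency[(g1, g2)] = abs(r) ** beta  # unsigned: sign of r is discarded
    return adjacency

# Toy expression matrix with five samples, invented for illustration only.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "gene_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}
adjacency = soft_threshold_adjacency(expression)
for pair, weight in adjacency.items():
    print(pair, round(weight, 3))
```

With only five samples the correlations are unreliable, which is the practical reason the text notes that inferring a co-expression network requires many samples.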
Additionally, with various NGS applications, we can obtain multi-modal datasets about regulatory elements and their effects, such as epigenomic mechanisms on transcription and chromatin structure. Consequently, a gene regulatory network can consist of solely protein-coding genes or different regulatory node types such as transcription factors, inhibitors, promoter interactions, DNA methylations, and histone modifications affecting the gene expression system [82,83]. More recently, researchers were able to build networks based on a particular experimental setup. For instance, functional genomics or CRISPR technology enables the construction of high-resolution regulatory networks in an organism [84]. Beyond gene co-expression and regulatory networks, drug target and drug repurposing studies are active research areas focusing on the de novo construction of drug-to-target networks to allow the potential repurposing of drugs [76,85].
2.3. Network Based Machine Learning
A network-based machine learning model directly integrates the insights of biological networks within the algorithm (Figure 2d) to ultimately improve predictive performance concerning cancer subtyping or susceptibility to therapy. Following the establishment of high-quality biological networks based on NGS technologies, these biological networks were suited to be integrated into advanced predictive models. In this manner, Zhang et al. categorized network-based machine learning approaches by their usage into three groups: (i) model-based integration, (ii) pre-processing integration, and (iii) post-analysis integration [7]. Network-based models map the omics data onto a biological network, and proper algorithms traverse the network while considering both values of nodes and edges and network topology. In the pre-processing integration, pathway or other network information is commonly processed based on its topological importance. Meanwhile, in the post-analysis integration, omics data is processed on its own before integration with a network. Subsequently, omics data and networks are merged and interpreted. The network-based model has advantages in multi-omics integrative analysis. Due to the different sensitivity and coverage of various omics data types, a multi-omics integrative analysis is challenging. However, focusing on gene-level or protein-level information enables a straightforward integration [86,87]. Consequently, when different machine learning approaches try to integrate two or more different data types to find novel biological insights, one solution is to reduce the search space to the gene or protein level and integrate the heterogeneous data types there [25,88].
In summary, using network information opens new possibilities for interpretation. However, as mentioned earlier, several challenges remain, such as the coverage issue. Current databases for biological networks do not cover the entire set of genes, transcripts, and interactions. Therefore, the use of networks can lead to loss of information for gene or transcript orphans. The following section will focus on network-based machine learning models and their application in cancer genomics. We will put network-based machine learning into the perspective of the three main areas of application, namely, molecular characterization, tumor heterogeneity analysis, and cancer drug discovery.
3. Network-Based Learning in Cancer Research
As introduced previously, the integration of machine learning with the insights of biological networks (Figure 2d) ultimately aims at improving predictive performance and interpretability concerning cancer subtyping or treatment susceptibility.
3.1. Molecular Characterization with Network Information
Various network-based algorithms are used in genomics and focus on quantifying the impact of genomic alteration. By employing prior knowledge in biological network algorithms, performance compared to non-network models can be improved. A prominent example is HotNet. The algorithm uses a thermodynamics model on a biological network and identifies driver genes, or prognostic genes, in pan-cancer data [89]. Another study introduced a network-based stratification method to integrate somatic alterations and expression signatures with network information [90]. These approaches use network topology and network-propagation-like algorithms. Network propagation presumes that genomic alterations can affect the function of neighboring genes. Two genes will show a mutually exclusive alteration pattern if they complement each other and the function they carry is essential to an organism [91]. This exclusive pattern among genomic alterations is further investigated in cancer-related pathways. Recently, Ku et al. developed network-centric approaches and tackled robustness issues while studying synthetic lethality [92]. Although synthetic lethality was initially discovered in genetic model organisms, it helps us to understand cancer-specific mutations and their functions in tumor characteristics [91].
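The propagation assumption above, that an alteration influences neighboring genes, can be sketched as a random walk with restart. This is a generic illustration under stated assumptions and not the HotNet algorithm itself; the function name, the restart probability, the iteration count, and the four-gene toy network are all hypothetical choices.

```python
def propagate(adjacency, seed_scores, restart=0.5, n_iter=50):
    """Random walk with restart on an undirected graph.

    adjacency: dict mapping each gene to the set of its neighbours.
    seed_scores: dict mapping genes to initial scores (e.g. 1.0 if mutated).
    """
    genes = sorted(adjacency)
    p0 = {g: seed_scores.get(g, 0.0) for g in genes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {}
        for g in genes:
            # Each neighbour passes on its score divided by its own degree.
            incoming = sum(p[n] / len(adjacency[n]) for n in adjacency[g])
            nxt[g] = (1 - restart) * incoming + restart * p0[g]
        p = nxt
    return p

# Toy network (illustrative only): a single altered gene seeds the walk.
network = {
    "TP53": {"MDM2", "ATM"},
    "MDM2": {"TP53"},
    "ATM": {"TP53", "CHEK2"},
    "CHEK2": {"ATM"},
}
scores = propagate(network, {"TP53": 1.0})
```

Genes close to the altered gene end up with higher scores than distant ones, which is the signal that propagation-based methods use to rank candidate genes or subnetworks.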
Furthermore, in transcriptome research, network information is used to measure pathway activity and to apply it in cancer subtyping. For instance, when comparing the data of two or more conditions such as cancer types, GSEA as introduced in Section 2 is a useful approach to get an overview of systematic changes [50]. It is typically used at the beginning of a data evaluation [93]. An experimentally validated gene set can provide information about how different conditions affect molecular systems in an organism. In addition to the gene sets, different approaches integrate complex interaction information into GSEA and build network-based models [70]. In contrast to GSEA, pathway activity analysis considers transcriptome data, other omics data, and the structural information of a biological network. For example, PARADIGM uses pathway topology and integrates various omics in the analysis to infer a patient-specific status of pathways [94]. A benchmark study with pan-cancer data recently revealed that using network structure can yield better performance [57]. In conclusion, although the incompleteness of biological networks causes some loss of data, their integration improved performance and increased interpretability in many cases.
3.2. Tumor Heterogeneity Study with Network Information
Tumor heterogeneity can originate from two sources, clonal heterogeneity and tumor impurity. Clonal heterogeneity covers genomic alterations within the tumor [95]. While de novo mutations accumulate, the tumor obtains genomic alterations with an exclusive pattern. When these genomic alterations are projected on the pathway, it is possible to observe exclusive relationships among disease-related genes. For instance, the CoMEt and MEMo algorithms examine mutual exclusivity on protein–protein interaction networks [96,97]. Moreover, the relationship between genes can be essential for an organism. Therefore, models analyzing such alterations integrate network-based analysis [98].
In contrast, tumor purity is dependent on the tumor microenvironment, including immune-cell infiltration and stromal cells [99]. In tumor microenvironment studies, network-based models are applied, for instance, to find immune-related gene modules. Although the importance of the interaction between tumors and immune cells is well known, detailed mechanisms are still unclear. Thus, many recent NGS studies employ network-based models to investigate the underlying mechanism in tumor and immune reactions. For example, McGrail et al. identified a relationship between the DNA damage response protein and immune cell infiltration in cancer. The analysis is based on curated interaction pairs in a protein–protein interaction network [100]. Most recently, Darzi et al. discovered a prognostic gene module related to immune cell infiltration by using network-centric approaches [101]. Tu et al. presented a network-centric model for mining subnetworks of genes other than immune cell infiltration by considering tumor purity [102].
3.3. Drug Target Identification with Network Information
In drug target studies, network biology is integrated into pharmacology [103]. For instance, Yamanishi et al. developed novel computational methods to investigate the pharmacological space by integrating a drug-target protein network with genomics and chemical information. The proposed approaches investigated such drug-target network information to identify potential novel drug targets [104]. Since then, the field has continued to develop methods to study drug target and drug response integrating networks with chemical and multi-omic datasets. In a recent survey study by Chen et al., the authors compared 13 computational methods for drug response prediction. It turned out that gene expression profiles are crucial information for drug response prediction [105].
Moreover, drug-target studies are often extended to drug-repurposing studies. In cancer research, drug-repurposing studies aim to find novel interactions between non-cancer drugs and molecular features in cancer. Drug-repurposing (or repositioning) studies apply computational approaches and pathway-based models and aim at discovering potential new cancer drugs with a higher probability than de novo drug design [16,106]. Specifically, drug-repurposing studies can consider various areas of cancer research, such as tumor heterogeneity and synthetic lethality. As an example, Lee et al. found clinically relevant synthetic lethality interactions by integrating multiple screening NGS datasets [107]. Such synthetic lethality and drug-related datasets can be integrated to design effective combinations of anticancer therapy with repurposed non-cancer drugs.
4. Deep Learning in Cancer Research
DNN models are developing rapidly and becoming more sophisticated. They have been frequently used in all areas of biomedical research. Initially, their development was facilitated by large-scale imaging and video data. While most data sets in the biomedical field would not typically be considered big data, the rapid data accumulation enabled by NGS made it suitable for the application of DNN models requiring a large amount of training data [108]. For instance, in 2019, Samiei et al. used TCGA-based large-scale cancer data as benchmark datasets for bioinformatics machine learning research, similar to ImageNet in the computer vision field [109]. Subsequently, large-scale public cancer data sets such as TCGA encouraged the wide usage of DNNs in the cancer domain [110]. Over the last decade, these state-of-the-art machine learning methods have been applied to many different biological questions [111].
In addition to public cancer databases such as TCGA, the genetic information of normal tissues is stored in well-curated databases such as GTEx [112] and 1000Genomes [113]. These databases are frequently used as control or baseline training data for deep learning [114]. Moreover, other non-curated large-scale data sources such as GEO (https://www.ncbi.nlm.nih.gov/geo/, accessed on 20 May 2021) can be leveraged to tackle critical aspects in cancer research. They store a large amount of biological data produced under various experimental setups (Figure 1). Therefore, an integration of GEO data and other data requires careful preprocessing. Overall, an increasing number of datasets facilitates the development of current deep learning in bioinformatics research [115].
4.1. Challenges for Deep Learning in Cancer Research
Many studies in biology and medicine used NGS and produced large amounts of data during the past few decades, moving the field to the big data era. Nevertheless, researchers still face a lack of data, in particular when investigating rare diseases or disease states. Researchers have developed a range of potential solutions to overcome this lack of data, such as imputation, augmentation, and transfer learning (Figure 3b). Data imputation aims at handling data sets with missing values [116]. It has been studied on various NGS omics data types to recover missing information [117]. It is known that gene expression levels can be altered by different regulatory elements, such as DNA-binding proteins, epigenomic modifications, and post-transcriptional modifications. Therefore, various models integrating such regulatory schemes have been introduced to impute missing omics data [118,119]. Some DNN-based models aim to predict gene expression changes based on genomic or epigenomic alterations. For instance, TDimpute aims at generating missing RNA-seq data by training a DNN on methylation data. Its authors used TCGA and TARGET (https://ocg.cancer.gov/programs/target/data-matrix, accessed on 20 May 2021) data as proof of concept of the applicability of DNNs for data imputation in a multi-omics integration study [120]. Because this integrative model can exploit information at different levels of regulatory mechanisms, it can build a more detailed model and achieve better performance than a model built on a single-omics dataset [117,121]. The generative adversarial network (GAN) is a DNN structure for generating simulated data that is different from the original data but shows the same characteristics [122]. GANs can impute missing omics data from other multi-omics sources. Recently, the GAN algorithm has been getting more attention in single-cell transcriptomics because it has been recognized as a complementary technique to overcome the limitations of scRNA-seq [123].
In contrast to data imputation and generation, other machine learning approaches aim to cope with a limited dataset in different ways. Transfer learning or few-shot learning, for instance, aims to reduce the search space with similar but unrelated datasets and guide the model to solve a specific set of problems [124]. These approaches train models on data that share characteristics and types with the problem set but come from a different source. After pre-training, the model can be fine-tuned with the dataset of interest [125,126]. Thus, researchers are trying to introduce few-shot learning models and meta-learning approaches to omics and translational medicine. For example, Select-ProtoNet applied the ProtoTypical Network [127] model to TCGA transcriptome data and classified patients into two groups according to their clinical status [128]. AffinityNet predicts kidney and uterus cancer subtypes with gene expression profiles [129].
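The prototype idea behind few-shot classifiers can be sketched in a few lines: each class is summarized by the mean of its few labeled examples, and a new sample is assigned to the nearest class mean. This is a minimal sketch, not the Select-ProtoNet implementation; in a real prototypical network the vectors would be embeddings produced by a trained encoder, whereas here the raw toy vectors, class labels, and function names are assumptions for illustration.

```python
def prototypes(support):
    """Mean vector per class from a dict {label: [vectors]}."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query to the class with the closest prototype (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))

# Toy support set (hypothetical): two labeled patients per clinical group.
support = {
    "responder": [[0.9, 0.1], [1.1, 0.3]],
    "non_responder": [[0.1, 1.0], [0.3, 0.8]],
}
protos = prototypes(support)
label = classify([0.8, 0.2], protos)
```

Because only class means are estimated, the approach remains usable when each class has just a handful of labeled samples.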
Figure 3. (a) In various studies, NGS data are transformed into different forms. The 2-D transformed form is for the convolution layer. Omics data are transformed into pathway level, GO enrichment score, or functional spectra. (b) DNN applications for handling a lack of data in different ways. Imputation for missing data in multi-omics datasets. GAN for data imputation and in silico data simulation. Transfer learning pre-trains the model with other datasets and then fine-tunes it. (c) Various types of information in biology. (d) Graph neural network examples. GCN is applied to aggregate neighbor information. (Created with BioRender.com).
4.2. Molecular Characterization with Network and DNN Model
DNNs have been applied in multiple areas of cancer research. For instance, a DNN model trained on TCGA cancer data can aid molecular characterization by identifying cancer driver genes. At a very early stage, Yuan et al. built DeepGene, a cancer-type classifier. They implemented data sparsity reduction methods and trained the DNN model with somatic point mutations [130]. Lyu et al. [131] and DeepGx [132] embedded a 1-D gene expression profile into a 2-D array by chromosome order to implement the convolution layer (Figure 3a). Other algorithms, such as deepDriver, use k-nearest neighbors for the convolution layer. A predefined number of neighboring gene mutation profiles was the input for the convolution layer. It employed this convolution layer in a DNN by aggregating mutation information of the k-nearest neighboring genes [11]. Instead of embedding into a 2-D image, DeepCC transformed gene expression data into functional spectra. The resulting model was able to capture molecular characteristics by training on cancer subtypes [14].
Another DNN model was trained to infer the tissue of origin from single-nucleotide variant (SNV) information of metastatic tumors. The authors built a model by using the TCGA/ICGC data and analyzed SNV patterns and corresponding pathways to predict the origin of cancer. They discovered that metastatic tumors retained their original cancer’s signature mutation pattern. In this context, their DNN model obtained even better accuracy than a random forest model [133] and, even more important, better accuracy than human pathologists [12].
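The 1-D to 2-D embedding by chromosome order can be sketched as follows. This is only an illustration of the general idea, not the preprocessing used by Lyu et al. or DeepGx; the square grid shape, the zero padding, the function name, and the toy genes with their positions are assumptions.

```python
from math import ceil, sqrt

def to_image(expression, positions):
    """Arrange a 1-D expression profile as a square 2-D grid.

    expression: dict gene -> expression value.
    positions: dict gene -> (chromosome index, start coordinate).
    Genes are ordered by genomic position and filled row by row;
    unused cells at the end are padded with 0.0.
    """
    ordered = sorted(expression, key=lambda g: positions[g])
    side = ceil(sqrt(len(ordered)))
    values = [expression[g] for g in ordered]
    values += [0.0] * (side * side - len(values))
    return [values[r * side:(r + 1) * side] for r in range(side)]

# Toy profile (hypothetical genes and coordinates).
expression = {"g1": 5.0, "g2": 1.5, "g3": 0.0, "g4": 7.2, "g5": 3.3}
positions = {"g1": (2, 100), "g2": (1, 500), "g3": (1, 50), "g4": (3, 10), "g5": (2, 900)}
image = to_image(expression, positions)
```

The resulting grid places genomically adjacent genes next to each other within a row, so a convolution filter sees local chromosomal neighborhoods.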
4.3. Tumor Heterogeneity with Network and DNN Model
As described in Section 4.1, there are several issues because of cancer heterogeneity, e.g., the tumor microenvironment. Thus, there are only a few applications of DNNs in intratumoral heterogeneity research. For instance, Menden et al. developed ’Scaden’ to deconvolve cell types in bulk-cell sequencing data. ’Scaden’ is a DNN model for the investigation of intratumor heterogeneity. To overcome the lack of training datasets, researchers need to generate in silico simulated bulk-cell sequencing data based on single-cell sequencing data [134]. It is presumed that deconvolving cell types can be achieved by knowing all possible expression profiles of the cell [36]. However, this information is typically not available. Recently, to tackle this problem, single-cell sequencing-based studies were conducted. Because of technical limitations, we need to handle many missing values, noise, and batch effects in single-cell sequencing data [135]. Thus, various machine learning methods were developed to process single-cell sequencing data. They aim at mapping single-cell data onto a latent space. For example, scDeepCluster implemented an autoencoder and trained it on gene-expression levels from single-cell sequencing. During the training phase, the encoder and decoder work as a denoiser. At the same time, they can embed high-dimensional gene-expression profiles into lower-dimensional vectors [136]. This autoencoder-based method can produce biologically meaningful feature vectors in various contexts, from tissue cell types [137] to different cancer types [138,139].
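Generating simulated bulk samples from single-cell profiles can be sketched as below. This is a simplified illustration and not the Scaden simulation procedure; the function name, the number of sampled cells, the way random proportions are drawn, and the two-gene toy profiles are assumptions.

```python
import random

def simulate_bulk(cells_by_type, n_cells=100, rng=None):
    """Mix single-cell profiles into one in silico bulk sample.

    cells_by_type: dict cell type -> list of expression vectors.
    Returns (bulk_profile, fractions); fractions is the known cell-type
    composition that a deconvolution model would be trained to predict.
    """
    rng = rng or random.Random()
    types = sorted(cells_by_type)
    # Draw random proportions and normalise them to sum to one.
    raw = [rng.random() for _ in types]
    total = sum(raw)
    n_genes = len(cells_by_type[types[0]][0])
    bulk = [0.0] * n_genes
    counts = {}
    for t, w in zip(types, raw):
        counts[t] = round(n_cells * w / total)
        # Sum the expression of randomly chosen cells of this type.
        for _ in range(counts[t]):
            cell = rng.choice(cells_by_type[t])
            bulk = [b + c for b, c in zip(bulk, cell)]
    used = sum(counts.values())
    fractions = {t: counts[t] / used for t in types}
    return bulk, fractions

# Toy single-cell data (hypothetical): each cell type expresses one marker gene.
cells = {"T_cell": [[1.0, 0.0]], "tumor": [[0.0, 1.0]]}
bulk, fractions = simulate_bulk(cells, n_cells=100, rng=random.Random(0))
```

Repeating the call many times yields pairs of bulk profiles and known fractions, which is the kind of labeled training data that is otherwise unavailable for deconvolution.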
4.4. Drug Target Identification with Networks and DNN Models
In addition to NGS datasets, large-scale anticancer drug assays enabled the training of DNNs. Moreover, non-cancer drug response assay datasets can also be incorporated with cancer genomic data. In cancer research, a multidisciplinary approach was widely applied for repurposing non-oncology drugs for cancer treatment. This drug repurposing is faster than de novo drug discovery. Furthermore, combination therapy with a non-oncology drug can be beneficial to overcome the heterogeneous properties of tumors [85]. The deepDR algorithm integrated ten drug-related networks and trained deep autoencoders. It used a random-walk-based algorithm to represent graph information as feature vectors. This approach integrated network analysis with a DNN model validated with an independent drug-disease dataset [15].
The authors of CDRscan did an integrative analysis of cell-line-based assay datasets and other drug and genomics datasets. It shows that DNN models can enhance the computational model for improved drug sensitivity predictions [140]. Additionally, similar to previous network-based models, the multi-omics application of drug-targeted DNN studies can show higher prediction accuracy than the single-omics method. MOLI integrated genomic data and transcriptomic data to predict the drug responses of TCGA patients [141].
4.5. Graph Neural Network Model
In general, the advantage of using a biological network is that it can produce more comprehensive and interpretable results from high-dimensional omics data. Furthermore, in an integrative multi-omics data analysis, network-based integration can improve interpretability over traditional approaches. Instead of pre-/post-integration of a network, recently developed graph neural networks use biological networks as the base structure for the learning network itself. For instance, various pathways or interactome information can be integrated as a learning structure of a DNN and can be aggregated as heterogeneous information. In a GNN study, a convolution process can be done on the provided network structure of data. Therefore, the convolution on a biological network made it possible for the GNN to focus on the relationship among neighbor genes. In the graph convolution layer, the convolution process integrates information of neighbor genes and learns topological information (Figure 3d). Consequently, this model can aggregate information from far-distant neighbors, and thus can outperform other machine learning models [142].
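A single graph convolution step of the kind described above can be written out explicitly. The sketch uses the widely used normalized propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), which is an assumption here since the reviewed studies may use other variants; the function name, the three-gene path graph, and the feature and weight values are hypothetical.

```python
from math import sqrt

def gcn_layer(adj, features, weights):
    """One graph convolution: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj: square 0/1 adjacency matrix (list of lists) of the gene network.
    features: one feature row per gene (H).
    weights: matrix mapping input features to output features (W).
    """
    n = len(adj)
    # Add self-loops so that each gene keeps its own signal.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

    def matmul(x, y):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]

# Toy path graph g0 - g1 - g2 with one feature per gene.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
hidden = gcn_layer(adj, features, weights)   # signal reaches g1 only
hidden2 = gcn_layer(adj, hidden, weights)    # second layer reaches g2
```

After one layer the signal from g0 has reached its direct neighbor g1 but not g2; stacking a second layer carries it to g2, which illustrates how deeper GNNs aggregate information from more distant neighbors.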
In the context of the inference problem of gene expression, the main question is whether the gene expression level can be explained by aggregating the neighboring genes. A single gene inference study by Dutil et al. showed that the GNN model outperformed other DNN models [143]. Moreover, in cancer research, such GNN models can identify cancer-related genes with better performance than other network-based models, such as HotNet2 and MutSigCV [144]. A recent GNN study with a multi-omics integrative analysis identified 165 new cancer genes as interaction partners of known cancer genes [145]. Additionally, in the synthetic lethality area, dual-dropout GNN outperformed previous bioinformatics tools for predicting synthetic lethality in tumors [146]. GNNs were also able to classify cancer subtypes based on pathway activity measures with RNA-seq data. Lee et al. implemented a GNN for cancer subtyping and tested five cancer types. Thus, the informative pathway was selected and used for subtype classification [147]. Furthermore, GNNs are also getting more attention in drug repositioning studies. As described in Section 3.3, drug discovery requires integrating various networks in both chemical and genomic spaces (Figure 3d). Chemical structures, protein structures, pathways, and other multi-omics data were used in drug-target identification and repurposing studies (Figure 3c). Each of the proposed applications specializes in a different drug-related task. Sun et al. summarized GNN-based drug discovery studies and categorized them into four classes: molecular property and activity prediction, interaction prediction, synthesis prediction, and de novo drug design. The authors also point out four challenges in GNN-mediated drug discovery. First, as we described before, there is a lack of drug-related datasets. Secondly, the current GNN models cannot fully represent 3-D structures of chemical molecules and protein structures.
The third challenge is integrating heterogeneous network information. Drug discovery usually requires a multi-modal integrative analysis with various networks, and GNNs can improve this integrative analysis. Lastly, although GNNs use graphs, stacked layers still make it hard to interpret the model [148].
4.6. Shortcomings in AI and Revisiting Validity of Biological Networks as Prior Knowledge
The previous sections reviewed a variety of DNN-based approaches that show good performance in numerous applications. However, DNNs are hardly a panacea for all research questions. In the following, we will discuss potential limitations of the DNN models. In general, DNN models with NGS data have two significant issues: (i) data requirements and (ii) interpretability. Usually, deep learning needs a large amount of training data for reasonable performance, which is more difficult to achieve in biomedical omics data compared to, for instance, image data. Today, there are not many NGS datasets that are well-curated and well-annotated for deep learning. This can be an answer to the question of why most DNN studies are in cancer research [110,149]. Moreover, the deep learning models are hard to interpret and are typically considered black boxes. Highly stacked layers in the deep learning model make it hard to interpret its decision-making rationale. Although the methodology to understand and interpret deep learning models has improved, the ambiguity in the DNN models’ decision-making has hindered the transition from the deep learning model to translational medicine [149,150].
As described before, biological networks are employed in various computational analyses for cancer research. The studies applying DNNs demonstrated many different approaches to use prior knowledge for systematic analyses. Before discussing GNN application, the validity of biological networks in a DNN model needs to be shown. The LINCS program analyzed data of ’The Connectivity Map (CMap) project’ to understand the regulatory mechanism in gene expression by inferring the whole gene expression profiles from a small set of genes (https://lincsproject.org/, accessed on 20 May 2021) [151,152]. This LINCS program found that the gene expression level is inferrable with only nearly 1000 genes. They called this gene list ’landmark genes’. Subsequently, Chen et al. started with these 978 landmark genes and tried to predict other gene expression levels with DNN models. Integrating public large-scale NGS data showed better performance than the linear regression model. The authors conclude that the performance advantage originates from the DNN’s ability to model non-linear relationships between genes [153].
Following this study, Beltin et al. extensively investigated various biological networks in the same context of the inference of gene expression level. They set up a simplified representation of gene expression status and tried to solve a binary classification task. To show the relevance of a biological network, they compared gene expression levels inferred from different sets of genes: neighboring genes in PPI, random genes, and all genes. However, in the study incorporating TCGA and GTEx datasets, the random network model outperformed the model built on a known biological network, such as StringDB [154]. While network-based approaches can add valuable insights to analysis, this study shows that they cannot be seen as a panacea, and a careful evaluation is required for each data set and task. In particular, this result may not represent biological complexity because of the oversimplified problem setup, which did not consider the relative gene-expression changes. Additionally, the incorporated biological networks may not be suitable for inferring gene expression profiles because they consist of expression-regulating interactions, non-expression-regulating interactions, and various in vivo and in vitro interactions.
“However, although recently sophisticated applications of deep learning showed improved accuracy, it does not reflect a general advancement. Depending on the type of NGS data, the experimental design, and the question to be answered, a proper approach and specific deep learning algorithms need to be considered. Deep learning is not a panacea. In general, to employ machine learning and systems biology methodology for a specific type of NGS data, a certain experimental design, a particular research question, the technology, and network data have to be chosen carefully.”
Hoadley, K.A.; Yau, C.; Wolf, D.M.; Cherniack, A.D.; Tamborero, D.; Ng, S.; Leiserson, M.D.; Niu, B.; McLellan, M.D.; Uzunangelov, V.; et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014, 158, 929–944.
Hutter, C.; Zenklusen, J.C. The cancer genome atlas: Creating lasting value beyond its data. Cell 2018, 173, 283–285.
Chuang, H.Y.; Lee, E.; Liu, Y.T.; Lee, D.; Ideker, T. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 2007, 3, 140.
Zhang, W.; Chien, J.; Yong, J.; Kuang, R. Network-based machine learning and graph theory algorithms for precision oncology. NPJ Precis. Oncol. 2017, 1, 25.
Ngiam, K.Y.; Khor, W. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019, 20, e262–e273.
Creixell, P.; Reimand, J.; Haider, S.; Wu, G.; Shibata, T.; Vazquez, M.; Mustonen, V.; Gonzalez-Perez, A.; Pearson, J.; Sander, C.; et al. Pathway and network analysis of cancer genomes. Nat. Methods 2015, 12, 615.
Reyna, M.A.; Haan, D.; Paczkowska, M.; Verbeke, L.P.; Vazquez, M.; Kahraman, A.; Pulido-Tamayo, S.; Barenboim, J.; Wadi, L.; Dhingra, P.; et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat. Commun. 2020, 11, 729.
Luo, P.; Ding, Y.; Lei, X.; Wu, F.X. deepDriver: Predicting cancer driver genes based on somatic mutations using deep convolutional neural networks. Front. Genet. 2019, 10, 13.
Jiao, W.; Atwal, G.; Polak, P.; Karlic, R.; Cuppen, E.; Danyi, A.; De Ridder, J.; van Herpen, C.; Lolkema, M.P.; Steeghs, N.; et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nat. Commun. 2020, 11, 728.
Chaudhary, K.; Poirion, O.B.; Lu, L.; Garmire, L.X. Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 2018, 24, 1248–1259.
Gao, F.; Wang, W.; Tan, M.; Zhu, L.; Zhang, Y.; Fessler, E.; Vermeulen, L.; Wang, X. DeepCC: A novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 2019, 8, 44.
Zeng, X.; Zhu, S.; Liu, X.; Zhou, Y.; Nussinov, R.; Cheng, F. deepDR: A network-based deep learning approach to in silico drug repositioning. Bioinformatics 2019, 35, 5191–5198.
Issa, N.T.; Stathias, V.; Schürer, S.; Dakshanamurthy, S. Machine and deep learning approaches for cancer drug repurposing. In Seminars in Cancer Biology; Elsevier: Amsterdam, The Netherlands, 2020.
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 2020, 578, 82.
King, M.C.; Marks, J.H.; Mandell, J.B. Breast and ovarian cancer risks due to inherited mutations in BRCA1 and BRCA2. Science 2003, 302, 643–646.
Courtney, K.D.; Corcoran, R.B.; Engelman, J.A. The PI3K pathway as drug target in human cancer. J. Clin. Oncol. 2010, 28, 1075.
Parker, J.S.; Mullins, M.; Cheang, M.C.; Leung, S.; Voduc, D.; Vickery, T.; Davies, S.; Fauron, C.; He, X.; Hu, Z.; et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 2009, 27, 1160.
Yersal, O.; Barutca, S. Biological subtypes of breast cancer: Prognostic and therapeutic implications. World J. Clin. Oncol. 2014, 5, 412.
Zhao, L.; Lee, V.H.; Ng, M.K.; Yan, H.; Bijlsma, M.F. Molecular subtyping of cancer: Current status and moving toward clinical applications. Brief. Bioinform. 2019, 20, 572–584.
Jones, P.A.; Issa, J.P.J.; Baylin, S. Targeting the cancer epigenome for therapy. Nat. Rev. Genet. 2016, 17, 630.
Huang, S.; Chaudhary, K.; Garmire, L.X. More is better: Recent progress in multi-omics data integration methods. Front. Genet. 2017, 8, 84.
Chin, L.; Andersen, J.N.; Futreal, P.A. Cancer genomics: From discovery science to personalized medicine. Nat. Med. 2011, 17, 297.
Use of Systems Biology in Anti-Microbial Drug Development
Genomics, Computational Biology and Drug Discovery for Mycobacterial Infections: Fighting the Emergence of Resistance. Asma Munir, Sundeep Chaitanya Vedithi, Amanda K. Chaplin and Tom L. Blundell. Front. Genet., 04 September 2020 | https://doi.org/10.3389/fgene.2020.00965
In an earlier review article (Waman et al., 2019), we discussed various computational approaches and experimental strategies for drug target identification and structure-guided drug discovery. In this review we discuss the impact of the era of precision medicine, where the genome sequences of pathogens can give clues about the choice of existing drugs, and repurposing of others. Our focus is directed toward combatting antimicrobial drug resistance with emphasis on tuberculosis and leprosy. We describe structure-guided approaches to understanding the impacts of mutations that give rise to antimycobacterial resistance and the use of this information in the design of new medicines.
Genome Sequences and Proteomic Structural Databases
In recent years, there have been many focused efforts to define the amino-acid sequences of the M. tuberculosis pan-genome and then to define the three-dimensional structures and functional interactions of these gene products. This work has led to essential genes of the bacteria being revealed and to a better understanding of the genetic diversity in different strains that might lead to a selective advantage (Coll et al., 2018). This will help with our understanding of the mode of antibiotic resistance within these strains and aid structure-guided drug discovery. However, only ∼10% of the ∼4128 proteins have structures determined experimentally.
Several databases have been developed to integrate the genomic and/or structural information linked to drug resistance in Mycobacteria (Table 1). These invaluable resources can contribute to better understanding of molecular mechanisms involved in drug resistance and improvement in the selection of potential drug targets.
There is a dearth of information related to structural aspects of proteins from M. leprae and their oligomeric and hetero-oligomeric organization, which has limited the understanding of physiological processes of the bacillus. The structures of only 12 proteins have been solved and deposited in the protein data bank (PDB). However, the high sequence similarity in protein coding genes between M. leprae and M. tuberculosis allows computational methods to be used for comparative modeling of the proteins of M. leprae. Mainly monomeric models using single template modeling have been defined and deposited in the Swiss Model repository (Bienert et al., 2017), in Modbase (Pieper et al., 2014), and in a collection with other infectious disease agents (Sosa et al., 2018). There is a need for multi-template modeling and building homo- and hetero-oligomeric complexes to better understand the interfaces, druggability and impacts of mutations.
We are now exploiting Vivace, a multi-template modeling pipeline developed in our lab for modeling the proteomes of M. tuberculosis (CHOPIN, see above) and M. abscessus [Mabellini Database (Skwark et al., 2019)], to model the proteome of M. leprae. We emphasize the need for understanding the protein interfaces that are critical to function. An example is the RNA-polymerase holoenzyme complex from M. leprae. We first modeled the structure of this hetero-hexamer complex and later deciphered the binding patterns of rifampin (Vedithi et al., 2018; Figures 1A,B). Rifampin is an established drug for treating tuberculosis and leprosy. Owing to high rifampin resistance in tuberculosis and emerging resistance in leprosy, we used an approach known as “Computational Saturation Mutagenesis” to identify sites on the protein that are less impacted by mutations. In this study, we were able to understand the association between predicted impacts of mutations on the structure and phenotypic rifampin-resistance outcomes in leprosy.
Figure 2. (A) Stability changes predicted by mCSM for systematic mutations in the β-subunit of RNA polymerase in M. leprae. The maximum destabilizing effect from among all 19 possible mutations at each residue position is used as a weighting factor for the color map, which grades from red (high destabilizing effects) to white (neutral to stabilizing effects) (Vedithi et al., 2020). (B) One of the known mutations in the β-subunit of RNA polymerase, the S437H substitution, which resulted in the maximum destabilizing effect [-1.701 kcal/mol (mCSM)] among all 19 possibilities at this position. In the mutant, histidine (residue in green) forms hydrogen bonds with S434 and Q438, aromatic interactions with F431, and other ring-ring and π interactions with the surrounding residues, which can impact the shape of the rifampin binding pocket and rifampin affinity to the β-subunit [-0.826 log(affinity fold change) (mCSM-lig)]. Orange dotted lines represent weak hydrogen bond interactions. Ring-ring and intergroup interactions are depicted in cyan. Aromatic interactions are represented in sky-blue and carbonyl interactions in pink dotted lines. Green dotted lines represent hydrophobic interactions (Vedithi et al., 2020).
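The per-residue weighting described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `predict_ddg` is a hypothetical stand-in for a stability predictor such as mCSM (it is not the mCSM interface), `toy_predictor` is a made-up scoring function so the sketch runs, and negative values are treated as destabilizing to match the sign of the -1.701 kcal/mol value quoted above.

```python
from typing import Callable, Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def saturation_mutagenesis(
    sequence: str,
    predict_ddg: Callable[[int, str, str], float],
) -> Dict[int, Tuple[str, float]]:
    """For each 1-based position, return the most destabilizing substitution.

    `predict_ddg(position, wild_type, mutant)` is a hypothetical predictor.
    More negative scores are treated as more destabilizing.
    """
    worst: Dict[int, Tuple[str, float]] = {}
    for pos, wild_type in enumerate(sequence, start=1):
        # Score all 19 substitutions: every standard residue except the wild type.
        scores = [
            (mutant, predict_ddg(pos, wild_type, mutant))
            for mutant in AMINO_ACIDS
            if mutant != wild_type
        ]
        # Keep the substitution with the lowest (most destabilizing) score.
        worst[pos] = min(scores, key=lambda item: item[1])
    return worst


def toy_predictor(pos: int, wild_type: str, mutant: str) -> float:
    """Made-up score used only so the sketch runs; it has no physical meaning."""
    return -abs(ord(wild_type) - ord(mutant)) / 10.0


if __name__ == "__main__":
    for pos, (mutant, ddg) in saturation_mutagenesis("MSH", toy_predictor).items():
        print(pos, mutant, ddg)
```

The resulting per-position minimum is the kind of value that could drive a red-to-white color map, with positions whose worst-case score stays near zero being candidates for sites less impacted by mutation.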
Examples of Understanding and Combatting Resistance
The availability of whole genome sequences in the present era has greatly enhanced the understanding of emergence of drug resistance in infectious diseases like tuberculosis. The data generated by the whole genome sequencing of clinical isolates can be screened for the presence of drug-resistant mutations. A preliminary in silico analysis of mutations can then be used to prioritize experimental work to identify the nature of these mutations.
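As a rough sketch of the screening step described above, variants called from a clinical isolate can be compared against a catalogue of known resistance mutations, with the remainder set aside for in silico prioritization. The catalogue below is a two-entry illustration (katG S315T for isoniazid and rpoB S450L for rifampin), not a curated resource, and the `Variant` type and `screen_isolate` function are hypothetical names introduced for this example.

```python
from typing import Dict, Iterable, List, NamedTuple


class Variant(NamedTuple):
    gene: str
    change: str  # amino-acid change, e.g. "S315T"


# Illustrative catalogue only; a real analysis would use a curated database.
RESISTANCE_CATALOGUE: Dict[Variant, str] = {
    Variant("katG", "S315T"): "isoniazid",
    Variant("rpoB", "S450L"): "rifampin",
}


def screen_isolate(variants: Iterable[Variant]) -> Dict[str, List[Variant]]:
    """Split an isolate's variants into catalogued hits and uncharacterized ones."""
    report: Dict[str, List[Variant]] = {"known": [], "to_prioritize": []}
    for variant in variants:
        key = "known" if variant in RESISTANCE_CATALOGUE else "to_prioritize"
        report[key].append(variant)
    return report


if __name__ == "__main__":
    # Hypothetical isolate: one catalogued mutation and one uncharacterized change.
    isolate = [Variant("katG", "S315T"), Variant("katG", "A110V")]
    report = screen_isolate(isolate)
    for variant in report["known"]:
        print(variant.gene, variant.change, "->", RESISTANCE_CATALOGUE[variant])
    print("for in silico follow-up:", report["to_prioritize"])
```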
FIGURE 3
Figure 3.(A) Mechanism of isoniazid activation and INH-NAD adduct formation. (B) Mutations mapped (Munir et al., 2019) on the structure of KatG (PDB ID:1SJ2; Bertrand et al., 2004).
Other articles related to Computational Biology, Systems Biology, and Bioinformatics on this online journal include:
Few cancer mechanisms are as devastating as the generation of cancer stem cells, which arise in leukemia from white blood cell precursors. The mechanisms of this transition have been obscure, but the consequences are all too clear. Leukemia stem cells promote an aggressive, therapy-resistant form of disease called blast crisis.
Delving into the mechanisms by which leukemia stem cells are primed, a team of scientists at the University of California, San Diego (UCSD), uncovered a misfiring RNA-editing system. The main problem the scientists found was an enzyme called ADAR1 (adenosine deaminase acting on RNA1), which mediates post-transcriptional adenosine-to-inosine (A-to-I) RNA editing.
ADAR1 can edit the sequence of microRNAs (miRNAs), small pieces of genetic material. By swapping out just one miRNA building block for another, ADAR1 alters the carefully orchestrated system cells use to control which genes are turned on or off at which times.
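Because inosine is read as guanosine during reverse transcription and sequencing, A-to-I editing typically appears as an A-to-G mismatch between a reference sequence and the sequenced cDNA. The sketch below illustrates that idea on toy, pre-aligned sequences; the function names and the example sequences are assumptions made for illustration, and a real pipeline would also need to exclude SNPs and sequencing errors.

```python
from typing import List


def candidate_editing_sites(reference: str, read: str) -> List[int]:
    """Return 0-based positions where the reference has A and the read has G."""
    if len(reference) != len(read):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper()))
        if ref_base == "A" and read_base == "G"
    ]


def editing_frequency(a_reads: int, g_reads: int) -> float:
    """Fraction of reads at one site that carry G (edited) rather than A."""
    total = a_reads + g_reads
    return g_reads / total if total else 0.0


if __name__ == "__main__":
    # Toy sequences, not real let-7 sequence.
    print(candidate_editing_sites("GATTACA", "GGTTACG"))
    print(editing_frequency(a_reads=30, g_reads=10))
```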
ADAR1 is known to promote cancer progression and resistance to therapy. To study ADAR1, the UCSD team used human blast crisis chronic myeloid leukemia (CML) cells in the lab, and mice transplanted with these cells, to determine the enzyme’s role in governing leukemia stem cells.
The scientists, led by Catriona Jamieson, M.D., Ph.D., published their work June 9 in Cell Stem Cell, in an article entitled, “ADAR1 Activation Drives Leukemia Stem Cell Self-Renewal by Impairing Let-7 Biogenesis.” The article presented the first mechanistic link between pro-cancer inflammatory signals and RNA editing–driven reprogramming of precursor cells into leukemia stem cells.
The article describes how ADAR1-mediated A-to-I RNA editing is activated by Janus kinase 2 (JAK2) signaling and BCR-ABL1 signaling. Also, it indicated, in a model of blast crisis (BC) CML, that combined JAK2 and BCR-ABL1 inhibition prevents leukemia stem cell self-renewal commensurate with ADAR1 downregulation.
Essentially, the scientists were able to trace a series of molecular events: First, white blood cells with a leukemia-promoting gene mutation become more sensitive to signs of inflammation. That inflammatory response activates ADAR1. Then, hyper-ADAR1 editing slows down the miRNAs known as let-7. Ultimately, this activity increases cellular regeneration, or self-renewal, turning white blood cell precursors into leukemia stem cells.
“Lentiviral ADAR1 wild-type, but not an editing-defective ADAR1E912A mutant, induces self-renewal gene expression and impairs biogenesis of stem cell regulatory let-7 microRNAs,” wrote the authors of the Cell Stem Cell article. “Combined RNA sequencing, qRT-PCR, CLIP-ADAR1, and pri-let-7 mutagenesis data suggest that ADAR1 promotes LSC generation via let-7 pri-microRNA editing and LIN28B upregulation.”
After learning how the ADAR1 system works, Dr. Jamieson’s team looked for a way to stop it. By inhibiting sensitivity to inflammation or inhibiting ADAR1 with a small-molecule tool compound, the researchers were able to counter ADAR1’s effect on leukemia stem cell self-renewal and restore let-7. Self-renewal of blast crisis CML cells was reduced by approximately 40% when treated with the small molecule called 8-Aza as compared to untreated cells.
“A small-molecule tool compound antagonizes ADAR1’s effect on LSC self-renewal in stromal co-cultures and restores let-7 biogenesis,” the study’s authors noted. “Thus, ADAR1 activation represents a unique therapeutic vulnerability in LSCs with active JAK2 signaling.”
“In this study, we showed that cancer stem cells co-opt a RNA editing system to clone themselves. What’s more, we found a method to dial it down,” said Dr. Catriona Jamieson. “Based on this research, we believe that detecting ADAR1 activity will be important for predicting cancer progression.
“In addition, inhibiting this enzyme represents a unique therapeutic vulnerability in cancer stem cells with active inflammatory signaling that may respond to pharmacologic inhibitors of inflammation sensitivity or selective ADAR1 inhibitors that are currently being developed.”
Maria Anna Zipeto, Angela C. Court, Anil Sadarangani, Nathaniel P. Delos Santos, Larisa Balaian, Hye-Jung Chun, Gabriel Pineda, Sheldon R. Morris, Cayla N. Mason, Ifat Geron, Christian Barrett, Daniel J. Goff, Russell Wall, Maurizio Pellecchia, Mark Minden, Kelly A. Frazer, Marco A. Marra, Leslie A. Crews, Qingfei Jiang, Catriona H.M. Jamieson
• JAK2 and BCR-ABL1 signaling converge on ADAR1 activation through STAT5a
• ADAR1-mediated microRNA editing impairs let-7 biogenesis and enhances LSC self-renewal
• JAK2 and BCR-ABL1 inhibition reduces ADAR1 expression and prevents LSC self-renewal
Post-transcriptional adenosine-to-inosine RNA editing mediated by adenosine deaminase acting on RNA1 (ADAR1) promotes cancer progression and therapeutic resistance. However, ADAR1 editase-dependent mechanisms governing leukemia stem cell (LSC) generation have not been elucidated. In blast crisis chronic myeloid leukemia (BC CML), we show that increased JAK2 signaling and BCR-ABL1 amplification activate ADAR1. In a humanized BC CML mouse model, combined JAK2 and BCR-ABL1 inhibition prevents LSC self-renewal commensurate with ADAR1 downregulation. Lentiviral ADAR1 wild-type, but not an editing-defective ADAR1E912A mutant, induces self-renewal gene expression and impairs biogenesis of stem cell regulatory let-7 microRNAs. Combined RNA sequencing, qRT-PCR, CLIP-ADAR1, and pri-let-7 mutagenesis data suggest that ADAR1 promotes LSC generation via let-7 pri-microRNA editing and LIN28B upregulation. A small-molecule tool compound antagonizes ADAR1’s effect on LSC self-renewal in stromal co-cultures and restores let-7 biogenesis. Thus, ADAR1 activation represents a unique therapeutic vulnerability in LSCs with active JAK2 signaling.
Chronic Myeloid Leukemia: Biology and Pathophysiology, excluding Therapy
Program: Oral and Poster Abstracts
Session: 631. Chronic Myeloid Leukemia: Biology and Pathophysiology, excluding Therapy: Poster III
Monday, December 7, 2015, 6:00 PM-8:00 PM
Hall A, Level 2 (Orange County Convention Center)
Maria Anna Zipeto, Ph.D1*, Angela Court Recart2*, Nathaniel Delos Santos3*, Qingfei Jiang, PhD4*, Leslie A Crews, PhD3* and Catriona HM Jamieson, MD, PhD3
1Sanford Consortium for Regenerative Medicine, University of California San Diego, La Jolla, CA 2University of California San Diego, LA JOLLA, CA 3Division of Regenerative Medicine, University of California, San Diego, La Jolla, CA 4University of California San Diego, La Jolla, CA
Background
In advanced human malignancies, RNA sequencing (RNA-seq) has uncovered deregulation of adenosine deaminase acting on RNA (ADAR) editases that promote therapeutic resistance and leukemia stem cell (LSC) generation. Chronic myeloid leukemia (CML), an important paradigm for understanding LSC evolution, is initiated by BCR-ABL1 oncogene expression in hematopoietic stem cells (HSCs) but undergoes blast crisis (BC) transformation following aberrant self-renewal acquisition by myeloid progenitors harboring cytokine-responsive ADAR1 p150 overexpression. Emerging evidence suggests that adenosine-to-inosine editing at the level of primary (pri) or precursor (pre)-microRNA (miRNA) alters and impairs miRNA biogenesis. However, relatively little is known about the role of inflammatory niche-driven ADAR1 miRNA editing in malignant reprogramming of progenitors into self-renewing LSCs.
Methods
Primary normal and CML progenitors were FACS-purified, and RNA-seq analysis as well as qRT-PCR validation were performed according to published methods (Jiang, 2013). miRNAs were extracted from purified CD34+ cells derived from CP, BC CML and cord blood with the RNeasy Micro Kit (QIAGEN), and let-7 expression was evaluated by qRT-PCR using the miScript Primer Assay (QIAGEN). CD34+ cord blood cells (n=3) were transduced with lentiviral human JAK2, let-7a, wt-ADAR1 and mutant ADAR1, which lacks a functional deaminase domain. Because STAT signaling triggers ADAR1 transcriptional activation and both BCR-ABL1 and JAK2 activate STAT5a, nanoproteomics analysis of STAT5a levels was performed. Engrafted immunocompromised RAG2-/-γc-/- mice were treated with a JAK2 inhibitor, SAR302503, alone or in combination with a potent BCR-ABL1 TKI, dasatinib, for two weeks, followed by FACS analysis of human progenitor engraftment in hematopoietic tissues and serial transplantation.
Results
RNA-seq and qRT-PCR analysis in FACS-purified BC CML progenitors revealed an over-representation of inflammatory pathway activation and higher levels of JAK2-dependent inflammatory cytokine receptors when compared to normal and chronic phase (CP) progenitors. Moreover, RNA-seq and qRT-PCR analysis showed decreased levels of the mature let-7 family of stem cell regulatory miRNAs in BC compared to normal and CP progenitors. Lentiviral human JAK2 transduction of CD34+ progenitors led to an increase in ADAR1 transcript levels and to a reduction in let-7 family members. Interestingly, lentiviral human JAK2 transduction of normal progenitors enhanced ADAR1 activity, as revealed by RNA editing-specific qRT-PCR and RNA-seq analysis. Moreover, qRT-PCR analysis showed that transduction of CD34+ progenitors with wt-ADAR1, but not mutant ADAR1 lacking functional deaminase activity, reduced let-7 miRNA levels. These data suggested that ADAR1 impairs let-7 family biogenesis in an RNA editing-dependent manner. Interestingly, RNA-seq analysis confirmed a higher frequency of A-to-I editing events in pri- and pre-let-7 family members in CD34+ BC compared to CP progenitors, as well as in normal progenitors transduced with human JAK2 and ADAR1-wt, but not mutant ADAR1. Lentiviral ADAR1 overexpression enhanced CP CML progenitor self-renewal and decreased levels of some members of the let-7 family. In contrast, lentiviral transduction of human let-7a significantly reduced self-renewal of progenitors. In vivo treatment with dasatinib in combination with a JAK2 inhibitor significantly reduced self-renewal of BCR-ABL1-expressing BC progenitors in the bone marrow, thereby prolonging survival of serially transplanted mice. Finally, a reduction in ADAR1 p150 transcripts was noted following combination treatment only, suggesting a role for ADAR1 in CSC propagation.
Conclusion
This is the first demonstration that intrinsic BCR-ABL oncogenic signaling and extrinsic cytokine signaling through JAK2 converge on activation of ADAR1, which drives LSC generation by impairing let-7 miRNA biogenesis. Targeted reversal of ADAR1-mediated miRNA editing may enhance eradication of inflammatory niche-resident cancer stem cells in a broad array of malignancies, including JAK2-driven myeloproliferative neoplasms.
Disclosures: Jamieson: J&J: Research Funding; GSK: Research Funding.
Interferon Receptor Signaling in Malignancy: a Network of Cellular Pathways Defining Biological Outcomes
Interferons (IFNs) are cytokines with important anti-proliferative activity and exhibit key roles in immune surveillance against malignancies. Early work initiated over 3 decades ago led to the discovery of IFN receptor activated Jak-Stat pathways and provided important insights into mechanisms for transcriptional activation of interferon stimulated genes (ISGs) that mediate IFN-biological responses. Since then, additional evidence has established critical roles for other receptor activated signaling pathways in the induction of IFN-activities. These include MAPK pathways, mTOR cascades and PKC pathways. In addition, specific microRNAs (miRNAs) appear to play a significant role in the regulation of IFN-signaling responses. This review focuses on the emerging evidence for a model in which IFNs share signaling elements and pathways with growth factors and tumorigenic signals, but engage them in a distinctive manner to mediate anti-proliferative and antiviral responses.
Because of their antineoplastic, antiviral, and immunomodulatory properties, recombinant interferons (IFNs) have been used extensively in the treatment of various diseases in humans (1). IFNs have clinical activity against several malignancies and are actively used in the treatment of solid tumors, such as malignant melanoma and renal cell carcinoma, and hematological malignancies, such as myeloproliferative neoplasms (MPNs) (1). In addition, IFNs play prominent roles in the treatment of viral syndromes, such as hepatitis B and C (2). In contrast to their beneficial therapeutic properties, IFNs have also been implicated in the pathophysiology of certain diseases in humans. In many cases this involvement reflects abnormal activation of the endogenous IFN system, which has important roles in various physiological processes. Diseases in which dysregulation of the Type I IFN system has been implicated as a pathogenetic mechanism include autoimmune disorders such as systemic lupus erythematosus (3), Sjogren’s syndrome (3,4), dermatomyositis (5) and systemic sclerosis (3, 4). In addition, Type II IFN (IFNγ) overproduction has been implicated in bone marrow failure syndromes, such as aplastic anemia (6). There is also recent evidence for opposing actions of distinct IFN subtypes in the pathophysiology of certain diseases. For instance, a recent study demonstrated that there is an inverse association between IFNβ and IFNγ gene expression in human leprosy, consistent with opposing functions between Type I and II IFNs in the pathophysiology of this disease (7). Thus, differential targeting of components of the IFN-system, to either promote or block induction of IFN-responses depending on the disease context, may be useful in the therapeutic management of various human illnesses.
The emerging evidence for the complex regulation of the IFN-system underscores the need for a detailed understanding of the mechanisms of IFN-signaling in order to target IFN-responses effectively and selectively.
It took over 35 years from the original discovery of IFNs in 1957 to the discovery of Jak-Stat pathways (8). The identification of the functions of Jaks and Stats dramatically advanced our understanding of the mechanisms of IFN-signaling and had a broad impact on the cytokine research field as a whole, as it led to the identification of similar pathways from other cytokine receptors (8). Subsequently, several other IFN receptor (IFNR)-regulated pathways were identified (9). As discussed below, in recent years there has been accumulating evidence that beyond Stats, non-Stat pathways play important and essential roles in IFN-signaling. This has led to an evolution of our understanding of the complexity associated with IFN receptor activation and how interacting signaling networks determine the relevant IFN response.
Interferons and their functions
The interferons are classified in 3 major categories, Type I (α, β, ω, ε, τ, κ, ν); Type II (γ) and Type III IFNs (λ1, λ2, λ3) (1, 9, 10). The largest IFN-gene family is the group of Type I IFNs. This family includes 14 IFNα genes, one of which is a pseudogene, resulting in the expression of 13 IFNα protein subtypes (1, 9). There are 3 distinct IFNRs that are specific for the 3 different IFN types. All Type I IFN subtypes bind to and activate the Type I IFNR, while Type II and III IFNs bind to and activate the Type II and III IFNRs, respectively (9–11). It should be noted that although all the different Type I IFNs bind to and activate the Type I IFNR, differences in binding to the receptor may account for specific responses and biological effects (9). For instance, a recent study provided evidence that direct binding of mouse IFNβ to the Ifnar1 subunit, in the absence of Ifnar2, regulates engagement of signals that control expression of genes specifically induced by IFNβ, but not IFNα (12). This recent discovery followed original observations from the 90s that revealed differential interactions between the different subunits of the Type I IFN receptor in response to IFNβ binding as compared to IFNα binding and partially explained observed differences in functional responses between different Type I IFNs (9).
A common property of all IFNs, independently of type and subtype, is the induction of antiviral effects in vitro and in vivo (1). Because of their potent antiviral properties, IFNs constitute an important element of the immune defense against viral infections. There is emerging information indicating that specificity of the antiviral response is cell type dependent and/or reflects specific tissue expression of certain IFNs. As an example, a recent comparative analysis of the involvement of the Type I IFN system as compared to the Type III IFN system in antiviral protection against rotavirus infection of intestinal epithelial cells demonstrated an almost exclusive requirement for IFNλ (Type III IFN) (13). The antiviral effects of IFNα have led to the introduction of this cytokine in the treatment of hepatitis C and B in humans (2) and different viral genotypes have been associated with response or failure to IFN-therapy (14).
Most importantly, IFNs exhibit important antineoplastic effects, reflecting both direct antiproliferative responses mediated by IFNRs expressed on malignant cells and indirect immunomodulatory effects (15). IFNα and its pegylated form (peg IFNα) have been widely used in the treatment of several neoplastic diseases, such as hairy cell leukemia (HCL), chronic myeloid leukemia (CML), cutaneous T cell lymphoma (CTCL), renal cell carcinoma (RCC), malignant melanoma, and myeloproliferative neoplasms (MPNs) (1, 16). Although the emergence of new targeted therapies and more effective agents has minimized the use of IFNs in the treatment of diseases like HCL and CML, IFNs are still used extensively in the treatment of melanoma, CTCL and MPNs (1, 16, 17). Notably, recent studies have provided evidence for long lasting molecular responses in patients with polycythemia vera (PV), essential thrombocytosis (ET) and myelofibrosis (MF) who were treated with IFNα (16). Beyond their inhibitory properties on malignant hematopoietic progenitors, IFNs are potent regulators of normal hematopoiesis (9) and contribute to the regulation of normal homeostasis in the human bone marrow (18). Related to its effects in the central nervous system, IFNβ has clinical activity in multiple sclerosis (MS) and has been used extensively for the treatment of patients with MS (19). The immunoregulatory properties of Type I IFNs include key roles in the control of innate and adaptive immune responses, as well as positive and negative effects on the activation of the inflammasome (15). Dysregulation of the Type I IFN response is seen in certain autoimmune diseases, such as Aicardi-Goutières syndrome (20). In fact, self-amplifying Type I IFN-production is a key pathophysiological mechanism in autoimmune syndromes (21). There is also emerging evidence that IFNλ may contribute to the IFN signature in autoimmune diseases (3).
Jak-Stat pathways
Jak kinases and DNA binding Stat-complexes
Tyrosine kinases of the Janus family (Jaks) are associated in unique combinations with different IFNRs and their functions are essential for IFN-inducible biological responses. Stats are transcriptional activators whose activation depends on tyrosine phosphorylation by Jaks (8, 9). In the case of the Type I IFN receptor, Tyk2 and Jak1 are constitutively associated with the IFNAR1 and IFNAR2 subunits, respectively (8, 9) (Fig. 1). For the Type II IFN receptor, Jak1 and Jak2 are associated with the IFNGR1 and IFNGR2 receptor subunits, respectively (8, 9) (Fig. 1). Finally, in the case of the Type III IFNR, Jak1 and Tyk2 are constitutively associated with the IFN-λR1 and IL-10R2 receptor chains, respectively (10) (Fig. 1). Upon engagement of the different IFNRs by the corresponding ligands, the kinase domains of the associated Jaks are activated and phosphorylate tyrosine residues in the intracellular domains of the receptor subunits that serve as recruitment sites for specific Stat proteins. Subsequently, the Jaks phosphorylate Stat proteins that form unique complexes and translocate to the nucleus where they bind to specific sequences in the promoters of ISGs to initiate transcription. A major Stat complex in IFN-signaling is the interferon stimulated gene factor 3 (ISGF3) complex. This IFN-inducible complex is composed of Stat1, Stat2 and IRF9 and regulates transcription by binding to IFN stimulated response elements (ISRE) in the promoters of a large group of IFN stimulated genes (ISGs) (8, 9). ISGF3 complexes are induced during engagement of the Type I and III IFN receptors, but not in response to activation of Type II IFN receptors (8–10) (Table 1). Beyond ISGF3, several other Stat-complexes involving different Stat homodimers or heterodimers are activated by IFNs and bind to IFNγ-activated (GAS) sequences in the promoters of groups of ISGs (8, 9).
Such GAS binding complexes are induced by all different IFNs (I, II and III), although there is variability in the engagement and utilization of different Stats by the different IFN-receptors (Table 1). It should also be noted that engagement of certain Stats, such as Stat4 and Stat6, is cell type-specific and may be relevant for tissue specific functions (9). The significance of different Stat binding complexes in the induction of Type I and II IFN responses was in part addressed in a study in which Stat1 cooperative DNA binding was disrupted by generating knock-in mice expressing cooperativity-deficient STAT1 (22). As expected, Type II IFN-induced gene transcription and antibacterial responses were essentially lost in these mice, but Type I IFN-dependent recruitment of Stat1 to ISRE elements and antiviral responses were not affected (22), demonstrating the existence of important differences in Stat1 cooperative DNA binding between Type I and II IFN signaling.
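The receptor-kinase pairings and ISGF3 usage described above lend themselves to a small lookup structure. The Python sketch below simply transcribes the associations stated in the text; the dictionary layout and the helper names are illustrative choices, not part of the review.

```python
from typing import Dict, List

# Receptor subunit -> constitutively associated Janus kinase, as stated in the text.
IFN_RECEPTORS: Dict[str, Dict] = {
    "Type I": {
        "subunit_kinase": {"IFNAR1": "Tyk2", "IFNAR2": "Jak1"},
        "induces_ISGF3": True,
    },
    "Type II": {
        "subunit_kinase": {"IFNGR1": "Jak1", "IFNGR2": "Jak2"},
        "induces_ISGF3": False,
    },
    "Type III": {
        "subunit_kinase": {"IFN-λR1": "Jak1", "IL-10R2": "Tyk2"},
        "induces_ISGF3": True,
    },
}

# ISGF3 binds ISRE elements and is composed of these three proteins.
ISGF3_COMPONENTS = ("Stat1", "Stat2", "IRF9")


def receptor_types_using(kinase: str) -> List[str]:
    """List the IFN receptor types whose subunits associate with the given kinase."""
    return [
        ifn_type
        for ifn_type, info in IFN_RECEPTORS.items()
        if kinase in info["subunit_kinase"].values()
    ]


def isgf3_inducing_types() -> List[str]:
    """List the IFN receptor types reported to induce ISGF3 complexes."""
    return [t for t, info in IFN_RECEPTORS.items() if info["induces_ISGF3"]]


if __name__ == "__main__":
    print(receptor_types_using("Jak1"))
    print(receptor_types_using("Jak2"))
    print(isgf3_inducing_types())
```

Laid out this way, it is easy to see that Jak1 is shared by all three receptor types, whereas Jak2 is specific to the Type II receptor.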
Type I, II, III interferon receptor subunits, associated kinases of the Janus family, and effector Stat-pathways. Note: Stat:Stat reflects multiple potential Stat:Stat complexes, as outlined in Table 2.
Different Stat-DNA binding complexes induced by Type I, II and III IFNs.
Serine phosphorylation of Stats
The nuclear translocation of Stat-proteins occurs after their activation, following phosphorylation on specific sites by Jak kinases (8, 9). It is well established that phosphorylation on tyrosine 701 is required for activation of Stat1 and phosphorylation on tyrosine 705 is required for activation of Stat3 (8, 9). Beyond tyrosine phosphorylation, phosphorylation on serine 727 in the Stat1 and Stat3 transactivation domains is required for full and optimal transcriptional activation of ISGs (8, 9). There is evidence that serine phosphorylation occurs after the phosphorylation of Stat1 on tyrosine 701 and that translocation to the nucleus and recruitment to the chromatin are essential in order for Stat1 to undergo serine 727 phosphorylation (23). Several IFN-dependent serine kinases for Stat1 have been described, raising the possibility that this phosphorylation occurs in a cell type specific manner. After the original demonstration that protein kinase C (PKC) delta (PKCδ) is a serine kinase for Stat1 and is required for optimal transcriptional activation in response to IFNα (24), extensive work has confirmed the role of this PKC isoform in the regulation of serine 727 phosphorylation in Stat1 and has been extended to different cellular systems (25–29) (Table 2). In the Type II IFN system five different serine kinases for the transactivation domain (TAD) of Stat1/phosphorylation on serine 727 have been demonstrated in different cell systems. …..
Protein tyrosine phosphatases with regulatory effects on Jak-Stat pathways in IFN-signaling.
…….
MicroRNAs (miRs) and the IFN response
IFN-inducible JAK-STAT, MAPK and mTOR signaling cascades are also potentially regulated by microRNAs (miRs). miRs are important regulators of post-transcriptional events, leading to inhibition of mRNA translation or mRNA degradation (105). In recent years it has become apparent that the direct regulation of STAT activity by miRs has profound effects on consequent gene expression, specifically in the context of cytokine-inducible events (106). Pertinent for this review of IFN-inducible STAT activation, miR-145, miR-146A and miR-221/222 target STAT1 and miR-221/222 target STAT2 (106). Numerous studies describe different miRs that target STAT3: miR-17, miR-17-5p, miR-17-3p, miR-18a, miR-19b, miR-92-1, miR-20b, Let-7a, miR-106a, miR-106-25, miR-106a-362 and miR-125b (106) (Fig. 4). miR-132, miR-212 and miR-200a have been implicated in negatively regulating STAT4 expression in human NK cells (107) and miR-222 has been shown to regulate STAT5 expression (108). In addition, JAK-STAT signaling is affected by miR targeting of suppressors of cytokine signaling (SOCS) proteins. miR-122 and miR-155 targeting of SOCS1 releases the inhibition of STAT1 (and STAT5a/b) (109–111), and miR-19a regulation of SOCS1 and SOCS3 effectively prolongs activation of both STAT1 and STAT3 (112). There is also evidence that miR-155 targets the inositol phosphatase SHIP1, effectively prolonging/inducing IFN-γ expression (113). Much of the evidence associated with miRs prolonging JAK-STAT activation relates to cancer studies, where tumor-secreted miRs promote cell migration and angiogenesis by prolonging JAK-STAT activation (114). miR-145 targeting of SOCS7 affects nuclear translocation of STAT3 and has been associated with enhanced IFNβ production (115). Beyond inhibition of SOCS proteins, miRs may influence the expression of other inhibitory factors associated with JAK-STAT signaling, and miR-301a and miR-18a have been shown to inhibit PIAS3, a negative regulator of STAT3 activation (116).
There is also the potential for STATs to directly regulate miR gene expression. STAT5 suppresses expression of miR-15/16 (117) and there is evidence that there are potential STAT3 binding sites in the promoters of about 200 miRs (118). Viewed altogether, there is compelling evidence for miR-STAT interactions, yet few studies have considered the contributions of miRs to IFN-inducible JAK-STAT signaling.
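One way to work with the miR-target relationships listed above is to store them as a mapping and invert it, so that the regulators of a given signaling protein can be looked up directly. The Python sketch below transcribes only a subset of the pairs named in the text; the variable and function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List

# Partial transcription of miR -> target pairs mentioned in the text.
MIR_TARGETS: Dict[str, List[str]] = {
    "miR-145": ["STAT1", "SOCS7"],
    "miR-146A": ["STAT1"],
    "miR-221/222": ["STAT1", "STAT2"],
    "miR-122": ["SOCS1"],
    "miR-155": ["SOCS1", "SHIP1"],
    "miR-19a": ["SOCS1", "SOCS3"],
    "miR-18a": ["STAT3", "PIAS3"],
    "miR-301a": ["PIAS3"],
}


def regulators_by_target(mir_targets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a miR -> targets mapping into target -> sorted list of miRs."""
    inverted: Dict[str, List[str]] = defaultdict(list)
    for mir, targets in mir_targets.items():
        for target in targets:
            inverted[target].append(mir)
    return {target: sorted(mirs) for target, mirs in inverted.items()}


if __name__ == "__main__":
    by_target = regulators_by_target(MIR_TARGETS)
    print("SOCS1 is targeted by:", by_target["SOCS1"])
    print("STAT1 is targeted by:", by_target["STAT1"])
```

The inverted view makes convergence visible, for example several miRs acting on SOCS1, which the text links to release of STAT1 inhibition.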
Targeting and regulation of various proteins known to be involved in IFN-signaling by different miRNAs. ….
Evolution of our understanding of IFN-signals and future perspectives
A substantial amount of knowledge has accumulated since the original discovery of the Jak-Stat pathway in the early 90s. It is now clear that several key signaling cascades are essential for the induction of Type I, II and III IFN-responses. The original view that IFN-signals can be transmitted from the cell surface to the nucleus in two simple steps involving tyrosine phosphorylation of Stat proteins (8) now appears somewhat simplistic, as it has been established that modifications of Jak-Stat signals by other pathways and/or simultaneous engagement of other complementary cellular cascades are required for induction of ISG transcriptional activation, mRNA translation, protein expression and subsequent induction of IFN-responses. Such pathways include PKC and MAP kinase pathways and mTORC1 and mTORC2-dependent signaling cascades.
Over the next decade our understanding of the mechanisms by which IFN-signals are induced will likely continue to evolve, with the anticipated outcome that it will be possible to exploit this new knowledge for translational-therapeutic purposes. For instance, selective targeting of kinase-elements of the IFN-pathway with kinase inhibitors may be useful in the treatment of autoimmune diseases where dysregulated/excessive Type I IFN production contributes to the pathophysiology of disease. On the other hand, efforts to promote the induction of specific IFN-signals may lead to novel, less toxic, therapeutic interventions for a variety of viral infectious diseases and neoplastic disorders.
Exploring the RNA World in Hematopoietic Cells Through the Lens of RNA-Binding Proteins
The discovery of microRNAs has renewed interest in post-transcriptional modes of regulation, fueling an emerging view of a rich RNA world within our cells that deserves further exploration. Much work has gone into elucidating genetic regulatory networks that orchestrate gene expression programs and direct cell fate decisions in the hematopoietic system. However, the focus has been to elucidate signaling pathways and transcriptional programs. To bring us one step closer to reverse engineering the molecular logic of cellular differentiation, it will be necessary to map post-transcriptional circuits as well and integrate them in the context of existing network models. In this regard, RNA-binding proteins (RBPs) may rival transcription factors as important regulators of cell fates and represent a tractable opportunity to connect the RNA world to the proteome. ChIP-seq has greatly facilitated genome-wide localization of DNA-binding proteins, helping us to understand genomic regulation at a systems level. Similarly, technological advances such as CLIP-seq allow transcriptome-wide mapping of RBP binding sites, aiding us to unravel post-transcriptional networks. Here, we review RBP-mediated post-transcriptional regulation, paying special attention to findings relevant to the immune system. As a prime example, we highlight the RBP Lin28B, which acts as a heterochronic switch between fetal and adult lymphopoiesis.
The basis of cellular differentiation and function can be represented as integrated circuits that are genetically programmed. Identification of the master regulators within these complex circuits that can switch on or off a genetic program will enable us to reprogram cells to suit biomedical needs. A remarkable example was the discovery by Takahashi and Yamanaka (1) that somatic cells could be reprogrammed into induced pluripotent stem (iPS) cells via the ectopic expression of four key transcription factors. Interestingly, a specific set of microRNAs (miRNAs) could also mediate this reprogramming (2, 3), revealing a powerful layer of post-transcriptional regulation that is able to override a pre-existing transcriptional program (4). Similarly, miR-9 and miR-124 were sufficient to mediate transdifferentiation of human fibroblasts into neurons (5). Accordingly, we are enamored by the RNA world and pay special attention in our investigations to regulatory non-coding RNAs (ncRNAs), particularly miRNAs and long non-coding RNAs (lncRNAs) and how they integrate with known genetic regulatory networks (Fig. 1). With the exception of certain ribozymes, regulatory RNAs generally do not work alone. Instead, they are physically organized as RNA-protein (RNP) complexes. Operationally, RNA-binding proteins (RBPs) and their interactome work in concert as post-transcriptional networks, or RNA regulons, in response to developmental and environmental cues (6). Inspired by this concept and other pioneering studies in the worm, we recently demonstrated that a single RBP Lin28 was sufficient to reprogram adult hematopoietic progenitors to adopt fetal-like properties (7). We discuss these and related findings, which begin to disentangle the complex functions of RBPs in the context of recent advances in post-transcriptional regulation, starting with the discovery of miRNAs.
Updated model of gene regulation that integrates RBPs and ncRNAs
The Lin28/let-7 circuit: from worm development to lymphopoiesis
Inspiration from the worm
Working in C. elegans, Ambros and Horvitz (8) identified a set of genes that control developmental timing, a category that they termed heterochronic genes. Heterochrony is a term coined by evolutionary biologists and popularized by the worm community to denote events that either positively or negatively regulate developmental timing in multicellular organisms. The discovery of two heterochronic genes, lin-4 and lin-28, which encode a miRNA and RBP respectively, is particularly relevant to this review. The lineage (lin) mutants were previously identified and named because they displayed abnormalities in cell lineage differentiation. Furthermore, some of them were considered heterochronic, as adult mutants harbored immature characteristics (retarded phenotype) or, conversely, larval mutants displayed adult characteristics (precocious phenotype). It was not until 1993 that lin-4 was characterized molecularly, because contrary to popular expectations, the gene did not encode a protein but instead a small RNA now appreciated as the first miRNA to be discovered (9). The lin-4 miRNA acts in part by inhibiting the expression of the LIN-14 transcription factor through imperfect basepairing to sites in the 3′ untranslated region (UTR) of lin-14 mRNA (9, 10). However, it was not apparent initially whether lin-4 or lin-14 is evolutionarily conserved, potentially relegating these findings to be relevant only to the worm. Interestingly, Lin28, a gene conserved in mammals, was later identified to be a direct target of the lin-4 miRNA (11). Lin28 loss-of-function resulted in a precocious phenotype, whereas gain-of-function resulted in a retarded phenotype; thus, Lin28 acts as a heterochronic switch during C. elegans larval development (11).
The possibility that lin-4 may be an oddity of the worm was dissolved with the discovery of the second miRNA, again in C. elegans, let-7 (12). Unlike lin-4, the evolutionary conservation of let-7 from sea urchin to human was quickly appreciated (13). Importantly, expression analysis showed that let-7 expression is temporally regulated from molluscs to vertebrates in all three major clades of bilaterian animals, implying that its role as a developmental timekeeper is conserved (14). This established miRNAs as a field unto itself that has progressed rapidly with the identification of Drosha, Dgcr8, Dicer, and Argonaute (Ago) RBPs as core components of the miRNA pathway (15). Orthologs of lin-4 were eventually found in mammals (mir-125a, -b-1, and -b-2) (16) along with hundreds of novel miRNAs from numerous organisms (17). We now recognize that miRNAs, in complex with the RBP Ago, frequently bind their cognate targets via imperfect complementarity to evolutionarily conserved sequences in 3′ UTRs (18–20) and mediate post-transcriptional repression (21).
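To make the idea of miRNA target recognition in 3′ UTRs concrete, here is a minimal Python sketch that scans a UTR for perfect matches to a miRNA "seed" (nucleotides 2 to 8 of the miRNA). This is a deliberate simplification: real pairing is imperfect outside the seed, as the text notes, and dedicated prediction tools weigh many more features. The let-7a sequence is taken here as the commonly cited mature sequence and should be checked against a reference database before use; the UTR and the function names are hypothetical.

```python
from typing import List

def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def seed_sites(mirna: str, utr: str, seed_start: int = 2, seed_end: int = 8) -> List[int]:
    """Return 0-based UTR positions that perfectly match the miRNA seed.

    The seed is taken as miRNA nucleotides seed_start..seed_end (1-based,
    inclusive). The target site on the mRNA is its reverse complement.
    """
    seed = mirna.upper()[seed_start - 1:seed_end]
    site = reverse_complement_rna(seed)
    utr = utr.upper().replace("T", "U")  # accept DNA-style input as well
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]

# Assumed mature let-7a sequence (verify before relying on it).
LET7A = "UGAGGUAGUAGGUUGUAUAGUU"
# Hypothetical toy 3' UTR containing two seed-match sites.
TOY_UTR = "AAGCUACCUCAUUUGGCACUACCUCAGG"

if __name__ == "__main__":
    print(seed_sites(LET7A, TOY_UTR))
```

A seed-only scan like this over-predicts targets; it is meant only to show why short conserved motifs in 3′ UTRs matter for miRNA-mediated repression.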
…..
One diverse group of RBPs appreciated to be important in the immune system, even before the discovery of miRNAs, is distinguished by their ability to bind to AU-rich elements (AREs) often found in 3′ UTRs of genes involved in inflammation, growth, and survival. Such RBPs are known as ARE-BPs and have been implicated in mRNA decay, alternative splicing, translation, as well as both alleviating and enhancing miRNA-mediated mRNA repression (104–107). Genetic inactivation of several ARE-BPs has been linked to aberrant cytokine expression due to impaired ARE-mediated decay (5, 108–111) (Table 1). In addition, deficiency of HuR and AUF1 has uncovered a pro-survival role for both in lymphocytes (112, 113), while ectopic expression of Tis11b (ZFP36L1) negatively regulates erythropoiesis by down-regulating Stat5b mRNA stability (114). The KH-type splicing regulatory protein (KSRP), originally identified as an alternative-splicing factor, is a multi-functional RBP. It has been shown to associate with both Drosha and Dicer complexes to positively regulate the biogenesis of a subset of miRNAs including mir-155 and let-7 (73, 108, 115–120). In addition, KSRP, like many other ARE-BPs, mediates selective decay of mRNAs by recruitment of exosome complexes to mRNA targets (121) and constitutes a prime example of a multi-functional RBP.
……
Biological processes involved in the development and function of the immune system require programmed changes in protein production and constitute prime candidates for post-transcriptional regulation. While the ENCODE project initially aimed to identify all functional elements in the human DNA sequence, recent discoveries centered around miRNAs and multi-tasking RBPs, such as Lin28, have highlighted the need for a similar systematic effort in mapping post-transcriptional functional elements within the transcriptome. Integration of genomic, transcriptomic, and proteomic data remains a daunting but necessary task to achieve understanding of the full impact of genetic programs and the enigmatic roles of regulatory RNAs. Mastering the science of (re)programming cell fates promises to unleash the potential of stem cells for Regenerative Medicine.
Gene Editing with CRISPR gets Crisper, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 2: CRISPR for Gene Editing and DNA Repair
Gene Editing with CRISPR gets Crisper
Curators: Larry H. Bernstein, MD, FCAP and Aviva Lev-Ari, PhD, RN
CRISPR Moves from Butchery to Surgery
More Genomes Are Going Under the CRISPR Knife, So Surgical Standards Are Rising
The Dharmacon subsidiary of GE Healthcare provides the Edit-R Lentiviral Gene Engineering platform. It is based on the natural S. pyogenes system, but unlike approaches that use a single guide RNA (sgRNA), the platform uses two component RNAs, a gene-specific CRISPR RNA (crRNA) and a universal trans-activating crRNA (tracrRNA). Once hybridized to the universal tracrRNA (blue), the crRNA (green) directs the Cas9 nuclease to a specific genomic region to induce a double-strand break.
Scientists recently convened at the CRISPR Precision Gene Editing Congress, held in Boston, to discuss the new technology. As with any new technique, scientists have discovered that CRISPR comes with its own set of challenges, and the Congress focused its discussion around improving specificity, efficiency, and delivery.
In the naturally occurring system, CRISPR-Cas9 works like a self-vaccination in the bacterial immune system by targeting and cleaving viral DNA sequences stored from previous encounters with invading phages. The endogenous system uses two RNA elements, CRISPR RNA (crRNA) and trans-activating RNA (tracrRNA), which come together and guide the Cas9 nuclease to the target DNA.
Early publications that demonstrated CRISPR gene editing in mammalian cells combined the crRNA and tracrRNA sequences to form one long transcript called a single-guide RNA (sgRNA). However, an alternative approach is being explored by scientists at the Dharmacon subsidiary of GE Healthcare. These scientists have a system that mimics the endogenous system through a synthetic two-component approach that preserves individual crRNA and tracrRNA. The tracrRNA is universal to any gene target or species; the crRNA contains the information needed to target the gene of interest.
Predesigned Guide RNAs
In contrast to sgRNAs, which are generated through either in vitro transcription of a DNA template or a plasmid-based expression system, synthetic crRNA and tracrRNA eliminate the need for additional cloning and purification steps. The efficacy of a guide RNA (gRNA), whether delivered as an sgRNA or as individual crRNA and tracrRNA, depends not only on DNA binding, but also on the generation of an indel that will deliver the coup de grâce to gene function.
“Almost all of the gRNAs were able to create a break in genomic DNA,” said Louise Baskin, senior product manager at Dharmacon. “But there was a very wide range in efficiency and in creating functional protein knock-outs.”
To remove the guesswork from gRNA design, Dharmacon developed an algorithm to predict gene knockout efficiency using wet-lab data. They also incorporated specificity as a component of their algorithm, using a much more comprehensive alignment tool to predict potential off-target effects caused by mismatches and bulges often missed by other alignment tools. Customers can enter their target gene to access predesigned gRNAs as either two-component RNAs or lentiviral sgRNA vectors for multiple applications.
“We put time and effort into our algorithm to ensure that our guide RNAs are not only functional but also highly specific,” asserts Baskin. “As a result, customers don’t have to do any design work.”
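The specificity side of guide design can be illustrated with a small Python sketch. It is not Dharmacon's algorithm, which the article does not describe in detail; it only shows the basic idea of scanning a sequence for 20-nt sites followed by an NGG PAM (the PAM recognized by S. pyogenes Cas9) and counting mismatches against a guide. It checks one strand only and ignores bulges, which the article says a more comprehensive alignment tool would catch. All sequences and function names are hypothetical.

```python
from typing import Iterator, List, Tuple

def find_protospacers(dna: str, guide_len: int = 20) -> Iterator[Tuple[int, str]]:
    """Yield (position, site) for each site followed by an NGG PAM (forward strand only)."""
    dna = dna.upper()
    for i in range(len(dna) - guide_len - 2):
        pam = dna[i + guide_len:i + guide_len + 3]
        if pam[1:] == "GG":
            yield i, dna[i:i + guide_len]

def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def candidate_sites(guide: str, genome: str, max_mismatches: int = 3) -> List[Tuple[int, str, int]]:
    """Return (position, site, mismatch_count) for sites within the mismatch budget."""
    guide = guide.upper()
    hits = []
    for pos, site in find_protospacers(genome, len(guide)):
        mm = mismatches(guide, site)
        if mm <= max_mismatches:
            hits.append((pos, site, mm))
    return hits

# Hypothetical guide, a two-mismatch off-target site, and a toy "genome".
GUIDE = "GACGTTACCGGATTCAGCTA"
OFF_TARGET = "AACGTTACCGTATTCAGCTA"
TOY_GENOME = "TT" + GUIDE + "TGG" + "AAAA" + OFF_TARGET + "AGG" + "CC"

if __name__ == "__main__":
    for pos, site, mm in candidate_sites(GUIDE, TOY_GENOME):
        print(pos, site, mm)
```

A site with zero mismatches is the intended target; sites with a few mismatches next to a PAM are the ones a design tool would flag as potential off-targets.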
MilliporeSigma’s CRISPR Epigenetic Activator is based on fusion of a nuclease-deficient Cas9 (dCas9) to the catalytic histone acetyltransferase (HAT) core domain of the human E1A-associated protein p300. This technology allows researchers to target specific DNA regions or gene sequences. Researchers can localize epigenetic changes to their target of interest and see the effects of those changes in gene expression.
Knockout experiments are a powerful tool for analyzing gene function. However, for researchers who want to introduce DNA into the genome, guide design, donor DNA selection, and Cas9 activity are paramount to successful DNA integration. MilliporeSigma offers two formats for donor DNA: double-stranded DNA (dsDNA) plasmids and single-stranded DNA (ssDNA) oligonucleotides. The most appropriate format depends on cell type and length of the donor DNA. “There are some cell types that have immune responses to dsDNA,” said Gregory Davis, Ph.D., R&D manager, MilliporeSigma.
The ssDNA format can save researchers time and money, but it has a limited carrying capacity of approximately 120 base pairs.
In addition to selecting an appropriate donor DNA format, controlling where, how, and when the Cas9 enzyme cuts can affect gene-editing efficiency. Scientists are playing tug-of-war, trying to pull cells toward the preferred homology-directed repair (HDR) and away from the less favored nonhomologous end joining (NHEJ) repair mechanism.
One method to achieve this modifies the Cas9 enzyme to generate a nickase that cuts only one DNA strand instead of creating a double-strand break. Accordingly, MilliporeSigma has created a Cas9 paired-nickase system that promotes HDR, while also limiting off-target effects and increasing the number of sequences available for site-dependent gene modifications, such as disease-associated single nucleotide polymorphisms (SNPs).
“The best thing you can do is to cut as close to the SNP as possible,” advised Dr. Davis. “As you move the double-stranded break away from the site of mutation you get an exponential drop in the frequency of recombination.”
Ribonucleo-protein Complexes
Another strategy to improve gene-editing efficiency, developed by Thermo Fisher, involves combining purified Cas9 protein with gRNA to generate a stable ribonucleoprotein (RNP) complex. In contrast to plasmid- or mRNA-based formats, which require transcription and/or translation, the Cas9 RNP complex cuts DNA immediately after entering the cell. Rapid clearance of the complex from the cell helps to minimize off-target effects, and, unlike a viral vector, the transient complex does not introduce foreign DNA sequences into the genome.
To deliver their Cas9 RNP complex to cells, Thermo Fisher has developed a lipofectamine transfection reagent called CRISPRMAX. “We went back to the drawing board with our delivery, screened a bunch of components, and got a brand-new, fully optimized lipid nanoparticle formulation,” explained Jon Chesnut, Ph.D., the company’s senior director of synthetic biology R&D. “The formulation is specifically designed for delivering the RNP to cells more efficiently.”
Besides the reagent and the formulation, Thermo Fisher has also developed a range of gene-editing tools. For example, it has introduced the Neon® transfection system for delivering DNA, RNA, or protein into cells via electroporation. Dr. Chesnut emphasized the company’s focus on simplifying complex workflows by optimizing protocols and pairing everything with the appropriate up- and downstream reagents.
From Mammalian Cells to Microbes
One of the first sources of CRISPR technology was the Feng Zhang laboratory at the Broad Institute, which counted among its first licensees a company called GenScript. This company offers a gene-editing service called GenCRISPR™ to establish mammalian cell lines with CRISPR-derived gene knockouts.
“There are a lot of challenges with mammalian cells, and each cell line has its own set of issues,” said Laura Geuss, a marketing specialist at GenScript. “We try to offer a variety of packages that can help customers who have difficult-to-work-with cells.” These packages include both viral-based and transient transfection techniques.
However, the most distinctive service offered by GenScript is its microbial genome-editing service for bacteria (Escherichia coli) and yeast (Saccharomyces cerevisiae). The company’s strategy for gene editing in bacteria can enable seamless knockins, knockouts, or gene replacements by combining CRISPR with lambda red recombineering. Traditionally one of the most effective methods for gene editing in microbes, recombineering allows editing without restriction enzymes through in vivo homologous recombination mediated by a phage-based recombination system such as lambda red.
On its own, lambda red technology cannot target multiple genes, but when paired with CRISPR, it allows the editing of multiple genes with greater efficiency than is possible with CRISPR alone, as the lambda red proteins help repair double-strand breaks in E. coli. The ability to knock out different gene combinations makes GenScript’s microbial editing service particularly well suited for the optimization of metabolic pathways.
Pooled and Arrayed Library Strategies
Scientists are using CRISPR technology for applications such as metabolic engineering and drug development. Yet another application area benefitting from CRISPR technology is cancer research. Here, the use of pooled CRISPR libraries is becoming commonplace. Pooled CRISPR libraries can help detect mutations that affect drug resistance, and they can aid in patient stratification and clinical trial design.
Pooled screening uses proliferation or viability as a phenotype to assess how genetic alterations, resulting from the application of a pooled CRISPR library, affect cell growth and death in the presence of a therapeutic compound. The enrichment or depletion of different gRNA populations is quantified using deep sequencing to identify the genomic edits that result in changes to cell viability.
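The enrichment or depletion step can be sketched in a few lines of Python. The sketch below assumes gRNA read counts from deep sequencing before and after drug selection, normalizes each count to its sample total, and reports a log2 fold change per guide. The pseudocount, the function name, and the counts are illustrative assumptions; production pipelines add replicate handling and statistical testing that are not shown here.

```python
import math
from typing import Dict

def log2_fold_changes(before: Dict[str, int],
                      after: Dict[str, int],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Return log2(after fraction / before fraction) for each gRNA.

    Counts are normalized to the total reads of their sample so that
    differences in sequencing depth do not look like enrichment.
    The pseudocount avoids division by zero for guides that drop out.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    result = {}
    for guide, count in before.items():
        frac_before = (count + pseudocount) / total_before
        frac_after = (after.get(guide, 0) + pseudocount) / total_after
        result[guide] = math.log2(frac_after / frac_before)
    return result

if __name__ == "__main__":
    # Hypothetical read counts for three guides.
    pre = {"gRNA_A": 100, "gRNA_B": 100, "gRNA_C": 100}
    post = {"gRNA_A": 200, "gRNA_B": 75, "gRNA_C": 25}
    for guide, lfc in sorted(log2_fold_changes(pre, post).items()):
        print(guide, round(lfc, 2))
```

A positive value suggests the edit made cells more resistant to the compound (the guide is enriched); a negative value suggests sensitization (the guide is depleted).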
MilliporeSigma provides pooled CRISPR libraries ranging from the whole human genome to smaller custom pools for these gene-function experiments. For pharmaceutical and biotech companies, Horizon Discovery offers a pooled screening service, ResponderSCREEN, which provides a whole-genome pooled screen to identify genes that confer sensitivity or resistance to a compound. This service is comprehensive, taking clients from experimental design all the way through to suggestions for follow-up studies.
Horizon Discovery maintains a Research Biotech business unit that is focused on target discovery and enabling translational medicine in oncology. “Our internal backbone gives us the ability to provide expert advice demonstrated by results,” said Jon Moore, Ph.D., the company’s CSO.
In contrast to a pooled screen, where thousands of gRNAs are combined in one tube, an arrayed screen applies one gRNA per well, removing the need for deep sequencing and broadening the options for different endpoint assays. To establish and distribute a whole-genome arrayed lentiviral CRISPR library, MilliporeSigma partnered with the Wellcome Trust Sanger Institute. “This is the first and only arrayed CRISPR library in the world,” declared Shawn Shafer, Ph.D., functional genomics market segment manager, MilliporeSigma. “We were really proud to partner with Sanger on this.”
Pooled and arrayed screens are powerful tools for studying gene function. The appropriate platform for an experiment, however, will be determined by the desired endpoint assay.
The QX200 Droplet Digital PCR System from Bio-Rad Laboratories can provide researchers with an absolute measure of target DNA molecules for EvaGreen or probe-based digital PCR applications. The system, which can provide rapid, low-cost, ultra-sensitive quantification of both NHEJ- and HDR-editing events, consists of two instruments, the QX200 Droplet Generator and the QX200 Droplet Reader, and their associated consumables.
Finally, one last challenge for CRISPR lies in the detection and quantification of changes made to the genome post-editing. Conventional methods for detecting these alterations include gel methods and next-generation sequencing. While gel methods lack sensitivity and scalability, next-generation sequencing is costly and requires intensive bioinformatics.
To address this gap, Bio-Rad Laboratories developed a set of assay strategies to enable sensitive and precise edit detection with its Droplet Digital PCR (ddPCR) technology. The platform is designed to enable absolute quantification of nucleic acids with high sensitivity, high precision, and short turnaround time through massive droplet partitioning of samples.
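How droplet partitioning yields an absolute count can be shown with a short Python sketch. Digital PCR conventionally applies a Poisson correction: if a fraction p of droplets is positive, the mean number of target copies per droplet is estimated as -ln(1 - p). The droplet volume used below (0.85 nL) is an assumed nominal value, not a figure from the article, and the function names and droplet counts are hypothetical.

```python
import math

def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean copies per droplet from positive/total droplet counts."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total droplets")
    return -math.log(1.0 - positive / total)

def copies_per_microliter(positive: int, total: int,
                          droplet_volume_nl: float = 0.85) -> float:
    """Target concentration in the reaction, in copies per microliter.

    droplet_volume_nl is an assumed nominal droplet volume.
    """
    return copies_per_droplet(positive, total) / (droplet_volume_nl * 1e-3)

def edited_fraction(edited_positive: int, wildtype_positive: int, total: int) -> float:
    """Fraction of alleles carrying the edit, from a two-channel assay."""
    edited = copies_per_droplet(edited_positive, total)
    wildtype = copies_per_droplet(wildtype_positive, total)
    return edited / (edited + wildtype)

if __name__ == "__main__":
    # Hypothetical run: 15,000 accepted droplets.
    print(round(copies_per_microliter(3000, 15000), 1))
    print(round(edited_fraction(60, 3000, 15000), 4))
```

Because the estimate comes from counting droplets rather than from a standard curve, low-frequency edits can be reported as a direct fraction of alleles.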
Using a validated assay, a typical ddPCR experiment takes about five to six hours to complete. The ddPCR platform enables detection of rare mutations, and publications have reported detection of precise edits at a frequency of <0.05%, and of NHEJ-derived indels at a frequency as low as 0.1%. In addition to quantifying precise edits, indels, and computationally predicted off-target mutations, ddPCR can also be used to characterize the consequences of edits at the RNA level.
According to a recently published Science paper, the laboratory of Charles A. Gersbach, Ph.D., at Duke University used ddPCR in a study of muscle function in a mouse model of Duchenne muscular dystrophy. Specifically, ddPCR was used to assess the efficiency of CRISPR-Cas9 in removing the mutated exon 23 from the dystrophin gene. (Exon 23 deletion by CRISPR-Cas9 resulted in expression of the modified dystrophin gene and significant enhancement of muscle force.)
Quantitative ddPCR showed that exon 23 was deleted in ~2% of all alleles from the whole-muscle lysate. Further ddPCR studies found that 59% of mRNA transcripts reflected the deletion.
“There’s an overarching idea that the genome-editing field is moving extremely quickly, and for good reason,” asserted Jennifer Berman, Ph.D., staff scientist, Bio-Rad Laboratories. “There’s a lot of exciting work to be done, but detection and quantification of edits can be a bottleneck for researchers.”
The gene-editing field is moving quickly, and new innovations are finding their way into the laboratory as researchers lay the foundation for precise, well-controlled gene editing with CRISPR.
Researchers utilized a systems biology approach to develop new methods to assess drug sensitivity in cells. [The Institute for Systems Biology]
Understanding how cells respond and proliferate in the presence of anticancer compounds has been the foundation of drug discovery ideology for decades. Now, a new study from scientists at Vanderbilt University casts significant suspicion on the primary method used to test compounds for anticancer activity in cells, raising doubts about the methods employed by the entire scientific enterprise and pharmaceutical industry to discover new cancer drugs.
“More than 90% of candidate cancer drugs fail in late-stage clinical trials, costing hundreds of millions of dollars,” explained co-senior author Vito Quaranta, M.D., director of the Quantitative Systems Biology Center at Vanderbilt. “The flawed in vitro drug discovery metric may not be the only responsible factor, but it may be worth pursuing an estimate of its impact.”
The Vanderbilt investigators have developed what they believe to be a new metric for evaluating a compound’s effect on cell proliferation—called the DIP (drug-induced proliferation) rate—that overcomes the flawed bias in the traditional method.
The findings from this study were published recently in Nature Methods in an article entitled “An Unbiased Metric of Antiproliferative Drug Effect In Vitro.”
For more than three decades, researchers have evaluated the ability of a compound to kill cells by adding the compound in vitro and counting how many cells are alive after 72 hours. Yet, proliferation assays that measure cell number at a single time point don’t take into account the bias introduced by exponential cell proliferation, even in the presence of the drug.
“Cells are not uniform, they all proliferate exponentially, but at different rates,” Dr. Quaranta noted. “At 72 hours, some cells will have doubled three times and others will not have doubled at all.”
Dr. Quaranta added that drugs don’t all behave the same way on every cell line—for example, a drug might have an immediate effect on one cell line and a delayed effect on another.
The research team decided to take a systems biology approach, a mixture of experimentation and mathematical modeling, to demonstrate the time-dependent bias in static proliferation assays and to develop the time-independent DIP rate metric.
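The time-dependent bias can be illustrated with a toy model in Python. Assuming simple exponential growth with constant rates (an idealization, and not the authors' full model), the treated-to-control cell-count ratio measured at a single time point keeps changing with the chosen assay time even though the drug's effect on the proliferation rate is fixed. The rates and function name below are hypothetical.

```python
def static_ratio(control_rate: float, treated_rate: float, hours: float) -> float:
    """Treated/control cell-count ratio at one time point.

    Rates are in population doublings per hour. With exponential growth,
    N(t) = N0 * 2**(rate * t), so the ratio is 2**((treated - control) * t).
    """
    return 2.0 ** ((treated_rate - control_rate) * hours)

if __name__ == "__main__":
    control = 1 / 24  # hypothetical: untreated cells double every 24 h
    treated = 1 / 48  # hypothetical: treated cells double every 48 h
    for t in (24, 72, 120):
        print(t, round(static_ratio(control, treated, t), 3))
```

The same drug effect therefore looks weaker at 24 hours than at 72 or 120 hours, which is the kind of bias a rate-based metric is meant to remove.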
“Systems biology is what really makes the difference here,” Dr. Quaranta remarked. “It’s about understanding cells, and life, as dynamic systems.”
This new study is of particular importance in light of recent international efforts to generate data sets that include the responses of thousands of cell lines to hundreds of compounds. Using the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) databases will allow drug discovery scientists to include drug response data along with genomic and proteomic data that detail each cell line’s molecular makeup.
“The idea is to look for statistical correlations—these particular cell lines with this particular makeup are sensitive to these types of compounds—to use these large databases as discovery tools for new therapeutic targets in cancer,” Dr. Quaranta stated. “If the metric by which you’ve evaluated the drug sensitivity of the cells is wrong, your statistical correlations are basically no good.”
The Vanderbilt team evaluated the responses from four different melanoma cell lines to the drug vemurafenib, currently used to treat melanoma, with the standard metric—used for the CCLE and GDSC databases—and with the DIP rate. In one cell line, they found a glaring disagreement between the two metrics.
“The static metric says that the cell line is very sensitive to vemurafenib. However, our analysis shows this is not the case,” said co-lead study author Leonard Harris, Ph.D., a systems biology postdoctoral fellow at Vanderbilt. “A brief period of drug sensitivity, quickly followed by rebound, fools the static metric, but not the DIP rate.”
Dr. Quaranta added that the findings “suggest we should expect melanoma tumors treated with this drug to come back, and that’s what has happened, puzzling investigators. DIP rate analyses may help solve this conundrum, leading to better treatment strategies.”
The researchers noted that using the DIP rate is possible because of advances in automation, robotics, microscopy, and image processing. Moreover, the DIP rate metric offers another advantage—it can reveal which drugs are truly cytotoxic (cell killing), rather than merely cytostatic (cell growth inhibiting). Although cytostatic drugs may initially have promising therapeutic effects, they may leave tumor cells alive that then have the potential to cause the cancer to recur.
The Vanderbilt team is currently in the process of identifying commercial entities that can further refine the software and make it widely available to the research community to inform drug discovery.
An unbiased metric of antiproliferative drug effect in vitro
In vitro cell proliferation assays are widely used in pharmacology, molecular biology, and drug discovery. Using theoretical modeling and experimentation, we show that current metrics of antiproliferative small molecule effect suffer from time-dependent bias, leading to inaccurate assessments of parameters such as drug potency and efficacy. We propose the drug-induced proliferation (DIP) rate, the slope of the line on a plot of cell population doublings versus time, as an alternative, time-independent metric.
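The definition above, the slope of population doublings versus time, translates directly into a short Python sketch. It converts cell counts to doublings relative to the first measurement and fits a least-squares line. The optional `skip_before_h` argument, which drops early time points before the drug effect has stabilized, is an assumption added for illustration, as are the function names and counts; this is not the authors' software.

```python
import math
from typing import List, Sequence

def population_doublings(counts: Sequence[float]) -> List[float]:
    """Doublings relative to the first count: log2(N(t) / N(0))."""
    n0 = counts[0]
    return [math.log2(n / n0) for n in counts]

def dip_rate(times_h: Sequence[float], counts: Sequence[float],
             skip_before_h: float = 0.0) -> float:
    """Least-squares slope of doublings versus time (doublings per hour).

    Points earlier than skip_before_h are ignored (illustrative assumption).
    """
    points = [(t, d) for t, d in zip(times_h, population_doublings(counts))
              if t >= skip_before_h]
    if len(points) < 2:
        raise ValueError("need at least two time points to fit a slope")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_d = sum(d for _, d in points) / len(points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in points)
    return sxy / sxx

if __name__ == "__main__":
    hours = [0, 24, 48, 72, 96]
    # Hypothetical counts: brief decline, then rebound under drug.
    counts = [1000, 700, 600, 850, 1200]
    print(dip_rate(hours, counts))                    # whole time course
    print(dip_rate(hours, counts, skip_before_h=48))  # after the rebound begins
```

A negative slope indicates net cell loss (cytotoxic), a slope near zero indicates stasis (cytostatic), and a positive slope indicates continued proliferation despite the drug.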
Researchers develop a technique to direct chromosome recombination with CRISPR/Cas9, allowing high-resolution genetic mapping of phenotypic traits in yeast.
Researchers used CRISPR/Cas9 to make a targeted double-strand break (DSB) in one arm of a yeast chromosome labeled with a green fluorescent protein (GFP) gene. A within-cell mechanism called homologous repair (HR) mends the broken arm using its homolog, resulting in a recombined region from the site of the break to the chromosome tip. When this cell divides by mitosis, each daughter cell will contain a homozygous section in an outcome known as “loss of heterozygosity” (LOH). One of the daughter cells is detectable because, due to complete loss of the GFP gene, it will no longer be fluorescent. [Reprinted with permission from M.J. Sadhu et al., Science]
When mapping phenotypic traits to specific loci, scientists typically rely on the natural recombination of chromosomes during meiotic cell division in order to infer the positions of responsible genes. But recombination events vary with species and chromosome region, giving researchers little control over which areas of the genome are shuffled. Now, a team at the University of California, Los Angeles (UCLA), has found a way around these problems by using CRISPR/Cas9 to direct targeted recombination events during mitotic cell division in yeast. The team described its technique today (May 5) in Science.
“Current methods rely on events that happen naturally during meiosis,” explained study coauthor Leonid Kruglyak of UCLA. “Whatever rate those events occur at, you’re kind of stuck with. Our idea was that using CRISPR, we can generate those events at will, exactly where we want them, in large numbers, and in a way that’s easy for us to pull out the cells in which they happened.”
Generally, researchers use coinheritance of a trait of interest with specific genetic markers—whose positions are known—to figure out what part of the genome is responsible for a given phenotype. But the procedure often requires impractically large numbers of progeny or generations to observe the few cases in which coinheritance happens to be disrupted informatively. What’s more, the resolution of mapping is limited by the length of the smallest sequence shuffled by recombination—and that sequence could include several genes or gene variants.
“Once you get down to that minimal region, you’re done,” said Kruglyak. “You need to switch to other methods to test every gene and every variant in that region, and that can be anywhere from challenging to impossible.”
But programmable, DNA-cutting champion CRISPR/Cas9 offered an alternative. During mitotic—rather than meiotic—cell division, rare, double-strand breaks in one arm of a chromosome preparing to split are sometimes repaired by a mechanism called homologous recombination. This mechanism uses the other chromosome in the homologous pair to replace the sequence from the break down to the end of the broken arm. Normally, such mitotic recombination happens so rarely as to be impractical for mapping purposes. With CRISPR/Cas9, however, the researchers found that they could direct double-strand breaks to any locus along a chromosome of interest (provided it was heterozygous—to ensure that only one of the chromosomes would be cut), thus controlling the sites of recombination.
Combining this technique with a signal of recombination success, such as a green fluorescent protein (GFP) gene at the tip of one chromosome in the pair, allowed the researchers to pick out cells in which recombination had occurred: if the technique failed, both daughter cells produced by mitotic division would be heterozygous, with one copy of the signal gene each. But if it succeeded, one cell would end up with two copies, and the other cell with none—an outcome called loss of heterozygosity.
“If we get loss of heterozygosity . . . half the cells derived after that loss of heterozygosity event won’t have GFP anymore,” study coauthor Meru Sadhu of UCLA explained. “We search for these cells that don’t have GFP out of the general population of cells.” If these non-fluorescent cells with loss of heterozygosity have the same phenotype as the parent for a trait of interest, then CRISPR/Cas9-targeted recombination missed the responsible gene. If the phenotype is affected, however, then the trait must be linked to a locus in the recombined, now-homozygous region, somewhere between the cut site and the GFP gene.
By systematically making cuts using CRISPR/Cas9 along chromosomes in a hybrid, diploid strain of Saccharomyces cerevisiae yeast, picking out non-fluorescent cells, and then observing the phenotype, the UCLA team demonstrated that it could rapidly identify the phenotypic contribution of specific gene variants. “We can simply walk along the chromosome and at every [variant] position we can ask, does it matter for the trait we’re studying?” explained Kruglyak.
For example, the team showed that manganese sensitivity—a well-defined phenotypic trait in lab yeast—could be pinpointed using this method to a single nucleotide polymorphism (SNP) in a gene encoding the Pmr1 protein (a manganese transporter).
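The interval logic described above can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis pipeline. It assumes a single causal variant on the arm, coordinates in base pairs measured from the centromere side toward the GFP-marked tip, and a hypothetical panel of (cut position, phenotype changed?) observations. The function name `localize_causal_interval` and all numbers are invented for the example.

```python
def localize_causal_interval(panel, arm_length):
    """Narrow the interval that must contain a single causal variant.

    panel: iterable of (cut_position, phenotype_changed) pairs, one per
           LOH line. Positions run from 0 (centromere side) to
           arm_length (GFP-marked tip). Hypothetical data format.
    """
    lower, upper = 0, arm_length
    for cut, phenotype_changed in panel:
        if phenotype_changed:
            # LOH covers the region from the cut to the tip, so the
            # variant lies at or beyond this cut site.
            lower = max(lower, cut)
        else:
            # The now-homozygous region missed the variant, so it lies
            # between the centromere side and this cut site.
            upper = min(upper, cut)
    if lower > upper:
        raise ValueError("observations are inconsistent with a single causal variant")
    return lower, upper


# Hypothetical panel of four LOH lines on a 600 kb arm.
panel = [(100_000, True), (300_000, True), (400_000, False), (500_000, False)]
print(localize_causal_interval(panel, 600_000))  # (300000, 400000)
```

Adding more cut sites inside the returned interval and repeating the call mirrors the "walk along the chromosome" idea: each new line either raises the lower bound or lowers the upper bound.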
Jason Moffat, a molecular geneticist at the University of Toronto who was not involved in the work, told The Scientist that researchers had “dreamed about” exploiting these sorts of mechanisms for mapping purposes, but without CRISPR, such techniques were previously out of reach. Until now, “it hasn’t been so easy to actually make double-stranded breaks on one copy of a pair of chromosomes, and then follow loss of heterozygosity in mitosis,” he said, adding that he hopes to see the approach translated into human cell lines.
Applying the technique beyond yeast will be important, agreed cell and developmental biologist Ethan Bier of the University of California, San Diego, because chromosomal repair varies among organisms. “In yeast, they absolutely demonstrate the power of [this method],” he said. “We’ll just have to see how the technology develops in other systems that are going to be far less suited to the technology than yeast. . . . I would like to see it implemented in another system to show that they can get the same oomph out of it in, say, mammalian somatic cells.”
Kruglyak told The Scientist that work in higher organisms, though planned, is still in early stages; currently, his team is working to apply the technique to map loci responsible for trait differences between—rather than within—yeast species.
“We have a much poorer understanding of the differences across species,” Sadhu explained. “Except for a few specific examples, we’re pretty much in the dark there.”
Linkage and association studies have mapped thousands of genomic regions that contribute to phenotypic variation, but narrowing these regions to the underlying causal genes and variants has proven much more challenging. Resolution of genetic mapping is limited by the recombination rate. We developed a method that uses CRISPR to build mapping panels with targeted recombination events. We tested the method by generating a panel with recombination events spaced along a yeast chromosome arm, mapping trait variation, and then targeting a high density of recombination events to the region of interest. Using this approach, we fine-mapped manganese sensitivity to a single polymorphism in the transporter Pmr1. Targeting recombination events to regions of interest allows us to rapidly and systematically identify causal variants underlying trait differences.
Thank you, David, for the kind words and comments. We agree that the most immediate applications of the CRISPR-based recombination mapping will be in unicellular organisms and cell culture. We also think the method holds a lot of promise for research in multicellular organisms, although we did not mean to imply that it “will be an efficient mapping method for all multicellular organisms”. Every organism will have its own set of constraints as well as experimental tools that will be relevant when adapting a new technique. To best help experts working on these organisms, here are our thoughts on your questions.
You asked about mutagenesis during recombination. We Sanger sequenced 72 of our LOH lines at the recombination site and did not observe any mutations, as described in the supplementary materials. We expect the absence of mutagenesis is because we targeted heterozygous sites where the untargeted allele did not have a usable PAM site; thus, following LOH, the targeted site is no longer present and cutting stops. In your experiments you targeted sites that were homozygous; thus, following recombination, the CRISPR target site persisted, and continued cutting ultimately led to repair by NHEJ and mutagenesis.
As to the more general question of the optimal mapping strategies in different organisms, they will depend on the ease of generating and screening for editing events, the cost and logistics of maintaining and typing many lines, and generation time, among other factors. It sounds like in Drosophila today, your related approach of generating markers with CRISPR, and then enriching for natural recombination events that separate them, is preferable. In yeast, we’ve found the opposite to be the case. As you note, even in Drosophila, our approach may be preferable for regions with low or highly non-uniform recombination rates.
Finally, mapping in sterile interspecies hybrids should be straightforward for unicellular hybrids (of which there are many examples) and for cells cultured from hybrid animals or plants. For studies in hybrid multicellular organisms, we agree that driving mitotic recombination in the early embryo may be the most promising approach. Chimeric individuals with mitotic clones will be sufficient for many traits. Depending on the system, it may in fact be possible to generate diploid individuals with uniform LOH genotype, but this is certainly beyond the scope of our paper. The calculation of the number of lines assumes that the mapping is done in a single step; as you note in your earlier comment, mapping sequentially can reduce this number dramatically.
This is a lovely method and should find wide applicability in many settings, especially for microorganisms and cell lines. However, it is not clear that this approach will be, as implied by the discussion, an efficient mapping method for all multicellular organisms. I have performed similar experiments in Drosophila, focused on meiotic recombination, on a much smaller scale, and found that CRISPR-Cas9 can indeed generate targeted recombination at gRNA target sites. In every case I tested, I found that the recombination event was associated with a deletion at the gRNA site, which is probably unimportant for most mapping efforts, but may be a concern in some specific cases, for example for clinical applications. It would be interesting to know how often mutations occurred at the targeted gRNA site in this study.
The wider issue, however, is whether CRISPR-mediated recombination will be more efficient than other methods of mapping. After careful consideration of all the costs and the time involved in each of the steps for Drosophila, we have decided that targeted meiotic recombination using flanking visible markers will be, in most cases, considerably more efficient than CRISPR-mediated recombination. This is mainly due to the large expense of injecting embryos and the extensive effort and time required to screen injected animals for appropriate events. It is both cheaper and faster to generate markers (with CRISPR) and then perform a large meiotic recombination mapping experiment than it would be to generate the lines required for CRISPR-mediated recombination mapping. It is possible to dramatically reduce costs by, for example, mapping sequentially at finer resolution. But this approach would require much more time than marker-assisted mapping. If someone develops a rapid and cheap method of reliably introducing DNA into Drosophila embryos, then this calculus might change.
However, it is possible to imagine situations where CRISPR-mediated mapping would be preferable, even for Drosophila. For example, some genomic regions display extremely low or highly non-uniform recombination rates. It is possible that CRISPR-mediated mapping could provide a reasonable approach to fine mapping genes in these regions.
The authors also propose the exciting possibility that CRISPR-mediated loss of heterozygosity could be used to map traits in sterile species hybrids. It is not entirely obvious to me how this experiment would proceed and I hope the authors can illuminate me. If we imagine driving a recombination event in the early embryo (with maternal Cas9 from one parent and gRNA from a second parent), then at best we would end up with chimeric individuals carrying mitotic clones. I don’t think one could generate diploid animals where all cells carried the same loss of heterozygosity event. Even if we could, this experiment would require construction of a substantial number of stable transgenic lines expressing gRNAs. Mapping an ~20Mbp chromosome arm to ~10kb would require on the order of two-thousand transgenic lines. Not an undertaking to be taken lightly. It is already possible to perform similar tests (hemizygosity tests) using D. melanogaster deficiency lines in crosses with D. simulans, so perhaps CRISPR-mediated LOH could complement these deficiency screens for fine mapping efforts. But, at the moment, it is not clear to me how to do the experiment.
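The line-count estimate in this comment, and the savings from sequential mapping mentioned in both the comment and the authors' reply, can be checked with a short calculation. The sketch below is an assumption-laden illustration: it supposes evenly spaced cut sites, one transgenic line per cut site, and that each round of `cuts_per_round` cuts narrows the candidate interval by a factor of `cuts_per_round + 1`. The function names and the choice of nine cuts per round are hypothetical.

```python
def single_step_lines(arm_bp, resolution_bp):
    """Lines needed to tile the whole arm at the target resolution in one step."""
    return -(-arm_bp // resolution_bp)  # ceiling division on integers


def sequential_plan(arm_bp, resolution_bp, cuts_per_round):
    """Rounds and total lines when each round subdivides the current interval.

    Assumes evenly spaced cuts, so every round shrinks the candidate
    interval by a factor of (cuts_per_round + 1).
    """
    interval = arm_bp
    rounds = 0
    while interval > resolution_bp:
        interval = interval / (cuts_per_round + 1)
        rounds += 1
    return rounds, rounds * cuts_per_round


# Figures from the comment: a ~20 Mbp arm mapped to ~10 kb.
print(single_step_lines(20_000_000, 10_000))   # 2000
# Hypothetical sequential design with 9 cuts per round.
print(sequential_plan(20_000_000, 10_000, 9))  # (4, 36)
```

Under these assumptions the sequential design needs far fewer lines but four successive rounds, which is the time-versus-cost trade-off raised in the comment.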
Scientists build a living cellular organism with a genome smaller than any known in nature.
By Ruth Williams | March 24, 2016
By stripping down the genome of a mycoplasma bacterium to the minimal genes required for life, Craig Venter and colleagues have created a new organism with the smallest genome of any known cellular life form. The work, published in Science today (March 24), is the closest scientists have come to creating a cell in which every gene and protein is fully understood, but they are not quite there yet.
“In biology, as we’ve been trying to do genetic and biological engineering, we’re frustrated by the fact that . . . evolution has given us a real mess—it’s really just bubble gum and sticks, piecing together whatever works,” said biomedical engineer Chris Voigt of MIT who was not involved in the study. “This [work] is one of the first attempts at a grand scale to go in and try to clean up some of the mess . . . so that we can better understand the genetics.”
The quest to synthesize a minimal genome with only the essential genes for life is one researchers at the J. Craig Venter Institute (JCVI) in San Diego have been doggedly pursuing for the better part of two decades. Clyde Hutchison, an investigator at JCVI and lead author of the new study, explained the motivation: “We want to understand at a mechanistic level how a living cell grows and divides,” he told The Scientist, and yet, “there is no cell that exists where the function of every gene is known.” Possession of such fundamental knowledge, he added, would also put researchers “in a better position to engineer cells to make specific products,” like pharmaceuticals.
The team’s starting point was the bacterium Mycoplasma genitalium, which has the smallest known genome of any living cell with just 525 genes. However, it also has a very slow growth rate, making it difficult to work with. To practice synthesizing genomes and building new organisms, the team therefore turned to M. genitalium’s cousins, M. mycoides and M. capricolum, which have bigger genomes and faster growth rates. In 2010, Venter’s team successfully synthesized a version of the M. mycoides genome (JCVI-syn1.0) and placed it into the cell of a M. capricolum that had had its own genome removed. This was the first cell to contain a fully synthetic genome capable of supporting replicative life.
With the genome synthesis and transfer skills mastered, the next step was to make the genome smaller, explained Hutchison. One approach would be to delete the genes one by one and see which the cells could live without. But “we thought we knew enough, that it would be that much faster to design the genome, build it, and install it in a cell,” said Hutchison. The problem was, “we weren’t completely right about that,” he said. “It took quite a bit longer than we thought.”
Using JCVI-syn1.0 as their starting material, the researchers initially designed a minimal genome based on information from the literature and from mutagenesis studies that suggested which genes were likely essential. They divided this genome into eight overlapping segments and tested each one in combination with the complementary seven-eighths of the standard JCVI-syn1.0 genome. All but one of the designed segments failed to sustain viable cells.
Going back to the drawing board, the team decided to perform mutagenesis experiments on JCVI-syn1.0 to determine, categorically, which genes were required for life. Their experiments revealed that the genes fell into three groups: essential, nonessential, and quasiessential—those that aren’t strictly required, but without which growth is severely impaired. The failure to include these quasiessential genes in the initial design explained in large part why it had failed, explained Hutchison. “The concept of a minimal genome seems simple, but when you get into it, it’s a little more complicated,” he said. “There’s a trade-off between genome size and growth rate.”
Equipped with this knowledge, the team redesigned, synthesized, and tested new genome segments retaining the quasiessential genes. Three iterative cycles of testing later, the team had a genome that successfully supported life.
“This is a really pioneering next step in the use of synthetic biology,” said Leroy Hood, president of the Institute for Systems Biology in Seattle who also did not participate in the research.
Ultimately the team removed 428 genes from the JCVI-syn1.0 genome to create JCVI-syn3.0 with 473 genes (438 protein-coding genes and 35 RNA genes)—considerably fewer than the 525 genes of M. genitalium. Interestingly, the functions of around one-third of the genes (149) remain unknown. “I was surprised it was that high,” said Hood, “but I also think we kid ourselves about how much we know about the genomes of organisms. There’s still an enormous amount of dark matter.”
Some of these genes of unknown function appear to be conserved in higher eukaryotes, said Hutchison. “Those, in a way, are the most exciting,” he said, “because they might represent some new undescribed function that has spread through other life forms.”
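The gene counts quoted above can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers given in the article. The implied gene count for JCVI-syn1.0 is derived here (retained genes plus removed genes) and is not stated in the text, and the variable names are illustrative.

```python
# Figures quoted in the article for JCVI-syn3.0.
protein_coding = 438
rna_genes = 35
removed_from_syn1 = 428
unknown_function = 149
m_genitalium_genes = 525

syn3_total = protein_coding + rna_genes
# Derived, not stated in the article: genes retained plus genes removed.
syn1_implied = syn3_total + removed_from_syn1
unknown_fraction = unknown_function / syn3_total

print(syn3_total)                       # 473
print(syn1_implied)                     # 901
print(f"{unknown_fraction:.1%}")        # 31.5%
print(m_genitalium_genes - syn3_total)  # 52
```

The 31.5% figure matches the article's "around one-third" of genes with unknown function, and the final line shows that JCVI-syn3.0 has 52 fewer genes than M. genitalium.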
C.A. Hutchison III et al., “Design and synthesis of a minimal bacterial genome,” Science, 351: 1414, 2016.
Design and synthesis of a minimal bacterial genome
A goal in biology is to understand the molecular and biological function of every gene in a cell. One way to approach this is to build a minimal genome that includes only the genes essential for life. In 2010, a 1079-kb genome based on the genome of Mycoplasma mycoides (JCVI-syn1.0) was chemically synthesized and supported cell growth when transplanted into cytoplasm. Hutchison III et al. used a design, build, and test cycle to reduce this genome to 531 kb (473 genes). The resulting JCVI-syn3.0 retains genes involved in key processes such as transcription and translation, but also contains 149 genes of unknown function.
INTRODUCTION In 1984, the simplest cells capable of autonomous growth, the mycoplasmas, were proposed as models for understanding the basic principles of life. In 1995, we reported the first complete cellular genome sequences (Haemophilus influenzae, 1815 genes, and Mycoplasma genitalium, 525 genes). Comparison of these sequences revealed a conserved core of about 250 essential genes, much smaller than either genome. In 1999, we introduced the method of global transposon mutagenesis and experimentally demonstrated that M. genitalium contains many genes that are nonessential for growth in the laboratory, even though it has the smallest genome known for an autonomously replicating cell found in nature. This implied that it should be possible to produce a minimal cell that is simpler than any natural one. Whole genomes can now be built from chemically synthesized oligonucleotides and brought to life by installation into a receptive cellular environment. We have applied whole-genome design and synthesis to the problem of minimizing a cellular genome.
RATIONALE Since the first genome sequences, there has been much work in many bacterial models to identify nonessential genes and define core sets of conserved genetic functions, using the methods of comparative genomics. Often, more than one gene product can perform a particular essential function. In such cases, neither gene will be essential, and neither will necessarily be conserved. Consequently, these approaches cannot, by themselves, identify a set of genes that is sufficient to constitute a viable genome. We set out to define a minimal cellular genome experimentally by designing and building one, then testing it for viability. Our goal is a cell so simple that we can determine the molecular and biological function of every gene.
RESULTS Whole-genome design and synthesis were used to minimize the 1079–kilobase pair (kbp) synthetic genome of M. mycoides JCVI-syn1.0. An initial design, based on collective knowledge of molecular biology in combination with limited transposon mutagenesis data, failed to produce a viable cell. Improved transposon mutagenesis methods revealed a class of quasi-essential genes that are needed for robust growth, explaining the failure of our initial design. Three more cycles of design, synthesis, and testing, with retention of quasi-essential genes, produced JCVI-syn3.0 (531 kbp, 473 genes). Its genome is smaller than that of any autonomously replicating cell found in nature. JCVI-syn3.0 has a doubling time of ~180 min, produces colonies that are morphologically similar to those of JCVI-syn1.0, and appears to be polymorphic when examined microscopically.
CONCLUSION The minimal cell concept appears simple at first glance but becomes more complex upon close inspection. In addition to essential and nonessential genes, there are many quasi-essential genes, which are not absolutely critical for viability but are nevertheless required for robust growth. Consequently, during the process of genome minimization, there is a trade-off between genome size and growth rate. JCVI-syn3.0 is a working approximation of a minimal cellular genome, a compromise between small genome size and a workable growth rate for an experimental organism. It retains almost all the genes that are involved in the synthesis and processing of macromolecules. Unexpectedly, it also contains 149 genes with unknown biological functions, suggesting the presence of undiscovered functions that are essential for life. JCVI-syn3.0 is a versatile platform for investigating the core functions of life and for exploring whole-genome design.
Four design-build-test cycles produced JCVI-syn3.0.
(A) The cycle for genome design, building by means of synthesis and cloning in yeast, and testing for viability by means of genome transplantation. After each cycle, gene essentiality is reevaluated by global transposon mutagenesis. (B) Comparison of JCVI-syn1.0 (outer blue circle) with JCVI-syn3.0 (inner red circle), showing the division of each into eight segments. The red bars inside the outer circle indicate regions that are retained in JCVI-syn3.0. (C) A cluster of JCVI-syn3.0 cells, showing spherical structures of varying sizes (scale bar, 200 nm).
Abstract
We used whole-genome design and complete chemical synthesis to minimize the 1079–kilobase pair synthetic genome of Mycoplasma mycoides JCVI-syn1.0. An initial design, based on collective knowledge of molecular biology combined with limited transposon mutagenesis data, failed to produce a viable cell. Improved transposon mutagenesis methods revealed a class of quasi-essential genes that are needed for robust growth, explaining the failure of our initial design. Three cycles of design, synthesis, and testing, with retention of quasi-essential genes, produced JCVI-syn3.0 (531 kilobase pairs, 473 genes), which has a genome smaller than that of any autonomously replicating cell found in nature. JCVI-syn3.0 retains almost all genes involved in the synthesis and processing of macromolecules. Unexpectedly, it also contains 149 genes with unknown biological functions. JCVI-syn3.0 is a versatile platform for investigating the core functions of life and for exploring whole-genome design.
War on Cancer Needs to Refocus to Stay Ahead of Disease Says Cancer Expert
Writer, Curator: Stephen J. Williams, Ph.D.
Article ID #171: War on Cancer Needs to Refocus to Stay Ahead of Disease Says Cancer Expert. Published on 3/27/2015
WordCloud Image Produced by Adam Tubman
UPDATED 1/08/2020
Is one of the world’s most prominent cancer researchers throwing in the towel on the War On Cancer? Not throwing in the towel, just reminding us that cancer is more complex than just a genetic disease, and in the process, giving kudos to those researchers who focus on non-genetic aspects of the disease (see Dr. Larry Bernstein’s article Is the Warburg Effect the Cause or the Effect of Cancer: A 21st Century View?).
National Public Radio (NPR) has been conducting an interview series with Robert A. Weinberg, Ph.D., MIT cancer biology pioneer, founding member of the Whitehead Institute for Biomedical Research, member of the National Academy of Sciences, and National Medal of Science awardee. Dr. Weinberg co-discovered one of the first human oncogenes (Ras)[1], isolated the first tumor suppressor (Rb)[2], and, with Dr. Bill Hahn, first proved that cells could become tumorigenic after discrete genetic lesions[3]. In the latest NPR piece, Why The War On Cancer Hasn’t Been Won (seen on NPR’s blog by Richard Harris), Dr. Weinberg discusses a comment in an essay he wrote in the journal Cell[4]: basically that, in recent years, cancer research may have focused too much on the genetic basis of cancer at the expense of the multifaceted etiology of cancer, including the roles of metabolism, immunity, and physiology.
Cancer is the second leading cause of medically related deaths in the developed world. However, concerted efforts among most developed nations to eradicate the disease, such as increased government funding for cancer research and a mandated ‘war on cancer’ in the 1970s, have translated into remarkable improvements in diagnosis, early detection, and survival rates for many individual cancers. For example, survival rates for breast and colon cancer have improved dramatically over the last 40 years. In the UK, overall median survival times have improved from one year in 1972 to 5.8 years for patients diagnosed in 2007. In the US, overall 5-year survival improved from 50% for all adult cancers and 62% for childhood cancers in 1972 to 68% and 82%, respectively, in 2007. However, for some cancers, including lung, brain, pancreatic and ovarian cancer, there has been little improvement in survival rates since the “war on cancer” started.
As Weinberg said, in the 1950s, medical researchers saw cancer as “an extremely complicated process that needed to be described in hundreds, if not thousands of different ways.” Then scientists tried to find a unifying principle, first focusing on viruses as the cause of cancer (for example, Rous sarcoma virus; see also Dr. Gallo’s book on his early research on cancer, virology, and HIV, Virus Hunting: AIDS, Cancer & the Human Retrovirus: A Story of Scientific Discovery).
However (as the blog article goes on) “that idea was replaced by the notion that cancer is all about wayward genes.”
“The thought, at least in the early 1980s, was that there were a small number of these mutant, cancer-causing oncogenes, and therefore that one could understand a whole disparate group of cancers simply by studying these mutant genes that seemed to be present in many of them,” Weinberg says. “And this gave the notion, the illusion over the ensuing years, that we would be able to understand the laws of cancer formation the way we understand, with some simplicity, the laws of physics, for example.”
According to Weinberg, this gene-directed unifying theory has given way as recent evidence points back once again to a multi-faceted view of cancer etiology.
But this is not a revolutionary or conflicting idea for Dr. Weinberg, a recipient of the 2007 Otto Warburg Medal whose latest research focuses on complex systems such as angiogenesis, cell migration, and epithelial-stromal interactions.
In fact, it was Dr. Weinberg and Dr. Douglas Hanahan who formulated eight governing principles, or Hallmarks, of cancer:
Maintaining Proliferative Signals
Avoiding Immune Destruction
Evading Growth Suppressors
Resisting Cell Death
Becoming Immortal
Angiogenesis
Deregulating Cellular Energy
Activating Invasion and Metastasis
Taken together, these hallmarks represent the common features that tumors have, and may involve genetic or non-genetic (epigenetic) lesions … a multi-modal view of cancer that spans time and disciplines. As reviewed by both Dr. Larry Bernstein and me in the e-book Volume One: Cancer Biology and Genomics for Disease Diagnosis, each scientific discipline, whether that of the pharmacologist, toxicologist, virologist, molecular biologist, physiologist, or cell biologist, has contributed greatly to our total understanding of this disease, each from its own unique perspective. This leads to a “multi-modal” view of cancer etiology, diagnosis, and treatment. Many of the improvements in survival rates are a direct result of the massive increase in the knowledge of tumor biology obtained through ardent basic research. Breakthrough discoveries regarding oncogenes, cancer cell signaling, survival, and regulated death mechanisms, tumor immunology, genetics and molecular biology, biomarker research, and now nanotechnology and imaging have directly led to the advances we now see in early detection, chemotherapy, and personalized medicine, as well as new therapeutic modalities such as cancer vaccines, immunotherapies, and combination chemotherapies. Molecular and personalized therapies such as trastuzumab and aromatase inhibitors for breast cancer, imatinib for CML and GIST, and bevacizumab for advanced colorectal cancer have been a direct result of molecular discoveries into the nature of cancer. This then leads to an interesting question (one to be tackled in another post):
Would shifting focus away from the cancer genome and back to cancer biology limit the progress we’ve made in personalized medicine?
In a 2012 post, Genomics And Targets For The Treatment Of Cancer: Is Our New World Turning Into “Pharmageddon” Or Are We On The Threshold Of Great Discoveries?, Dr. Leonard Lichtenfeld, MD, Deputy Chief Medical Officer for the ACS, comments on how genomics and personalized strategies are changing oncology drug development. As he notes, in the past, chemotherapy development was sort of ‘hit or miss’, and the dream and promise of genomics suggested an era of targeted therapy, where drug development was more ‘rational’ and targets were easily identifiable.
To quote his post:
“That was the dream, and there have been some successes–even apparent cures or long term control–with the use of targeted medicines with biologic drugs such as Gleevec®, Herceptin® and Avastin®. But I think it is fair to say that the progress and the impact hasn’t been quite what we thought it would be. Cancer has proven a wily foe, and every time we get answers to questions what we usually get are more questions that need more answers. The complexity of the cancer cell is enormous, and its adaptability and the genetic heterogeneity of even primary cancers (as recently reported in a research paper in the New England Journal of Medicine) has been surprising, if not (realistically) unexpected.”
In addition, Dr. Lichtenfeld makes some interesting observations including:
A “pharmageddon” where drug development risks/costs exceed the reward, so drug developers keep their ‘wallets shut’. For example, even for targeted therapies it takes $12 billion US to develop a drug versus $2 billion years ago
Drugs are still drugs and failure in clinical trials is still a huge risk
“Eroom’s Law” (like “Moore’s Law” but opposite effect) – increasing costs with decreasing success
Limited market for drugs targeted to a select mutant; what he called “slice and dice”
Andrea Califano, PhD – Precision medicine predictions are based on statistical associations, whereas systems biology predictions are based on a physical regulatory model
Spyro Mousses, PhD – open biomedical knowledge and private patient data should be combined to form systems oncology clearinghouse to form evolving network, linking drugs, genomic data, and evolving multiscalar models
Razelle Kurzrock, MD – What if every patient with metastatic disease is genomically unique? Problem with model of smaller trials (so-called N=1 studies) of genetically similar disease: drugs may not be easily acquired or re-purposed, and greater regulatory burdens
So, discoveries of oncogenes, tumor suppressors, mutant variants, high-end sequencing, and the genomics and bioinformatic era may have led to the advent of targeted chemotherapies for genetically well-defined patient populations, a different focus in chemotherapy development
… but as long as we have the conversation open I have no fear of myopia within the field, and multiple viewpoints on origins and therapeutic strategies will continue to develop for years to come.
References
Parada LF, Tabin CJ, Shih C, Weinberg RA: Human EJ bladder carcinoma oncogene is homologue of Harvey sarcoma virus ras gene. Nature 1982, 297(5866):474-478.
Friend SH, Bernards R, Rogelj S, Weinberg RA, Rapaport JM, Albert DM, Dryja TP: A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma. Nature 1986, 323(6089):643-646.
Hahn WC, Counter CM, Lundberg AS, Beijersbergen RL, Brooks MW, Weinberg RA: Creation of human tumour cells with defined genetic elements. Nature 1999, 400(6743):464-468.
Weinberg RA: Coming full circle-from endless complexity to simplicity and back again. Cell 2014, 157(1):267-271.
The cancer death rate in the United States fell 2.2 percent in 2017 — the biggest single-year drop ever reported — propelled by gains against lung cancer, the American Cancer Society said Wednesday.
Declines in the mortality rate for lung cancer have accelerated in recent years in response to new treatments and falling smoking rates, said Rebecca Siegel, lead author of Cancer Statistics 2020, the latest edition of the organization’s annual report on cancer trends.
The improvement in 2017, the most recent year for which data is available, is part of a long-term drop in cancer mortality that reflects, to a large extent, the smoking downturn. Since peaking in 1991, the cancer death rate has fallen 29 percent, which translates into 2.9 million fewer deaths.
Norman “Ned” Sharpless, director of the National Cancer Institute, which was not involved in the report, said the data reinforces that “we are making steady progress” on cancer. For lung cancer, he pointed to new immunotherapy treatments and so-called targeted therapies that stop the action of molecules key to cancer growth. He predicted that the mortality rate would continue to fall “as we get better at using these therapies.” Multiple clinical trials are exploring how to combine the new approaches with older ones, such as chemotherapy.
Sharpless expressed concern, however, that progress against cancer would be undermined by increased obesity, which is a risk factor for several malignancies.
The cancer society report projected 1.8 million new cases of cancer in the United States this year and more than 606,000 deaths. Nationally, cancer is the second-leading cause of death after heart disease in both men and women. It is the No. 1 cause in many states, and among Hispanic and Asian Americans and people younger than 80, the report said.
The cancer death rate is defined as deaths per 100,000 people. The cancer society has been reporting the rate since 1930.
Because lung cancer is the leading cause of cancer deaths, accounting for 1 in 4, any change in the mortality rate has a large effect on the overall cancer death rate, Siegel noted.
She described the gains against lung cancer, and against another often deadly cancer, melanoma, as “exciting.” But, she added, “the news this year is mixed” because of slower progress against colorectal, breast and prostate cancers. Those cancers often can be detected early by screening, she said.
The report said substantial racial and geographic disparities remain for highly preventable cancers, such as cervical cancer, and called for “the equitable application” of cancer control measures.
In recent years, melanoma has shown the biggest mortality-rate drop of any cancer. That’s largely a result of breakthrough treatments such as immunotherapy, which unleashes the patient’s own immune system to fight the cancer and was approved for advanced melanoma in 2011.
Other posts on this site on The War on Cancer and Origins of Cancer include:
This diagram shows the chromosomes of Drosophila melanogaster approximately to scale. Chromosome sizes were based on basepair lengths given on the NCBI map viewer, and A. B. Carvalho, 2002. Curr. Op. Genet. & Devel. 12:664-668. Centimorgan distances were derived from selected loci listed in the NCBI website. (credit Wikipedia)
Introduction
Generally speaking, sexually reproducing species are composed of individuals of two complementary mating types or sexes. An essential aspect of the developmental history of each individual is thus sex determination and differentiation. In Drosophila melanogaster there are two sex determination mechanisms, somatic and germline, both based on a chromosomal signal. In the somatic sex determination mechanism, each individual assesses the ratio of X-chromosomes to autosomal chromosome sets; this X:A ratio provides the primary sex-determining signal (reviewed by Cline and Meyer, 1996). When X:A=1, female differentiation ensues (Bridges, 1925); when X:A=0.5, male differentiation ensues, along with the male mode of X-chromosome dosage compensation. The X:A ratio is calculated within each cell of the developing embryo, 2 hrs after fertilization. The X:A ratio determines the sex in Drosophila (Bridges, 1916, 1921, 1925) in a somatic-cell-autonomous manner that occurs early in embryonic development (Baker and Belote, 1983; Baker, 1989). Females possess two X-chromosomes, and males possess one X-chromosome and one Y-chromosome. The Y-chromosome is required only for spermatogenesis (Lindsley and Tokuyasu 1980; Bridges 1986), and will not be considered further. The number of X-chromosomes is counted through a mechanism involving positive-acting X-chromosome-encoded transcription factors, termed X-numerator elements (Cline, 1988), negative-acting autosome-encoded transcription factors or denominators, and signal transduction factors provided maternally. Among the X-numerators are sisterless-a, sisterless-b (sis-b), sisterless-c, and runt (Schupbach, 1985; Cline, 1986, 1988; Steinmann-Zwicky et al., 1989; Parkhurst et al., 1990; Ericson and Cline, 1991, 1993; Estes, 1995; Hoshijima et al., 1995; reviewed by Cline, 1993).
The best candidate for a denominator gene is the deadpan (dpn) locus. Both daughterless (da) and extramacrochaetae (emc) fulfill the role of maternally contributed transduction loci (Cline, 1976; Cronmiller et al., 1988). Both in vitro biochemical evidence and in vivo genetic evidence support the idea that transcription factors of the basic-helix-loop-helix (bHLH) family are able to form homo- and hetero-dimers; thus the X:A ratio counting mechanism seems to involve the relative affinities and chromosome-dependent stoichiometries of the bHLH proteins SIS-B, DA, EMC, and DPN. When X:A=1, sufficient SIS-B protein is synthesized that it can effectively compete with the EMC and DPN proteins for binding to DA protein. DA:SIS-B heterodimers then bind to the so-called establishment promoter (Pe) elements of the Sxl gene and activate its transcription, resulting in an early burst of SXL protein that sets splicing and dosage compensation into female-specific modes. When X:A=0.5, too little SIS-B is produced, and DA protein remains sequestered with EMC and DPN. The Sxl Pe remains inactive, and splicing and dosage compensation enter male-specific modes. In response to an X:A ratio of 1, an embryo-specific promoter of the gene called Sex-lethal (Sxl) is activated (Keyes et al., 1992).
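The outcome of the counting mechanism described above can be restated as a toy decision rule. The sketch below is only a bookkeeping aid and does not model the competition among the bHLH proteins; the function names are hypothetical, and any X:A ratio other than the two cases discussed in the text (1 and 0.5) is deliberately left unclassified.

```python
from fractions import Fraction
from typing import Optional


def xa_ratio(n_x: int, n_autosome_sets: int) -> Fraction:
    """Ratio of X-chromosomes to autosomal chromosome sets."""
    if n_x < 0 or n_autosome_sets <= 0:
        raise ValueError("need n_x >= 0 and n_autosome_sets > 0")
    return Fraction(n_x, n_autosome_sets)


def sxl_pe_state(n_x: int, n_autosome_sets: int) -> Optional[bool]:
    """True if Sxl Pe is expected to fire (X:A = 1), False if not (X:A = 0.5).

    Returns None for every other ratio, because the text describes only
    these two cases (an assumption of this sketch, not a biological claim).
    """
    ratio = xa_ratio(n_x, n_autosome_sets)
    if ratio == 1:
        return True
    if ratio == Fraction(1, 2):
        return False
    return None


def somatic_mode(n_x: int, n_autosome_sets: int) -> str:
    """Map the Sxl Pe state onto the splicing/dosage-compensation mode."""
    state = sxl_pe_state(n_x, n_autosome_sets)
    if state is None:
        return "not covered"
    return "female" if state else "male"
```

For example, `somatic_mode(2, 2)` corresponds to the XX;AA case and `somatic_mode(1, 2)` to the X(Y);AA case.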
The Sxl protein, which acts as the master regulator of somatic sex determination, has three somatic functions. First, Sxl protein carries out autoregulation at the level of pre-mRNA splicing. Second, Sxl controls female-specific differentiation at the level of pre-mRNA splicing and polyadenylation of at least two genes that code for transcription factors that effect terminal differentiation. Third, Sxl protein negatively regulates X-chromosome dosage compensation. It does so in two ways: by alternative RNA splicing of a normally male-specific gene, and by translation-level regulation of many X-chromosomal transcripts during embryogenesis. In the male, with Sxl in the off state, male differentiation occurs because tra is in the off state and therefore the differentiation-effector transcription factors are produced in alternative male-specific modes. Dosage compensation is active, and the male X-chromosome is decorated by a minimum of four proteins and two RNA molecules that form a complex along the entire chromosome (reviewed by Cline and Meyer, 1996). Transcription of the male X-chromosome is elevated two-fold, so that it produces the same amount of RNA as the two X-chromosomes found in females.
The germline pathway for sex determination and dosage compensation differs from the somatic sex determination mechanism (Figure 1).
Figure 1: Sex determination of D. melanogaster (1998)
The vast majority of somatic sex determination loci have no function in germline cells. For example, none of the X-chromosome numerators is required for proper oogenesis (Granadino et al., 1989, 1992; Steinmann-Zwicky 1991), despite the fact that proper oogenesis requires that X:A=1 in the germline (Schupbach, 1982, 1985); nor are tra, tra-2, and dsxF required for oogenesis. Sxl and snf have germline functions, but the former is not a binary switch gene between oogenesis and spermatogenesis (Despande et al., 1996; Bopp et al., 1993, 1995; Hager et al., 1997). Systematic screens for female-sterile mutations have identified a large number of genes required for normal oogenesis (e.g. Gans et al., 1975; Mohler, 1977; Perrimon et al., 1986; Schupbach and Wieschaus, 1989, 1991). Female sterility can arise in diverse ways, but one interesting class of mutations is germline-dependent and causes an “ovarian tumor” phenotype. “Ovarian tumor” mutations cause under-developed ovaries, in which egg chambers and ovarioles are filled with an excess of undifferentiated germ cells that have adopted male-like characteristics, including a prominent spherical nucleus, assembly of mitochondria around the nucleus, and mis-expression of male-specific marker genes (Oliver et al., 1988, 1990, 1993; Steinmann-Zwicky, 1988, 1992; Bopp et al., 1993; Pauli et al., Wei et al., 1994). Among the “ovarian tumor” class of genes are ovo, ovarian tumor (otu), fused, and two genes with somatic phenotypes, namely snf and Sxl.
Strong mutations at the ovo and otu loci result in ovaries totally devoid of germ cells (King and Killey, 1982; Busson et al., 1983; Oliver et al., 1987; Mevel-Ninio et al., 1989; Rodesh et al., 1995). Weaker mutations at both loci result in viable germline cells that have abnormal male-like splicing at the Sxl gene (Oliver et al., 1993). The overall conclusion is that oogenesis requires a chromosomally female germline that is wild type for ovo, otu, Sxl, and snf. If one of these genes is defective, either the germline will die or male-like differentiation and tumor formation ensue.
However, normal sex determination also depends on soma-germline interactions (Figure 2).
Figure 2: Somatic-Germline Interactions. (1998)
Unlike the somatic regulatory hierarchy, which genetic mosaic experiments clearly showed to function in a cell-autonomous fashion, sexual differentiation of the germline requires inductive signaling from somatic cells. This was shown by pole cell transplantation, a method of making mosaics in which germline cells are surgically transferred from donor embryos (Schupbach, 1985; Steinmann-Zwicky et al., 1989). These experiments show that proper germline differentiation requires a combination of germline-autonomous chromosomal cues and proper signaling from the soma. Evidence with tra and dsx mutant somatic hosts indicates that these soma-germline interactions have detectable effects by larval stages (Steinmann-Zwicky, 1996).
The ovo gene is genetically complex. At least three transcripts are produced from the ovo region (Mevel-Ninio et al., 1991, 1995, 1996; Garfinkel et al., 1992, 1994). Two of these are germline-specific and correspond to the ovo function, while the third corresponds to the somatic-epidermal, non-sex-specific shavenbaby (svb) function. (For a schematic of the gene map, please refer to Figure 3.)
The ovo function is transcribed from two closely spaced germline-specific promoters, ovoa and ovob, which give rise to 5-kb mRNAs (Mevel-Ninio et al., 1991, 1995; Garfinkel et al., 1992, 1994). The first promoter identified was ovob (Garfinkel et al., 1994); the leader exon it forms is called Exon 1b, and the resulting mRNA has a 1028-codon open reading frame that contains four Cys2-His2 zinc fingers at the carboxy terminus and predicts a protein of 110.6 kD. A second germline promoter, ovoa, was identified by Mevel-Ninio et al. (1995); its open reading frame is 1400 codons long and predicts a 150.8-kD protein. Its leader exon, Exon 1a, contains an in-frame AUG upstream of the translation start in Exon 2 utilized by the OvoB open reading frame. The OvoB mRNA isoform is predominant during adult life, with the OvoA isoform appearing only during Stage 14 of oogenesis (Mevel-Ninio et al., 1991, 1996; Garfinkel, 1994). The ovo zinc finger domain binds to its own germline promoter regions and to the otu promoter region (Garfinkel et al., 1997; Lee, 1998; Lee and Garfinkel 1998). This is consistent with ovo playing an important role in a sex determination hierarchy operating in germline cells that involves these other genes. The svb function is transcribed from an incompletely characterized somatic promoter that forms a 7.1-kb poly(A)+ mRNA (Garfinkel et al., 1994). This transcript accumulates 9-12 hr post-fertilization, in the somatic tissues that later in embryogenesis form the cuticular structures affected by svb mutations. Wieschaus et al. (1984) observed that ventral denticle belts and dorsal hairs are defective in svb mutants, hence the name; svb mutations are polyphasic larval lethals. Exons and exon segments that are found in all mRNA forms coded by the region correspond to genomic DNA where so-called svb-ovo- mutations map (Mevel-Ninio et al., 1989; Garfinkel 1992). Finally, somatic-specific exons, exon segments, and transcriptional regions correspond to regions mutable to the svb- ovo- phenotype.
Since all known mRNA forms utilize the same splice junctions to join Exon 3 to Exon 4, all protein forms coded by the locus are believed to contain the same four zinc fingers at the carboxy terminus. A wide variety of evidence points to ovo playing a critical role in germline sex determination. High-level ovo transcription in germline cells, as detected by X-gal staining of ovo promoter-lacZ constructs, requires that the cells have a female karyotype (Oliver et al., 1994). Chromosomally male germline cells have low levels of ovo transcription even if the soma is transformed towards female through the use of hs-traF cDNA minigenes. Likewise, chromosomally female germline cells have high levels of ovo transcription even if the soma is anatomically male through the action of tra loss-of-function mutations. This argues that high-level ovo transcription is a germline X:A ratio-autonomous property, and stands in contrast to related experiments with otu. In the case of otu, there is evidence that chromosomally male germline cells, which normally have no need of otu+ function at all, require otu+ for proliferation when they are in a female host (Nagoshi et al., 1995). The D. melanogaster ovo gene is required for cell viability and differentiation of female germ cells, apparently playing a role in germline sex determination, and a female X:A ratio in germline cells is required for high-level activity of the ovo germline promoters. Therefore we undertook to identify trans-acting regulatory regions of the X-chromosome, with a particular interest in identifying candidate germline X-chromosome numerator elements. In this study, I screened the X-chromosome using 45 deficiency strains and found that the trans-regulating regions could be grouped into 12 loci based on overlapping cytology. Five regions were trans-regulating activators, and seven were trans-regulating repressors; extrapolating to the entire genome, this result predicts nearly 85 loci.
A subset of the dozen X-chromosomal regions correlated with previously identified E(ovoD) and Su(ovoD) loci (Pauli et al., 1995).
Materials and Methods
Fly Strains and Growth. Flies were maintained on standard yeast/cornmeal medium and kept at 25°C or 18°C unless otherwise indicated. Mutants are described in Lindsley and Zimm (1992). The ovo3U21 and ovo4B8 strains were obtained from Brian Oliver of NIH; OvoD1rS1 FM3 is from the Garfinkel lab collection. The remaining stocks were obtained from the Bloomington Stock Center (see Table 2.1 for the list of stocks used and Figure 2.1 for their location on the X chromosome).
Outcrosses. Outcrosses were designed to create transgenic flies so that the X chromosome could be screened for trans-regulators of ovo in the germline. Virgin female flies were collected within 14-hour windows at 18°C or 8-hour windows at 25°C, during which newly emerged males remain immature. Collected females were kept for 3-5 days to confirm that they were virgin before outcrossing. Heterozygous virgin females (5-7) carrying deficiency X-chromosomes balanced over first-chromosome balancers were mated with males homozygous for either of two P-element transformation constructs of a lacZ reporter gene fused to the ovo promoter. Both insertions are on the third chromosome. Crosses were grown at 25°C unless otherwise noted. The control class of F1 progeny has a complete X-chromosome pair, whereas the experimental class has one complete and one deficient X chromosome. The ovo::lacZ constructs were designed by Oliver et al. (1994). In this study two of their strains, ovo4B8 (pCOW+1.9) and ovo3U21 (pCOW-2.1), were used to determine ovo promoter activity.
Outcrosses to Remove Duplications. Several X-chromosome deficiencies in the Bloomington collection are carried in males, with compensatory duplications of X material on an autosome. These had to be crossed to eliminate the duplications (Fig 2.4). This was done as follows: FM3/FM7a virgin females were mated to Df/Y; Dp males. Among the F1 progeny, half of the Df/(FM3 or FM7a) daughters carry the unwanted duplication, and half are free of it. In some cases, presence of the duplication could be determined from the females’ phenotypes. In other cases, up to twenty individual virgin Df/(FM3 or FM7a) F1 females were backcrossed to FM7a/Y males to establish stocks. In the F2, absence of the duplication could be established by examining sons; in all cases, the Df is male-lethal unless “rescued” by the duplication. FM3 is itself male-lethal. Thus, single-female stocks that produce only FM7a sons had the desired genotype and were kept for experiments.
X-Gal Staining. In this assay, ovaries from two-day-old adults were dissected in Drosophila Ringer’s solution (182 mM KCl, 46 mM NaCl, 3 mM CaCl2, 10 mM Tris-HCl, pH 6.8). The tissues were then transferred to a microtiter plate and fixed in 1% glutaraldehyde, 50 mM Na-cacodylate solution for 15 minutes. After being rinsed three times for 5 minutes each in staining buffer (7.2 mM Na2HPO4, 2.8 mM NaH2PO4, 1.0 mM MgCl2, 0.15 mM NaCl), they were transferred to incubation buffer (staining buffer, 5 mM K4Fe(CN)6, 5 mM K3Fe(CN)6, 0.2% X-Gal) for an hour at 37°C. Next, tissues were washed three times for 5 minutes each in washing buffer, which is PBS (130 mM NaCl, 7 mM Na2HPO4*2H2O, 3 mM NaH2PO4*2H2O, pH 7.0) with 1 mM EDTA added. Finally, the tissues were dehydrated in ethanol solutions of increasing concentration (50%, 75%, 95%) and mounted on a slide in Permount. Preparations were examined under a compound microscope to correlate staining with gene activity. Although positive and negative controls were easy to distinguish, this assay was not sensitive enough to reveal subtle differences caused by the effects of deleted regions on ovo promoters driving lacZ.
Histochemical Assay of LacZ Activity. This method allowed us to make quantitative measurements of lacZ activity due to ovo promoter function in animals heterozygous for X-chromosome deletions. Emerging F1 flies were collected and aged for two days before their ovaries were dissected under a dissecting microscope. For each soluble assay, 10 flies were dissected. At least seven such assays (N, sample number) were completed per stock for each construct. Ovaries from ten dissected outcrossed flies were put into Eppendorf tubes containing 100 µl of Assay Buffer (50 mM K-phosphate, 1 mM MgCl2 at pH 7.8) and homogenized with about 20 strokes. For each dissected pair of ovaries, 100 µl of assay buffer was used, and the volume was brought up to the appropriate amount. After centrifuging for one minute, 20 µl of the supernatant was transferred into 980 µl of assay buffer (Simon and Lis, 1987; Ashburner, 1989) to give 2 mM chlorophenol red-beta-D-galactopyranoside (CPRG). Absorbance at 574 nm was measured at half-hour intervals from zero to two hours to follow the hydrolysis of CPRG to chlorophenol red (CPR). CPR has a molar extinction coefficient of 75,000 M-1 cm-1 (Boehringer Mannheim data sheet) and is a very easily detected product of β-galactosidase enzyme activity. Range-finding experiments showed that 2 mM CPRG gives linear data for 2-3 hours; often, color changes could be seen with the unaided eye. Two controls, shown in Figure 2.8, validate CPRG for this work. Ovaries from a non-transformed strain (y wRD) were used to prepare soluble extracts. A near-zero absorbance at 574 nm was observed that did not appreciably change over several hours. In contrast, ovarian extracts from the ovo promoter-lacZ transformant strains ovo3U21 and ovo4B8 (Oliver et al., 1994) showed a steep linear increase in A574 during the same period. The slopes of these lines were proportional to the amount of ovo3U21 and ovo4B8 extract added.
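The conversion from an A574 time course to a rate of CPR formation follows from the Beer-Lambert law, A = ε·c·l, using the extinction coefficient quoted above. The sketch below is an illustration under stated assumptions rather than the procedure actually used in this study: it assumes a 1 cm path length and a 1 ml reaction volume (the 20 + 980 dilution read as microliters), and the function names are hypothetical.

```python
EPSILON_CPR = 75_000.0  # M^-1 cm^-1, the value quoted in the text


def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with >= 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all time points are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def cpr_rate_nmol_per_hr(times_hr, a574, path_cm=1.0, volume_l=1.0e-3):
    """Rate of CPR formation in nmol per hour.

    path_cm and volume_l are assumptions of this sketch (1 cm cuvette,
    1 ml reaction), not values stated in the text.
    """
    slope = least_squares_slope(times_hr, a574)      # absorbance units per hour
    molar_per_hr = slope / (EPSILON_CPR * path_cm)   # Beer-Lambert: c = A / (eps * l)
    return molar_per_hr * volume_l * 1e9             # mol -> nmol


# Made-up readings at the half-hour intervals described in the text.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
readings = [0.00, 0.15, 0.30, 0.45, 0.60]
print(round(cpr_rate_nmol_per_hr(times, readings), 3))
```

Fitting a slope through all five time points, rather than taking a single end-point reading, uses the linearity of the reaction over the first two hours that the range-finding experiments established.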
Bradford (1976) Assay for Protein. This protein determination method is based on the binding of Coomassie Brilliant Blue G-250 to protein. The protein reagent was prepared according to Bradford (1976): 100 mg of Coomassie Brilliant Blue G-250 was dissolved in 50 ml of 95% ethanol, and then 100 ml of 85% (w/v) phosphoric acid was added. The resulting solution was diluted to a final volume of 1 liter [final concentrations in the reagent were 0.01% (w/v) Coomassie Brilliant Blue G-250, 4.7% (w/v) ethanol, and 8.5% (w/v) phosphoric acid]. For each measurement, 20 µl of soluble extract prepared from the dissected tissues was used. This volume was diluted to 0.1 ml with ddH2O, then 5 ml of protein reagent was added to the test tube and the contents were mixed. The absorbance at 595 nm was measured after 2 min and before 1 hr in 3 ml cuvettes against a reagent blank prepared from 0.1 ml of the appropriate buffer and 5 ml of protein reagent. A standard curve was constructed using known quantities of bovine serum albumin (BSA). Soluble extract absorbances were plotted on the standard curve and the protein amounts interpolated.
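Reading a protein amount off the BSA standard curve can be sketched as a piecewise-linear interpolation between neighbouring standards. This is one plausible reading of "interpolated"; a straight-line fit through all standards would be an equally reasonable choice. The function name and the standard values in the usage lines are hypothetical.

```python
def interpolate_protein_ug(a595, standards):
    """Interpolate a protein amount (ug) from a BSA standard curve.

    standards is a list of (ug_bsa, absorbance_595) pairs. The absorbance
    must fall within the range spanned by the standards; extrapolation is
    refused on purpose.
    """
    points = sorted(standards, key=lambda p: p[1])  # order by absorbance
    for (m0, a0), (m1, a1) in zip(points, points[1:]):
        if a0 <= a595 <= a1:
            if a1 == a0:
                return float(m0)
            return m0 + (m1 - m0) * (a595 - a0) / (a1 - a0)
    raise ValueError("absorbance outside the range of the standards")


# Made-up standard curve: (ug BSA, A595 against the reagent blank).
bsa_standards = [(0, 0.0), (5, 0.1), (10, 0.2), (20, 0.4)]
print(interpolate_protein_ug(0.3, bsa_standards))
```

Refusing to extrapolate keeps readings inside the range where the standards were actually measured; an extract that reads above the top standard would need to be diluted and re-assayed.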
Statistical Analysis. Average specific activity is calculated as nanomoles of substrate used per hour per nanogram of protein (nmole CPRG liberated/ng/hr). Sample number (N) always exceeded seven. Mean specific activity and standard error of the mean (SEM) were calculated for each experimental and control class. The F test was used to determine whether variances were equal and, therefore, which type of Student’s t-test calculation was appropriate. A significant difference between experimental and control values was identified by P < 0.05 for the t-test score.
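The quantities named in this paragraph can be sketched with the Python standard library alone. The code below computes the mean and SEM, the variance ratio used by the F test, and the t statistic in both its pooled (equal-variance) and Welch (unequal-variance) forms. It is a minimal illustration with hypothetical function names; converting the F and t statistics into P values still requires a distribution table or a statistics package, which this sketch does not attempt.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(N))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def variance_ratio(a, b):
    """F statistic, taken as the larger sample variance over the smaller."""
    va, vb = statistics.variance(a), statistics.variance(b)
    if min(va, vb) == 0:
        raise ValueError("a sample has zero variance")
    return max(va, vb) / min(va, vb)


def t_statistic(a, b, equal_variances):
    """Return (t, degrees of freedom) for two independent samples.

    equal_variances=True gives the pooled Student form,
    equal_variances=False gives the Welch form.
    """
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    if va == 0 and vb == 0:
        raise ValueError("both samples have zero variance")
    if equal_variances:
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se = math.sqrt(pooled * (1 / na + 1 / nb))
        df = na + nb - 2
    else:
        se = math.sqrt(va / na + vb / nb)
        df = (va / na + vb / nb) ** 2 / (
            (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
        )
    return diff / se, df
```

In use, the variance ratio would be compared against the critical F value for the two sample sizes, and the outcome of that comparison decides which value of `equal_variances` is passed to `t_statistic`.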
The results are given in three sections: X chromosome deficiency screening, negative autoregulation of ovo exhibited by deficiencies removing ovo, and gene dose analysis using P element transformants carrying extra copies of ovo.
X Chromosome Screening. The presence of polytene chromosomes in the salivary glands, which have distinctive banding patterns, allows the map positions of genes to be correlated with physical features of the chromosomes. Breakpoint locations of rearrangements and the locations of cloned sequences can be easily established. Each of the major chromosome arms is divided into 20 numbered segments, except chromosome 4, which is divided into 4 regions. Each numbered region is then divided into six consecutive lettered regions, and each lettered region into numbered bands, for example 4E1. The precise relationship between physical length and the numbering scheme depends on local topography (Lefevre, 1976). In the summary tables, each deficiency is listed according to its cytological position. The map of the X chromosome, including the deficiencies used in this study, is given in Materials and Methods (Fig 1). In Drosophila melanogaster germ cells, ovo has a primary role in female sex-specific cell viability, proliferation and differentiation. ovo responds to the number of X-chromosomes, as assessed by high-level expression (Oliver et al., 1994). Thus, the ovo promoter may be dependent upon X germline numerator elements. To identify possible trans-regulators of the ovo germline promoter (and, I hope, to identify germline numerators), I undertook a deficiency screen for quantitative effects on ovo::lacZ reporter constructs. The trans-regulation effect of any deletion was judged by two general rules. If the excised part of the X chromosome carries any genes with positive regulatory effects on ovo gene activity, then the level of lacZ reporter gene function will be reduced in experimentals compared to control siblings. If instead the experimental class shows elevated lacZ activity, producing higher levels of enzyme than controls, the deleted region is inferred to have removed a repressor locus.
Significant effects were determined by statistical analysis, using a Student’s t-test P value less than or equal to 0.05. X-chromosome screening results are presented in Tables 3.1 and 3.2. The entire X-chromosome deficiency set was tested twice: once with a 3.3-kb ovo promoter fragment driving lacZ (strain ovo3U21), and separately with a 3.1-kb ovo promoter (ovo4B8). Of 45 deficiencies, which represent about 70% of the X-chromosome, 17 had significant effects on both ovo3U21 and ovo4B8 reporter activity, 1 had a significant effect on ovo3U21 only, and 1 had a significant effect on ovo4B8 only. Some of these deficiencies partly overlap, allowing the identification of 11 regions that apparently contain trans-acting modifiers of ovo promoter activity; six are positive regulators and five are negative.
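The screening logic, reading the experimental-to-control ratio (E/C) together with the t-test P value, can be written as a small decision function. This is a hedged sketch of that logic only: the function name and return labels are hypothetical, and the P value is assumed to have been computed elsewhere.

```python
def classify_deficiency(exp_mean, ctrl_mean, p_value, alpha=0.05):
    """Return (E/C ratio, interpretation) for one deficiency.

    exp_mean and ctrl_mean are mean specific activities of Df/+ flies and
    their control siblings. A result counts as significant when
    p_value <= alpha, following the threshold stated in the text.
    """
    if ctrl_mean <= 0:
        raise ValueError("control mean must be positive")
    ratio = exp_mean / ctrl_mean
    if p_value > alpha or ratio == 1:
        return ratio, "no significant effect"
    if ratio < 1:
        # Reporter activity dropped: the deletion removed a positive regulator.
        return ratio, "putative trans-activator removed"
    # Reporter activity rose: the deletion removed a negative regulator.
    return ratio, "putative trans-repressor removed"
```

For instance, a deficiency whose Df/+ flies show half the control activity with P = 0.01 would be reported as removing a putative trans-activator.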
Region 1-4. This region is covered by eight overlapping deficiency lines: Df(1)BA1, Df(1)sc14, Df(1)64c18, Df(1)JC19, Df(1)dm75e19, Df(1)N8, Df(1)A113, and Df(1)JC70. For three of them, Df(1)A113, Df(1)JC70, and Df(1)BA1, the Student’s t-test probabilities show a significant difference between control and experimental siblings. The remaining strains have no significant trans-regulation effect on ovo gene activity. Df(1)BA1 enhanced ovo gene expression by about 20% with either ovo3U21 or ovo4B8. It has been suggested that a suppressor of ovoD (1F-2B+ locus) maps within 1E3-4 to 2B3-4 because of the dramatic gene dose effect of this region on the development of ovoD2/+ ovaries (Pauli et al., 1995). In contrast, Df(1)A113 and Df(1)JC70 were found to have repressing effects on ovo expression. Df(1)A113 (3D6-E1; 4F5), which removes several genes besides ovo, showed a very significant repression effect in outcrosses, about 82% and 47% (E/C) in ovo3U21 and ovo4B8 respectively. The data obtained in Df/+ females have a particular quantitative significance, which implies that the missing loci have the complementary effect. It has been shown that this region contains a gene or genes resulting in genetic imbalance (Cline et al., 1987). Also, Oliver et al. (1988) showed that, among the deficiency lines they used, strains removing both ovo and snf together reduce the viability of the progeny; that is, there is a synergistic interaction between ovo and snf.
Region 5-8. Twelve overlapping deletions were tested in this region. Two deletions, Df(1)N73 (5C3-5;5E-8) and Df(1)Lz90b24 (8B-D), caused very significant repressing effects, implying the presence of trans-activating loci; one deletion, Df(1)RA2 (7D10;8A4-5), resulted in heterozygous experimentals with significantly elevated lacZ compared to siblings, implying a trans-repressor locus. It has been reported that Df(1)RA2 strongly enhances ovoD phenotypes due to the function of otu+ in germline sex determination (Pauli et al., 1993). However, since otu protein is cytoplasmic, it is unlikely that the Df(1)RA2 effect on ovo::lacZ promoter activity is due to changing dosage of otu. It has also been suggested that there is a synergistic interaction between ovo and lozenge (an eye-phenotype gene), which is deleted by Df(1)Lz90b24, and here the data showed a trans-activating effect due to this deletion. The other deletions do not cause any significant effect on gene activity.
Region 9-10. Nine deficiency lines were tested in this cytological interval. Because the region is dense in putative trans-acting repressors, it was treated as a single small region. Six of the nine deficiencies showed a repressor effect: Df(1)vL15, Df(1)N110, Df(1)HC133, Df(1)vL11, Df(1)KA7, and Df(1)N71. This region seems to have a very important effect on ovo, since various levels of repressor effect are found throughout the 9B to 10F interval. Two common overlapping regions were found: one from 9C4 to 9D1-2, and the other from 10A to 10F6. The repressor effects, from strongest to weakest, were those of Df(1)vL11 (9C4;10A1-2), Df(1)HC133 (9B9-10;9E-F), Df(1)N110 (9B3-4;9D1-2), and Df(1)vL15 (9B1-2;10A1-2); the breakpoints of Df(1)KA7 (10A9;10F6-7) lie outside the first locus in the examined region. Df(1)KA7 and Df(1)vL15, the deficiencies with the longest and the shortest breakpoints, respectively, show about a 20% increase in heterozygous siblings. Three of the five repressing intervals, Df(1)vL11 (9C4;10A1-2), Df(1)HC133 (9B9-10;9E-F), and Df(1)N110 (9C4;10A1-2), are the strongest of all in Df/+ and share the region common to the five strains, 9C4;10A1-2.
Region 11-13. Eight deficiency lines were tested in this region: Df(1)JA26, Df(1)HF368, Df(1)N12, Df(1)C246, Df(1)g, Df(1)RK2, Df(1)RK4, and Df(1)sd72b. Five overlapping deletions in this region had a repressing effect on ovo gene expression. According to the common regions of their cytological positions, these overlapping deletions were grouped into three loci. These three common regions, which are responsible for trans-regulation of ovo, reside in the 11D-F, 12B-D, and 13F-14B regions of the X-chromosome. Df(1)N12 (11D12;11F1-2) and Df(1)C246 (11D-E;12A1-2) fall in the 11D-F locus, Df(1)g (12B;12E8) and Df(1)RK2 (12D2-E1;13A2-5) in the 12B-D locus, and Df(1)sd72b (13F1-14B1) in the 13F-14B locus; all of these showed repressor activity. The strongest effect in the X-chromosome screen mapped to the excised 11D1-11F1-2 region, which corresponds to the Df(1)N12 strain; it shows a significant and strong repressor effect, around 140% to 240% E/C in Df/+ flies for both ovo::lacZ constructs. In addition, it has been reported that a reduced dose of the 11D-F region results in synergistic mutant phenotypes with a number of somatic sex determination genes (Belote et al., 1985). Furthermore, FlyBase reports that this region seems to include a locus involved in early sex determination examined by Scott and Baker (1986). However, ambiguities in deficiency breakpoint assignments complicate interpretation. For example, for the first locus, which includes Df(1)N12 and Df(1)C246, uncertainty at the distal breakpoint of Df(1)C246 (11D-E;12A1-2) means that the trans-acting repressor of ovo may be located in 11E-F rather than 11D-F.
Similarly, for the second locus in this region, ambiguity at the distal breakpoint of Df(1)RK2 leaves the location of the trans-acting repressor uncertain: the region common to Df(1)g and Df(1)RK2 may lie either in 12D-E or in 12E1-12E8 of the X-chromosome. The last locus, on the other hand, was defined by only one deficiency strain. In this case, the question is whether the locus was determined accurately enough, or whether another locus that shares a common region with Df(1)sd72b (13F1-14B1) is involved in repressing ovo reporter activity. This deficiency removes several lethal mutations, Myb, sd (scalloped), shi (shibire), and exd (extradenticle). Two genes previously cloned in the 13F cytological region are the Drosophila c-myb oncogene homolog (Katzen et al., 1985) and a G protein β-subunit (Yarfitz et al., 1988). It has been suggested that the sd+ gene might be associated with more than one product (perhaps through differential processing) or that it might reflect differential tissue and/or temporal regulation (Campbell et al., 1991).
Region 14-20. Eight deficiency strains were tested in this region: Df(1)4b18, Df(1)rD1, Df(1)B, Df(1)N19, Df(1)JA27, Df(1)HF396, Df(1)DCB1, and Df(1)A-209. According to the measured specific activities, Df(1)4b18 (14B8; 14C1) and Df(1)B (15F9-16A1; 16A6-7) showed a significant activating effect on the ovo promoter; the effect of the former was weaker than that of the latter. Since these two putative trans-acting activators share no common region, the results were interpreted as defining two loci, 14B8; 14C1 and 15F-16A1; 16A6-9. In addition, the Flybase report for Df(1) shows that 70 deletion that breaks within the second exon of the nonA (no on or off transient A) gene from Stanewsky et al. (1993).
In total, the X-chromosome screen tested 45 deficiency strains and found 17 regions that trans-regulate the ovo promoter. These regions were classified into 12 loci according to their overlapping common regions. Of these, six showed a trans-acting activator effect and seven a trans-acting repressor effect on the ovo promoter. Furthermore, one deficiency strain, Df(1)sc14, showed a significant trans-acting repressor effect only in the ovo4B8 strain and not in the ovo3U21 strain. This may be explained by a position effect on the P[ovo::LacZ] construct arising from its P-element insertion site, or by the difference in size between the ovo::LacZ constructs; for example, ovo3U21 carries 200 bp more than ovo4B8 at the N-terminal end, which may yield a better translation product. Two of the deficiency lines from the X-chromosome screen, Df(1)A113 and Df(1)JC70, which remove ovo and snf along with several other genes and correspond to a single locus acting as a repressor, were therefore taken into more detailed investigation. These results suggested a negative autoregulation mechanism acting on the ovo promoter.
Therefore, negative autoregulation of ovo was examined using three approaches: ovo point mutations, more narrowly defined deficiency strains, and downstream genes.
DISCUSSION
Sex determination involves a complex set of mechanisms. The fly was chosen for study because Drosophila is inexpensive to rear, generates large numbers of progeny, and has nearly a century of accumulated data upon which to design experiments. Mutational analysis of cell biological and developmental processes is relatively simple, even if the resulting mutations are organism-lethal when homozygous. This is a decided advantage over mammalian genetics, in which animals carrying lethal mutations often die in utero, which complicates the examination and interpretation of mutant phenotypes. The Drosophila genome is one-twentieth the size of the mammalian genome, making insertional mutagenesis and positional cloning much less difficult. Additionally, mammalian genetics lacks genetic tools such as balancers, which make the maintenance of sterile and lethal mutations nearly trouble-free in Drosophila. Nematodes have many of the same conveniences as Drosophila, with the added advantage of a highly stereotyped pattern of embryonic (and post-hatching) cell lineages. The more regulative character of Drosophila development introduces complications, absent from worm genetics, in the cellular-level analysis of mutant phenotypes. Perhaps the most compelling reason to take advantage of the specialized properties of Drosophila is the extent to which prior studies have shown that genes, proteins, and developmental pathways and processes are conserved among metazoan groups. We can therefore study sex determination in Drosophila with reasonable confidence that what we learn can be extrapolated to other species, including humans and their clinical diseases.
The deletion mapping technique was used to identify the locations of genes that are required for ovo trans-regulation. Each deficiency line removes several to many genes from the genome. A sufficiently complete set of overlapping deletions can, in principle, allow every individual trans-acting gene to be localized. Seventeen deficiencies that have effects on the ovo germline promoters are shown in Table 4.1. Twelve deficiencies showed repressor effects, and five deficiencies showed activator effects. Deleted regions may affect any of several processes, such as numerator elements, cell viability and differentiation, dosage compensation, and response to inductive signals from the soma. Determining which gene within a specific region is responsible for the effect on ovo requires more narrowly defined deletions or null alleles for each gene.
Estimation of the Number of Trans-Regulators. Among the seventeen deficiencies in Table 4.1, overlapping common regions identify seven that function as trans-acting repressor loci, and five that function as trans-acting activator loci. Thus, the entire euchromatic X-chromosome may have as many as ≈10 repressor genes and ≈7 activator genes for the ovo germline promoters. If these results are extrapolated to the entire fly genome, ≈50 repressors and ≈35 activators of ovo transcription are predicted. These are likely underestimates, since a given deleted common region may remove more than one relevant gene. Is it reasonable for nearly 85 genes to be involved in regulating the ovo germline promoters? Precedents from other developmental control systems suggest this is not an implausibly high number.
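The extrapolation in the preceding paragraph is a proportional scaling, and it can be written out as a short calculation. This is only a sketch: the function name is illustrative, and the scale factor of 5 is an assumption inferred from the ratio of the quoted genome-wide and X-chromosome figures (50/10), not a value stated as such in the study.

```python
def extrapolate_to_genome(x_repressors, x_activators, scale):
    """Scale X-chromosome estimates to the whole genome by a single factor."""
    return x_repressors * scale, x_activators * scale


# Approximate X-chromosome estimates quoted in the text.
X_REPRESSORS = 10
X_ACTIVATORS = 7

# Assumption: factor inferred from the text's own numbers (50 / 10 = 5).
GENOME_SCALE = 5

repressors, activators = extrapolate_to_genome(X_REPRESSORS, X_ACTIVATORS, GENOME_SCALE)
total = repressors + activators
print(repressors, activators, total)  # 50 35 85
```

The total of 85 is the figure discussed in the text; a different assumed scale factor would shift all three numbers proportionally.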
Regulation of the master sex determination gene Sxl is complex. To establish somatic sex determination in the early embryo, nine genes are required to activate the Sxl early promoter. These are sis-a, sis-b, sis-c, run, da, emc, gro, dpn, and her. In biochemical terms, most are DNA-binding proteins. In genetic terms, some are positive and others are negative regulators. Maintenance of Sxl expression involves positive autoregulation at the level of pre-mRNA alternative splicing. At least five genes are known to play specific roles in this process: Sxl itself, snf, vir, her, and fl(2)d. Function of Sxl in the germline is regulated in several ways. Germline-specific transcriptional control of Sxl is still conjectural, but it is clear that the somatically functioning numerator elements play no role in the germline. It is possible that ovo plays an important role in germline transcriptional control of Sxl (e.g., Lee, 1998); certainly it has an indirect role (e.g., Oliver et al., 1993). Splicing-level autoregulation of Sxl is active in the female germline, and it involves the same genes that function in this process in somatic cells. Once Sxl protein is produced in female germline cells, the otu protein plays an important role in its relocalization into the nucleus. Thus, a minimum of sixteen genes is required for proper regulation of Sxl.
Establishment of the body plan in Drosophila is also under complex transcriptional control. Maternally localized RNA and protein molecules establish the gross body axes: anterior-posterior and dorsal-ventral. Hierarchically organized sets of zygotically activated genes are transcribed, and their protein products serve to refine the body axes into progressively finer-grained structures. The metameric anterior-posterior body axis is specified by so-called gap genes, pair rule genes, and segment polarity genes, which create the segment-sized repeating units of the body. Homeotic genes encoded by the Antennapedia Complex (ANT-C) and bithorax Complex (BX-C) then confer position-specific identities upon each segment. During the cellular blastoderm stage, gap genes and maternal coordinate genes regulate the activation of primary pair rule genes such as even-skipped (eve). These are expressed in seven one-segment-wide stripes that alternate with one-segment-wide regions of non-expressing cells. For example, the second stripe of eve expression is positively regulated by hunchback and bicoid, and negatively regulated by giant and Kruppel. All four proteins bind directly to a 500-bp-long “eve-stripe 2 enhancer.” Binding of giant and Kruppel is competitive with binding of hunchback and bicoid, and vice versa. Thus, spatially controlled concentrations of giant, Kruppel, bicoid, and hunchback proteins result in spatially restricted activation or repression of the eve stripe 2 enhancer. The remaining six stripes of eve expression are similarly controlled by other DNA-binding proteins, which act on other discrete stripe-specific enhancers. Ectopic expression of homeotic genes can have disastrous effects on development. Thus, a special heterochromatin-like mechanism functions to ensure that ANT-C and BX-C genes are inactive in cells and tissues that do not require their expression. Stable repression is mediated by the Polycomb class of proteins, which number over forty.
Each of these examples illustrates that developmental control of individual gene transcription is mediated by both positive and negative effectors, and that the number of such upstream regulators can range from one to several dozen. Thus, our estimate of 85 regulators of the ovo germline promoters is not out of line with other developmentally regulated systems.
Evaluation of Candidate Loci Within Common Regions. Based on overlapping cytology, the seventeen deficiencies that affected the ovo germline promoter fell into twelve common regions. Each of these is discussed in turn below. Of particular interest was the relationship each of our trans-acting loci may have with the Su(ovoD) and E(ovoD) loci identified in a genetic screen by Pauli et al. (1995). In general, it is not straightforward to suggest identities between Su(ovoD) or E(ovoD) loci and our trans-acting repressor or activator loci, because the means of assaying these gene-dose-sensitive interactions differ. We use quantitative measures of LacZ reporter activity as a proxy for ovo transcription, while Pauli et al. (1995) use semi-quantitative measures of vitellogenesis.
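Grouping deficiencies by overlapping cytology amounts to intersecting intervals along the chromosome. The sketch below illustrates the idea under simplifying assumptions that are not part of the original analysis: each breakpoint is reduced to a single (division, subdivision, band) tuple, uncertain breakpoints such as 12D2-E1 or 13A2-5 are collapsed to one arbitrarily chosen end of their range, and a missing band number is written as 0. The function and variable names are hypothetical.

```python
def common_region(deficiencies):
    """Return the interval shared by all deficiencies, or None if there is none.

    Each deficiency is a (left_breakpoint, right_breakpoint) pair, and each
    breakpoint is a (division, subdivision_letter, band) tuple. Python compares
    tuples element by element, which matches cytological map order.
    """
    start = max(left for left, _ in deficiencies)
    end = min(right for _, right in deficiencies)
    return (start, end) if start <= end else None


# Breakpoints taken from the text, simplified as described in the lead-in.
DF_G = ((12, "B", 0), (12, "E", 8))     # Df(1)g    (12B; 12E8)
DF_RK2 = ((12, "D", 2), (13, "A", 5))   # Df(1)RK2  (12D2-E1; 13A2-5)
DF_4B18 = ((14, "B", 8), (14, "C", 1))  # Df(1)4b18 (14B8; 14C1)
DF_B = ((15, "F", 9), (16, "A", 7))     # Df(1)B    (15F9-16A1; 16A7)

print(common_region([DF_G, DF_RK2]))  # ((12, 'D', 2), (12, 'E', 8))
print(common_region([DF_4B18, DF_B]))  # None
```

The first result corresponds to the 12D2; 12E8 interval, and the second reflects the absence of any common region between Df(1)4b18 and Df(1)B. A fuller treatment would carry the breakpoint uncertainty through as a range rather than collapsing it.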
Region 1 (polytene bands 1A1; 2A1-4): The distal region of the X-chromosome showed a trans-regulating activator effect on the ovo promoters. This region includes the achaete-scute complex (AS-C), home of the X-chromosome numerator element sis-b (Cline, 1988; Parkhurst and Ish-Horowicz, 1990), also known as scute-T4. This numerator has no function in the female germline (Granadino et al., 1989). Pauli et al. (1995), using other deficiency strains affecting this section of the X-chromosome, identified a strong Su(ovoD) locus in the polytene region 1E3-4; 2B3-4 that may correspond to our trans-activator. Flybase indicates that this region contains over 100 genes, among them 23 unassigned open reading frames, 33 genes defined by apparent visible mutations, 53 lethal genes, and two female-sterile loci.
Region 2 (polytene bands 4C15-16; 4F15): This region includes the ovo and snf loci, and was identified by Pauli et al. (1995) as a strong E(ovoD) due to the effects of these loci. Further discussion is deferred to the section on the mechanism of ovo autoregulation, which deals with ovo negative regulation.
Region 3 (polytene bands 5C3-5; 5E8): This region has a trans-regulatory activation effect on the ovo germline promoters. Deficiency for this region showed no interaction with ovoD in the vitellogenesis assay (Pauli et al., 1995). Examination of Flybase records for this region reveals over twenty genes, and no strong candidates that may account for the interaction with the ovo promoters.
Region 4 (polytene bands 7D10; 8A4-5): Results showed that this region contains a trans-acting repressor of ovo germline promoter activity. This region was reported by Pauli et al. (1995) to contain a strong E(ovoD) locus, which was identified as the ovarian tumor gene (Pauli et al., 1993, 1995). It is virtually certain that the repressor-of-ovo is distinct from otu. First, the otu protein is cytoplasmic and plays a role in egg chamber cytoskeletal function (Nagoshi et al., 1997). Second, the ovo protein binds to the otu promoter in vitro (Garfinkel et al., 1997; Lee, 1998; Lee and Garfinkel, 1998; Lu et al., 1998). Third, under certain conditions, in vivo activity of the otu promoter is dependent upon ovo protein production (Hager and Cline, 1997; Lu et al., 1998). Examination of Flybase reveals that this region contains fifty genes mutable to lethal, visible, or female-sterile phenotypes, but none appears to be a strong candidate for the repressor-of-ovo locus.
Region 5 (polytene bands 8B5-8; 8DE): This region also has an apparent repressor of ovo germline promoter activity. Deficiency for this region showed no interaction with ovoD mutations in the Pauli et al. (1995) vitellogenesis assay. Examination of Flybase reveals that this region contains thirty genes mutable to lethal, visible, or female-sterile phenotypes. One gene stands out as a candidate for the repressor, namely lozenge (lz). This is a complex locus that is mutable to female sterility (Green and Green, 1949, 1956), and it is named for a reduced-eye, smoothened-eye mutant phenotype. Interestingly, certain ovo mutant alleles are called “lozenge-like” in recognition of a similar eye defect (Oliver et al., 1987; Mevel-Ninio et al., 1989; Garfinkel et al., 1992). The lz gene codes for a transcription factor (Dag et al., 1996).
Region 6 (polytene bands 9C4; 9D1-2): The cytological assignment of this region is based on the overlap of three deficiencies: Df(1)N110, Df(1)H133, and Df(1)vL11. Together, they mark a trans-acting repressor of ovo promoter activity. According to Pauli et al. (1995), only two of these three deficiencies behaved as if they exposed an E(ovoD) locus, while the third had no effect. In combination with positive results from other deficiencies, Pauli et al. positioned the E(ovoD) locus at cytological region 9E-F. Thus, it is again possible that the repressor-of-ovo we identified is distinct from a nearby E(ovoD) locus, and is among the half-dozen loci identified by Flybase as mapping into this interval.
Region 7 (polytene bands 10A6; 10F6-7): This region contains a trans-acting repressor of ovo promoter activity. According to Pauli et al. (1995), the defining deficiency had no significant interaction with ovoD alleles. Examination of Flybase reveals that this region includes the somatic X-chromosome numerator element sis-a, which also has no function in germline development (Granadino et al., 1989, 1990, 1997). Given the extent of this region, it is not surprising that Flybase identifies 65 genes with diverse phenotypes and biochemical roles; however, no strong candidate that may account for the repressor-of-ovo locus is apparent.
Region 8 (polytene bands 11D1-2; 11F1-2): This region contains perhaps the strongest trans-acting repressor of ovo promoter activity in the survey: deficiency-heterozygous experimentals had 2- to 2.5-fold more lacZ specific activity in their ovaries than the balancer-carrying controls. According to Pauli et al. (1995), one of the two deficiencies defining this common region showed a statistically weak enhancement of ovoD alleles, while the other had a significant Su(ovoD) phenotype. Likewise, Belote et al. (1985) and Scott and Baker (1986) reported that the same deficiency later shown to have Su(ovoD) activity also interacted with loci in the somatic sex determination pathway. It is an open question how these three results relate to one another. Among the sixteen genes that map into this region are two signal transduction loci: the Mek3 gene, a serine-threonine-specific protein kinase in the MAP kinase pathway, and a beta subunit of the heterotrimeric GTP-binding protein. A solitary female-sterile, fs(1)K4, also maps roughly into this region; it is germline-dependent and yields fragile eggs, a phenotype occasionally seen in the eggs laid by ovoD3/+ females.
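The comparison between deficiency-heterozygous experimentals and balancer-carrying controls can be expressed as an experimental/control (E/C) ratio of reporter specific activity. The sketch below assumes the simplest reading: an E/C well above 100% suggests that the deleted region carries a repressor, and one well below 100% suggests an activator. The specific activity values and the 10% tolerance are hypothetical placeholders; the study itself judged effects by statistical significance, which this sketch does not attempt to reproduce.

```python
def ec_percent(experimental, control):
    """E/C ratio of reporter specific activity, expressed in percent."""
    if control <= 0:
        raise ValueError("control specific activity must be positive")
    return 100.0 * experimental / control


def classify(ec, tolerance=10.0):
    """Label the deleted region by the direction of the reporter change.

    Assumption: the tolerance is an arbitrary stand-in for a significance test.
    """
    if ec > 100.0 + tolerance:
        return "repressor"   # losing one gene copy raised reporter activity
    if ec < 100.0 - tolerance:
        return "activator"   # losing one gene copy lowered reporter activity
    return "no effect"


# Hypothetical specific activities (arbitrary units), not measured data.
strong = ec_percent(12.0, 5.0)
weak = ec_percent(3.0, 5.0)
print(strong, classify(strong))  # 240.0 repressor
print(weak, classify(weak))      # 60.0 activator
```

An E/C of 240% corresponds to the upper end of the 2- to 2.5-fold range described for this region.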
Region 9 (polytene bands 12D2-12E1; 12E8): This region contains a trans-acting repressor of ovo promoter activity. According to Pauli et al. (1995), neither deficiency defining this common region interacted with ovoD alleles. This region contains the yolkless gene (DiMario et al., 1987), which has been cloned. It is one of 35 known genes in the region, which also include a cluster of tRNA genes, the male-germline-specific Stellate genes, and several lethal and female-sterile genes.
Region 10 (polytene bands 13F1; 14B1): This region contains a trans-acting repressor of ovo promoter activity. Again, no significant interaction with ovoD alleles was observed by Pauli et al. (1995). Podry, Katzen and others have extensively mutagenized this region because it contains shibire (the Drosophila homolog of dynamin), c-myb, another Gβ subunit, and the homeodomain protein extradenticle. Their work revealed a total of twenty lethal genes, ten apparent visibles, and over a half-dozen unassigned open reading frames.
Region 11 (polytene bands 14B8; 14C1): This region contains a trans-acting activator of ovo promoter activity. According to Pauli et al. (1995), the defining deficiency had no significant interaction with ovoD alleles. This region is surprisingly dense genetically, as it apparently contains over forty genes. Several behavioral genes coding for neuronal functions map here, including nonA, paralytic (para), and easily shocked. The nonA gene codes for an RNA-binding protein, and is mutable to a variety of phenotypes including recessive lethality, male-courtship-song abnormalities, and defective vision. The location of para (a sodium channel) is particularly intriguing, since para(ts) mutations fail to complement certain nap(ts) alleles, and nap genetically overlaps the dosage compensation gene maleless (mle). Mutations in maleless are unique among the known dosage compensation loci in having a mutant phenotype in germline clones, and they are said to suppress the female-germline lethality of ovo null mutations. The easily shocked locus codes for ethanolamine kinase, and mutations at this locus also interact with mle.
Region 12 (polytene bands 15F9-16A1; 16A7): This region contains a trans-acting activator of ovo promoter activity. According to Pauli et al. (1995), the defining deficiency had no significant interaction with ovoD alleles. Examination of Flybase reveals that this region contains at least a dozen female-sterile loci and a dozen lethal loci (including the Bar homeodomain protein gene).
There is an ambiguity in the comparison of mean activities. According to the negative autoregulation mechanism, reporter activity is supposed to decrease linearly as the copy number of ovo increases. However, the gene-dose response reached a plateau when three copies of ovo were present in the genome. Yet this also suggests that a protective mechanism exists that counts the number of ovo copies relative to the number of X chromosomes. The sex determination mechanism would therefore turn off the extra ovo in the system immediately.
Consequently, the system prevents further incorrect information from being processed, according to its default setting: if the X:A ratio equals one, the outcome is female; if not, the mechanism is switched off, leading to a male-like, sterile mode or to death at the embryonic stage. This discontinuity in the linear correlation may be due to a position effect on P[w+ ovo+].
Future Directions and Concluding Remarks
The results of this study suggest that the ovo germline promoters are regulated by a large set of upstream factors. Nearly a dozen of these map to the X-chromosome, some to regions that are well characterized genetically. Further deficiency mapping experiments, and assessment of the phenotypes of single-P insertion lines with female-sterile or perhaps lethal phenotypes, would be required to identify the relevant genes. Some regions contain candidate loci that have been cloned (e.g., lozenge); in this example, either in vitro DNA-binding experiments using Lz protein and the ovo promoter region, or a computational assessment of the likelihood that the ovo promoter contains binding sites for Lz, could be carried out. Another potential upstream factor not assessed in these experiments is the ecdysone regulatory hierarchy. The steroid ecdysone is the endocrine hormone that controls molting and metamorphosis in arthropods. It is an allosteric effector for a heterodimeric receptor of the steroid-receptor superfamily. The ovaries of adult females manufacture their own ecdysone, and the gene for the rate-limiting steroidogenic enzyme is transcribed beginning in Stage 7-8 egg chambers. This stage immediately precedes the onset of the highest level of ovo transcription (Mevel-Ninio et al., 1991; Garfinkel et al., 1994). Mutations in the E74 and E75 genes, when made homozygous in germline clones, cause arrest of oogenesis at Stage 7-8, as if egg chambers are unable to respond to endogenous ecdysone and continue differentiation.
Both E74 and E75 code for transcription factors that are induced as immediate-early primary responses to added ecdysone, both in vivo and in tissue culture assays. Thus, it is reasonable to suggest that one or both of these proteins bind to the ovo germline promoter, and that any in vivo effect on expression of the ovo::lacZ reporter could be tested using the methods established in this dissertation.
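The computational assessment of candidate binding sites mentioned for Lz, and equally applicable to E74 or E75, could begin with a simple consensus scan of both DNA strands. The sketch below is a toy illustration only. The consensus RACCRCA (a Runt-domain-type motif) is an assumed placeholder, the test sequence is invented and is not the ovo promoter, and the function names are hypothetical. A real analysis would use experimentally derived site preferences, ideally as a position weight matrix, together with the actual promoter sequence.

```python
import re

# IUPAC nucleotide codes used in the consensus, as regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(sequence, consensus):
    """List (start, strand, matched_site) for consensus matches on both strands.

    Start positions are 0-based coordinates on the forward strand. A lookahead
    is used so that overlapping matches are also reported.
    """
    sequence = sequence.upper()
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in consensus) + "))")
    n, k = len(sequence), len(consensus)
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(sequence)]
    rc = reverse_complement(sequence)
    hits += [(n - m.start() - k, "-", m.group(1)) for m in pattern.finditer(rc)]
    return sorted(hits)


# Invented test sequence with one site on each strand (not real promoter data).
TOY_SEQUENCE = "TTGACCGCATTTTTGTGGTTAA"
ASSUMED_CONSENSUS = "RACCRCA"  # assumption, see lead-in

print(find_sites(TOY_SEQUENCE, ASSUMED_CONSENSUS))
# [(2, '+', 'GACCGCA'), (13, '-', 'AACCACA')]
```

A scan of this kind only nominates candidate sites; in vitro DNA-binding experiments would still be needed to confirm them.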
Acknowledgement: This work was completed in the laboratory of Dr. Mark Garfinkel at Illinois Institute of Technology. Dr. Demet Sag initiated the project with her own ideas, was fully supported by a Turkish National Merit Fellowship, earned a NATO Advanced Science Institute Grant on Genome Structure and Functional Genomics (Elba Island, Italy), and was accepted to work with Dr. Mevel-Ninio in France through an EMBO long-term scholarship, based on the proposal she submitted on the molecular mechanism of ovo.
BIBLIOGRAPHY
Ashburner, M., Drosophila Laboratory Manual, Cold Spring Harbor Laboratory Press, pp. 317-318, 1989
Baker, B.S., “Sex in flies: the splice of life,” Nature 340, pp.521-524, 1989.
Baker, B.S. and Belote, J.M., “Sex determination and dosage compensation in Drosophila melanogaster,” Ann. Rev. of Genetics 17, pp.345-393, 1983.
Baker B.S. and Ridge, K.A., “Sex and the single cell I. On the action of major loci affecting sex determination in Drosophila melanogaster,” Genetics 94, pp.383-423, 1980.
Bell, L.R., Horabin, J.I., Schedl, P., and Cline, T.W., “Sex-lethal, a Drosophila sex determination switch gene, exhibits sex-specific RNA splicing and sequence similarity to RNA binding proteins,” Cell 55, pp.1037-1046, 1988.
Bell, L.R., Horabin, J.I., Schedl, P., and Cline, T.W., Positive autoregulation of Sex-lethal by alternative splicing maintains the female determined state in Drosophila, Cell 65, pp.229-239, 1991.
Belote, J.M., Handler, A.M., Wolfner, M.F., Livak, K.J., and Baker, B.S., Sex-specific regulation of yolk protein expression in Drosophila, Cell 40, pp.339-348, 1985.
Boehringer Mannheim Product Support, Chlorophenol Red-beta-D-Galactopyranoside sodium salt (CPRG) provided by ID#0223p, June 1986 dated information, fax received on 2/26/1996.
Bopp, D., Bell, L.R., Cline, T.W., Schedl, P., Developmental distribution of female specific Sex-lethal proteins in Drosophila melanogaster, Genes and Development 5, pp.403-415, 1991.
Bopp, D., Sex-specific control of Sex-lethal is a conserved mechanism for sex determination in the genus Drosophila, Development 122, pp.971-982, 1996.
Bopp, D., Horabin, J.I., Lersch, R.A., Cline, T.W., Schedl, P., Expression of the Sex-lethal gene is controlled at multiple levels during Drosophila oogenesis, Development 118, pp.797-812, 1993.
Bradford, M. M., A Rapid and Sensitive Method for the Quantitation of Microgram Quantities of Protein Utilizing the Principle of Protein-Dye Binding, Analytical Biochemistry 72, pp.248-254, 1976.
Bridges, C.B., Sex in relation to chromosomes, Am. Nat. 59, pp.127-137, 1925.
Bridges, C.B., Non-disjunction as proof of the chromosome theory of heredity, Genetics 1, pp.1-52, 1916.
Burtis, K.C. and Baker, B.S., Drosophila doublesex gene controls somatic sexual differentiation by producing alternatively spliced mRNAs encoded related sex-specific yolk protein gene enhancer, The EMBO J., Vol. 10, No. 9, pp.2557-2582, 1991.
Busson, D., Gans, M., Komitopoulou, and Masson, M., Genetic Analysis of three dominant female-sterile mutations located on the X-chromosome of Drosophila melanogaster, Genetics 105, pp.309-325, 1983.
Campbell, S.D., Duttoroy, A., Katzen, A.L., and Chovnick, A., Cloning and characterization of the scalloped region of Drosophila melanogaster, Genetics 127, pp.367-380, 1991.
Cline, T.W., The Drosophila sex determination signals: how do flies count two?, Ann. Rev. Genet. 30, pp.637-702, 1996.
Cline, T.W., Evidence that sisterless-a and sisterless-b are two of several discrete numerator elements of the X/A ratio sex-determination signal in Drosophila that switch Sxl between two alternative stable expression states, Genetics 119, pp.829-862, 1988.
Cline, T.W., Re-evaluation of the functional relationship in Drosophila between a small region on the X-chromosome (3E8-4F11) and the sex determination gene, Sex-lethal, Genetics 116, s:12, 1987.
Cline, T.W., A female specific lethal lesion in an X-linked positive regulator of the Drosophila sex determination gene, Sex-lethal, Genetics 113, pp.641-663, 1986.
Cline, T.W., Autoregulatory functioning of a Drosophila gene product that establishes and maintains the sexually determined state, Genetics 107, pp.231-277, 1984.
Cline, T.W., Functioning of the gene daughterless and Sex-lethal in Drosophila germ cells, Genetics 107, s16-17, 1983.
Cline, T.W., A sex specific temperature-sensitive maternal effect of the daughterless mutation of Drosophila melanogaster, Genetics 84, pp.723-742, 1976.
Cronmiller, C., Schedl, P., Cline, T.W., Molecular characterization of daughterless, a Drosophila sex determination gene with multiple roles in development, Genes Dev. 2: 155-167, 1988.
Deshpande, G., Samuels, M.E., and Schedl, P.D., Sex-lethal interacts with splicing factors in vitro and in vivo, Molecular and Cellular Biology, Vol. 16, No. 8, pp. 5036-5047, 1996.
Deshpande, G., Stukey, J., and Schedl, P., scute (sis-b) function in Drosophila sex determination, Mol. Cell. Biol. 15, pp. 4430-4440, 1995.
DiMario, P.J., and Mahowald, A.P., Female sterile (1) yolkless: A recessive female sterile mutation in Drosophila melanogaster with depressed numbers of coated pits and coated vesicles within the developing oocytes, J. Cell Biology 105: 199-206, 1987.
Erickson, J.W. and Cline, T.W., A bZIP protein, sisterless-a, collaborates with bHLH transcription factors early in Drosophila development to determine sex, Genes and Dev. 7: 1688-1702, 1993.
Erickson, J.W. and Cline, T.W., Molecular nature of the Drosophila sex determination signal and its link to neurogenesis, Science 251, pp. 1071-1074, 1991.
Estes, P.A., Keyes, L.N., and Schedl, P., Multiple response elements in the sex-lethal early promoter ensure its female-specific expression pattern, Mol Cell Biol 15, pp. 904-917, 1995.
Flickinger, T.W. and Salz, H.K., The Drosophila sex determination gene snf encodes a nuclear protein with sequence and functional similarity to the mammalian U1A snRNP, Genes and Development 8, pp. 914-925, 1994.
Gans, M., Audit, C., and Masson, M., Isolation and characterization of sex-linked female-sterile mutations in Drosophila melanogaster, Genetics 81, pp. 683-704, 1975.
Garfinkel, M.D., Lee, S., and Sigar, I., DNA-binding targets of the Drosophila melanogaster OVO protein, 38th Annual Drosophila Research Conference, Chicago, IL, 1997.
Garfinkel, M.D., Wang, J., Liang, Y., and Mahowald, A. P., Multiple products from the shavenbaby-ovo gene region of Drosophila melanogaster: relationship to genetic complexity, Molecular and Cell Biology, Vol. 14, No. 10, pp. 6809-6818, 1994.
Garfinkel, M.D., Lohe, A.H., and Mahowald, A.P., Molecular genetics of the Drosophila melanogaster ovo locus, a gene required for sex determination of germline cells, Genetics 130, pp. 791-803, 1992.
Gollin, S. M. and King, R. C., Studies on fs(1)1621, a mutation producing ovarian tumors in Drosophila melanogaster, Developmental Genetics 2, pp. 203-218, 1981.
Granadino, B., Campuzano, S., and Sanchez, L., The Drosophila melanogaster fl(2)d, a gene needed for Sex-lethal expression in Drosophila melanogaster, Genetics 130, pp. 597-612, 1990.
Granadino, B., Santamaria, P., and Sanchez, L., Sex-determination in the germ line of Drosophila melanogaster: activation of the gene Sex-lethal, Development 118, pp. 813-816, 1993.
Granadino, B., Juan, A. B. S. B, Santamaria, P., Sanchez, L., Distinct mechanisms of splicing regulation in vivo by the Drosophila protein Sex-lethal, PNAS USA, 94, pp. 7343-7348, 1997.
Hager, J.H. and Cline, T.W., Induction of female Sex-lethal RNA splicing in male germ cells: implications for Drosophila germline sex-determination, Development 124, pp. 5033-5048, 1997.
Hilfiker, A., Amrein, H., Dubendorfer, A., Schneiter, R., and Nothiger, R., The gene virilizer is required for female-specific splicing controlled by Sxl, master gene for development in Drosophila, Development 121, pp. 4017-4026, 1995.
Horabin, J.I., Bopp, D., Waterbury, J., and Schedl, P., Selection and maintenance of sexual identity in the Drosophila melanogaster, Genetics 141, pp. 1521-1565, 1995.
Horabin, J.I. and Schedl, P., Regulated splicing of the Drosophila Sex-lethal male exon involves a blockage mechanism, Mol. Cell. Biol. 13, pp. 1408-1414, 1993.
Horabin, J.I. and Schedl, P., Sex-lethal autoregulation requires multiple cis-acting elements upstream and downstream of the male exon and appears to depend largely on controlling the use of the male exon 5’ splice site, Mol. Cell. Biol. 13, pp. 7734-7746, 1993.
Hoshijima, K., Kohyama, A., Watakabe, I., Inoue, K., Sakamoto, H., and Shimura, Y., Transcriptional regulation of the Sex-lethal gene by helix-loop-helix proteins, Nucleic Acids Res. 23, pp. 3441-3448, 1995.
Inoue, K., Hoshijima, K., Sakamoto, H., and Shimura, Y., Binding of the Drosophila Sex-lethal gene product to the alternative splice site of transformer primary transcript, Nature 344, pp. 461-463, 1990.
Keyes, L.N., Cline, T.W., and Schedl, P., The primary sex-determination signal of Drosophila acts at the level of transcription, Cell, Vol. 68, pp. 933-943, 1992.
Komitopoulou, K., Gans, M., Margaritis, L.H., Kafatos, F.C., and Masson, M., Isolation and characterization of sex-linked female-sterile mutants in Drosophila melanogaster with special attention to eggshell mutants, Genetics 105: 897-921, 1983.
Lee, S., DNA binding targets of the Drosophila melanogaster OVO protein, PhD Dissertation, Illinois Institute of Technology, Chicago, IL, USA, 1998.
Lee, S. and Garfinkel, M.D., DNA-binding targets of the Drosophila melanogaster OVO protein, Nucleic Acid. Res.
Lindsley, D.L., and Zimm, G., The Genome of Drosophila melanogaster, Academic Press, San Diego, New York, 1980.
Lu, J., Andrews, J., Pauli, D., and Oliver, B., Drosophila OVO zinc finger protein regulates ovo and ovarian tumor target promoters, Dev. Genes. Evol., pp. 1-10, 1998.
Lucchesi, J.C. and Manning, E., Gene dosage compensation in Drosophila melanogaster, Adv. Genetics 24, pp. 371-429, 1987.
Madl, J.E., and Herman, R.K., Polyploids and sex determination in Caenorhabditis elegans, Genetics 93, pp. 393-402, 1979.
Mevel-Ninio, M., Mariol, M.C. and Gans, M., Mobilization of the gypsy and copia retrotransposons in Drosophila melanogaster induces reversion of the ovoD dominant female-sterile mutations: molecular analysis of revertant alleles, EMBO J. 8, pp. 1549-1558, 1989.
Mevel-Ninio, M., Terracol, R., and Kafatos, F.C., The ovo gene of Drosophila encodes a zinc finger protein required for female germ line development, EMBO J. 10, pp.2259-2266, 1991.
Mevel-Ninio, M., Guenal, I., and Limburg-Bouchen, B., Production of dominant female sterility in Drosophila melanogaster by insertion of the ovoD1 allele on autosomes: use of transformed strains to generate germline mosaics, Mechanisms of Development 45, pp. 155-162, 1994.
Mevel-Ninio, M., Terracol, R., Salles, C., Vincent, A., and Payre, F., ovo, a Drosophila gene required for ovarian development, is specifically expressed in the germline and shares most of its coding sequences with shavenbaby, a gene involved in embryo patterning, Mechanisms of Development 49, pp. 83-95, 1995.
Mevel-Ninio, M., Fouilloux, E., Guenal, I. and Vincent, A., The three dominant female-sterile mutations of the Drosophila ovo gene are point mutations that create new translation-initiator AUG codons, Development 122, pp. 4131-4138, 1996.
Mohler, J.D., Developmental genetics of the Drosophila egg. I.: Identification of 50 sex-linked cistrons with maternal effects on embryonic development, Genetics 85, pp. 259-272, 1977.
Nagai, K., Oubridge, C., Jessen, T. H., Li, J., and Evans, P.R., Crystal structure of the RNA-binding domain of the U1 small nuclear ribonucleoprotein A, Nature 348, pp. 515-520, 1990.
Nagoshi, R.N., McKeown, M., Burtis, K.C., Belote, J.M., and Baker, B., The control of alternative spilicing at genes regulating differentiation in D. melanogaster, Cell, Vol. 53, pp.229-236, 1988.
Nagoshi, R.N., and Baker, B., Regulation of sex-specific RNA splicing at the Drosophila doublesex gene: cis-acting mutations in exon sequences alter sex specific RNA splicing patterns, Genes and Development 4, pp. 89-97, 1990.
Nagoshi, R.N., Patton, J.S., Bae, E., and Geyer, P., The somatic sex determines the requirement for ovarian tumor gene activity in the proliferation of the Drosophila germline, development 121, pp.579-587, 1995.
Nothiger, R., and Steinmann-Zwicky, M., Meier-Gerschwiller, P., and Weber, T., Sex determination in the germline of Drosophila depends on genetic signals and inductive somatic factors, development 107, pp.505-518, 1989.
Oliver, B., Singer, J., Laget, V., Pennetta, G. and Pauli, D., Function of Drosophila melanogaster ovo– in germ line sex determination depend on X-Chromosome number, Development 120, pp.1-11, 1994.
Oliver, B., Kim, Y. and Baker, B., Sex-lethal, master and slave: a hierarchy of germ line sex determination in Drosophila, Development 119, pp. 897-908, 1993.
Oliver, B., Pauli, D., and Mahowald, A.P., Genetic evidence that the ovo locus is involved in Drosophila germ line sex determination, Genetics 125, pp. 535-550, 1990.
Oliver, B., Perrimon, N, an Mahowald, A.P., The ovo locus is required for sex-specific germ line maintenance in Drosophila, Genes and Development 1, pp. 913-923, 1987.
Oubridge, C., Ito, N., Evans, P.R., Teo, C.H., and Nagai, K., Crystal structure at the 1.92A resolution of the RNA binding domain of the U1A splicesomal protein completed with an RNA hairpin, Nature 372, pp.432-438, 1994.
Pauli, D., Oliver, B., and Mahowald, A.P., Identifications of regions interacting with ovo D mutations: potential new genes involved in germline sex determination in Drosophila melanogaster, Genetics 139, pp.713-732, 1995.
Pauli, D., Oliver, B., and Mahowal, A.P., The role of the ovarian tumor locus in Drosophila melanogaster germ line sex determination, Development 119, pp.123-134, 1993.
Pauli, D. and Mahowald, A.P., Germline sex determination in Drosophila melanogaster, Trends in Genetics, Vol. 6, No. 8, pp.259-264, 1990.
Parkhurst, S.M., Bopp, D., and Ish-Horowicz, X:A ratio, the primary sex determining signal in Drosophila, is transduced by helix-loop-helix proteins, Cell, Vol. 63, pp.1179-1191, 1990.
Perrimon, N., Mohler, D., Engsttrom, L., and Mahowald, A.P., X-linked female-sterile loci in Drosophila melanogaster, Genetics 113, pp.695-712, 1986.
Perrimon, N., Engstrom, L., and Mahowald, A.P., The effects of zygotic lethal mutations on female-germ-line functions in Drosophila, Developmental Biology 105, pp. 404-414, 1984.
Perrimon, N., Clonal analysis of dominant female-sterile, germline-dependent mutations in Drosophila melanogaster, Genetics 108, pp.927-939, 1984.
Perrimon, N. and Gans, M., Clonalo analysis of the tissue specificity of recessive female-sterile mutations of Drosophila melanogaster using a dominant female sterile mutation Fs(1)K1237, Developmental Biology 100, pp. 365-373, 1983.
Rodesch, C., Geyer, P.K., Patton, J.S., Bae, E., and Nagoshi, R.N., Developmental analysis of the ovarian tumor gene during Drosophila oogenesis, Genetics 141, pp.191-202, 1995.
Sag-Ozkol, D., Tekin, S., Garfinkel, M.D., Gene-dose sensitive trans-acting regulators of the Drosophila melanogaster germline promoter, 38th Annual Drosophila Research Conference, Chicago, IL, USA, 1997.
Sag-Ozkol, D., and Garfinkel, M.D., Negative autoregulation of Drosophila melanogaster female germline specific gene, ovo (in preparation).
Sag-Ozkol, D., and Garfinkel, M.D., X-chromosome screening of Drosophila melanogaster to find numerator elements of germline sex determination (in preparation).
Salz, H.K. and Flickinger, T.W., Both loss of function and gain-of-function mutations in snf define a role for snRNP proteins in regulating Sex-lethal pre-mRNA splicing in Drosophila development, Genetics 144, pp.95-108, 1996.
Salz, H.K., Maine, E.M., Keyes, L.N., Samuels, M.E., Cline, T.W., and Schedl, P., The Drosophila female-specific sex-determination gene, Sex-lethal has stage-, tissue-, and sex-specific RNAs suggesting multiple models of regulation, Genes and Development 3, pp.708-709, 1989.
Salz, H.K., Cline, T.W., and Schedl, P., Functional changes associated with structural alterations induced by mobilization of a p element inserted in the Sex-lethal gene of Drosophila, Genetics 117, pp.221-231, 1987.
Sanchez, L., Granadino, B., and Torres, M., Sex determination in Drosophila melanogaster, X-linked genes involved in the initial step of Sex-lethal activation, Developmental Genetics 15: 251-264, 1994.
Sass, G., Mohler, J.D., Walsh, R.C., Kalfayan, L.J. and Searles, L.L., Structure an the expression of hybrid dysgenesis-induced alleles of the ovarian-tumor (otu) gene in Drosophila melanogaster, Genetics 133, pp.253-263, 1993.
Sass, G., Comer, A.R. and Searles, L.L., The ovarian tumor protein isoforms of Drosophila melanogaster exhibit differences in function, expression, and localization, developmental Biology 167, pp.201-212, 1995.
Schedl, A, Ross, A., Lee, M., Engelkamp, D., Rashbass, van Heyningen, V., and Hastie, N., Influence of PAX6 gene dosage on development: over-expression causes sever eye abnormalities, Cell 86, pp.71-82, 1992.
Schupbach, T., and Wieschhaus, E., Female sterile mutations on the second chromosome of Drosophila melanogaster II mutations blocking oogenesis an altering egg morphology, Genetics 129, pp.1119-1136, 1991.
Shupbach, T., an Wieschaus, E., Female sterile mutations on the second chromosome of Drosophila melanogaster I. Maternal effect mutations, Genetics 121, pp.101-17, 1989.
Schupbach, T., Normal female germ cell differentiation requires the female X-chromosome to autosome ratio and expression of Sex-lethal in Drosophila melanogaster, Genetics 109, pp.529-548, 1985.
Simon, J.A. and Lis, J.T., A germline transformation analysis reveals flexibility in the organization of the heat-shock consensus elements, Nucleic Acids Research, Vol 15, No.7, 1987.
Staab, H., Heller, A., Steinmann-Zwicky, M., Somatic sex determining signals act on XX germ cells in Drosophila embryos, Development 122, pp.4065-4071, 1996.
Staab, H., and Steinmann-Zwicky, M., Female germ cells of Drosophila require zygotic ovo and out product for survival in larvae and pupae, Mech. Dev. 54, pp.205-210, 1995.
Stanewsky, R., Rendahl, K.G., Dill, M., and Saumweber, H., Genetic and molecular analysis of the X-chromosomal region 14B17-14C4 in Drosophila melanogaster: Loss of function in NONA, a nuclear protein common to many cell types, results in specific physiological and behavioral defects, Genetics 135, pp.419-442, 1993.
Steinman-Zwicky, M., Sex determination of the Drosophila germ line: tra and dsx control somatic inductive signals, Development 120, pp. 707-716, 1994.
Steinman-Zwicky, M., Sxl in the germline of Drosophila: A target for somatic late induction, Developmental Genetics 15, pp.265-274, 1994.
Steinman-Zwicky, M., Sex determination in Drosophila: sis-b, a major numerator element of the X:A ratio in the soma, does not contribute to the X:A ratio in germ line, Development 117, pp. 763-767, 1993.
Steinman-Zwicky, M., How do the germ cells choose their sex? Drosophila as a paradigm, Bioassays 14 (8), pp.513-518, 1992.
Steinman-Zwicky, M., Anrein, H. and Nothiger, R., Genetic control of sex determination in Drosophila, Advanced Genetics 27, pp.189-237, 1990.
Steinman-Zwicky, M., Schmid, H. and Nothiger, R., Cell-autonomous an inductive signals can determine the sex of the germ line of Drosophila by regulating the gene Sxl, Cell, Vol. 57, pp.157-166, 1989.
Steinman-Zwicky, M., Sex determination in Drosophila. The X-chromosomal gene liz is required for Sxl activity, The EMBO Journal 7, pp.3889-3898, 1988.
Steinman-Zwicky, M. and Nothiger, R., The small region on the X chromosome of Drosophila regulates a key gene that controls sex determination and dosage compensation, Cell, Vol. 42, pp.877-887, 1985.
Sosnowski, B. A., Belote, J. M. and McKeown, M., Sex specific alternative spilicing of RNA gene results from sequence-dependent splice site blockage, Cell, Vol. 3, pp.449-459, 1989.
Yarfitz, S., Provost, N. M., and Hurley, J. B., Cloning of Drosophila melanogaster guanine nucleotide regulatory protein subunit gene and characterization of its expression during development, PNAS USA 85, pp.7134-7138, 1988.
Wieschaus, E., Audit, C., and Masson, M., A clonal analysis of the rules of somatic cells and germline during oogenesis in Drosophila, Developmental Biology 88, pp.92-103, 1981.
Wieschaus, E., Nusslein-Volhard, C., an Jurgen, G., Mutations affecting the pattern of the larval cuticle in Drosophila melanogaster. Part III. Zygotic loci on the X-chromosome and fourth chromosome, Roux. Arch. Dev. Biol., 193, pp.296-307, 1984.
Track 6 focuses on how compounds (drugs) work in the body. How are they influenced by various ‘omics’? How do they vary by tissue? The practical implications of such a compound-centric approach are exciting: new targets, new screens, new markers, new understanding of drug failure mechanisms. The systems computational tool sets including multi-scale modeling, simulation, web-based platforms, etc. will be emphasized.
Cindy Crowninshield, RD, LDN, Conference Director, Cambridge Healthtech Institute
4:05 Keynote Introduction
Kevin Brode, Senior Director, Health & Life Sciences, Americas Hitachi Data Systems
» 4:15 PLENARY KEYNOTE
Do Network Pharmacologists Need Robot Chemists?
Andrew L. Hopkins, DPhil, FRSC, FSB, Division of Biological Chemistry and Drug Design, College of Life Sciences, University of Dundee
5:00 Welcome Reception in the Exhibit Hall with Poster Viewing
Welcome Reception Introduction, Sponsored by Okta
WEDNESDAY, APRIL 10
7:00 am Registration and Morning Coffee
8:00 Chairperson’s Opening Remarks
Phillips Kuhl, Co-Founder and President, Cambridge Healthtech Institute
8:05 Keynote Introduction
Sanjay Joshi, CTO, Life Sciences, EMC Isilon
» 8:15 PLENARY KEYNOTE
Atul Butte, M.D., Ph.D., Division Chief and Associate Professor, Stanford University School of Medicine; Director, Center for Pediatric Bioinformatics, Lucile Packard Children’s Hospital; Co-founder, Personalis and Numedii
8:55 Benjamin Franklin Award & Laureate Presentation
9:15 Best Practices Award Program
9:45 Coffee Break in the Exhibit Hall with Poster Viewing
PHARMACODYNAMIC MODELS
10:50 Chairperson’s Remarks
Hugo Geerts, Ph.D., CSO, Computational Neuropharmacology, In Silico Biosciences
I will describe the emergence of “systems pharmacology” as a means to guide the creation of new molecular matter, study cellular networks and their perturbation by drugs, understand pharmaco-kinetics and pharmaco-dynamics in mouse and man and design and analyze clinical trial data. The approach combines mathematical modeling with empirical measurement as a means to tackle basic and clinical problems in pharmacology. Ultimately we aim for models that describe drug responses at multiple temporal and physical scales from molecular mechanism to whole-organism physiology.
11:30 Using Quantitative Systems Pharmacology for De-Risking Projects in CNS R&D
Hugo Geerts, Ph.D., CSO, Computational Neuropharmacology, In Silico Biosciences
Quantitative Systems Pharmacology is a computer-based mechanistic modeling approach that combines physiology and the functional imaging of genetics with the pharmacology of drug-receptor interaction and is parameterized with clinical data. It is a potentially powerful tool for improving the success rate of CNS R&D projects. The presentation will include failure analyses of unsuccessful clinical trials, correct prospective identification of clinical problems that halted clinical development, and estimation of genotype effects on the pharmacodynamics of candidate drugs.
12:00 pm Systems Pharmacology Approaches to Drug Repositioning
Svetlana Bureeva, Ph.D., Director, Professional Services, Thomson Reuters, IP & Science
Drug repositioning requires advanced computational approaches and comprehensive knowledgebase information to reach success. Thomson Reuters will present on recent advances in drug repositioning approaches, their validation and performance, best practices in using systems biology content, and successful case studies.
12:30 Luncheon Presentation (Sponsorship Opportunity Available) or Lunch on Your Own
HIGH CONTENT ANALYSIS: CANCER CELL LINES
1:40 Chairperson’s Remarks
William Reinhold, Manager, Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology (LMP), National Cancer Institute (NCI)
1:45 Systems Pharmacology Using CellMiner and the NCI-60 Cancerous Cell Lines
William Reinhold, Manager, Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology (LMP), National Cancer Institute (NCI)
CellMiner is a web-based application that allows rapid access to and comparison between 20,503 compound activities and the expression levels of 26,065 genes and 360 microRNAs. Included are 102 FDA-approved drugs as well as 53 in clinical trials. The tool is designed for the non-informaticist and allows the user wide latitude in defining the question of interest. This opens the door to systems pharmacological studies for physicians, molecular biologists and others without bioinformatics expertise.
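The comparison described above can be pictured as a correlation between one compound's activity and one gene's expression across the same set of cell lines. The sketch below is only an illustration of that idea, not CellMiner's interface or its actual method; the function name and all numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five cell lines (illustrative only, not CellMiner data).
drug_activity = [0.2, 0.9, 1.4, 2.1, 2.8]    # e.g. activity z-scores
gene_expression = [5.1, 5.9, 6.8, 7.7, 8.4]  # e.g. log2 expression levels

r = pearson(drug_activity, gene_expression)
print(f"Pearson r = {r:.3f}")
```

A value of r near +1 or -1 would flag a gene whose expression tracks sensitivity to the compound; with tens of thousands of genes, any real analysis would also need to correct for multiple testing.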
2:15 Oncology Drug Combinations at Novartis
Joseph Lehár, Ph.D., Associate Director, Bioinformatics, Oncology Translational Research, Novartis; Adjunct Assistant Professor, Bioinformatics, Boston University
Novartis is undertaking a large-scale effort to comprehensively describe cancer through the lens of cell cultures and tissue samples. In collaboration with academic and industrial partners, we have generated mutation status, gene copy number, and gene expression data for a library of 1,000 cancer cell lines, representing most cancer lineages and common genetic backgrounds. Most of these cell lines have been tested for chemosensitivity against ~1,200 cancer-relevant compounds, and we are systematically exploring drug combinations for synergy against ~100 prioritized CCLE lines. We expect this large-scale campaign to enable efficient patient selection for clinical trials on existing cancer drugs, reveal many therapeutically promising drug synergies or anti-resistance combinations, and provide unprecedented detail on functional interactions between cancer signaling pathways. I will discuss early highlights of this work and describe our plans to make use of this resource.
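The abstract does not say how synergy is scored. One common reference model, assumed here purely for illustration, is Bliss independence: if two drugs act independently, the expected fractional effect of the combination is a + b - ab, and an observed effect above that expectation suggests synergy. The function names and values below are hypothetical.

```python
def bliss_expected(effect_a, effect_b):
    """Expected combined fractional effect if the two drugs act independently."""
    for effect in (effect_a, effect_b):
        if not 0.0 <= effect <= 1.0:
            raise ValueError("effects must be fractions between 0 and 1")
    return effect_a + effect_b - effect_a * effect_b

def bliss_excess(observed, effect_a, effect_b):
    """Observed minus expected effect; positive values suggest synergy."""
    return observed - bliss_expected(effect_a, effect_b)

# Hypothetical fractional growth inhibition values (illustrative only).
excess = bliss_excess(observed=0.90, effect_a=0.50, effect_b=0.50)
print(f"Bliss excess = {excess:.2f}")
```

Other reference models (for example Loewe additivity) can give different answers for the same data, so the choice of model is itself an assumption.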
3:15 Refreshment Break in the Exhibit Hall with Poster Viewing
PHARMACODYNAMIC MODELS FOR ONCOLOGY
3:45 Systems Biology in Cancer Immunotherapy: Applications in the Understanding of Mechanism of Action and Therapeutic Response
Debraj Guha Thakurta, Ph.D., Senior Scientist II & Group Leader, Systems Biology, Dendreon Corporation
We are using high-content platforms (DNA and protein microarrays, RNA-seq) in various stages of the development of cellular immunotherapies for cancer. We will provide examples of genomic applications that can aid in the mechanistic understanding and the discovery of molecular markers associated with the efficacy of a cancer immunotherapy.
4:15 Use of Systems Pharmacology to Aid Cancer Clinical Development
Anna Georgieva Kondic, Ph.D., MBA, Senior Principal Scientist, Modeling and Simulation, Merck Research Labs
The last few years have seen an increased use of physiologically-based pharmacokinetic and pharmacodynamic models in oncology drug development. This is partially due to an improved mechanistic understanding of disease drivers and the collection of better patient-level quantitative data that lends itself to modeling. In this talk, a suite of studies where systems modeling was successfully used to inform either preclinical to clinical transition or clinical study design will be presented. The talk will conclude with a potential systems pharmacology framework that can be used systematically in drug development.
4:45 Two-Edged Sword Role of the Mammalian DNA Methyltransferases: New Implication to Cancer Therapy Targeting the Epigenetic Pathway
Che-Kun James Shen, Ph.D., Distinguished Research Fellow, Institute of Molecular Biology, Academia Sinica
Methylation at the 5-position of cytosine (C) to generate 5-methylcytosine (5-mC) on vertebrate genomes is an essential epigenetic modification that regulates different biological processes, including carcinogenesis. This modification is known to be accomplished by the combined catalytic actions of three DNA methyltransferases (DNMTs): the de novo enzymes DNMT3A and DNMT3B and the maintenance enzyme DNMT1. This property of DNMTs and the imbalance of CpG methylation in cancer cells have led to the development of cancer therapeutic drugs/chemicals targeting the DNA methylation activities of DNMTs. However, we have recently discovered that the mammalian DNMTs could also act as active DNA 5-mC demethylases in a Ca2+ ion- and redox state-dependent manner. This suggests new directions for re-investigation of the structures of DNMTs and their functions in genome-wide and/or local DNA methylation in mammalian cells. In particular, the concept and strategies for drug therapy targeting the DNMTs may need to be re-evaluated.
5:15 Best of Show Awards Reception in the Exhibit Hall
6:15 Exhibit Hall Closes
Thursday, April 11
7:00 am Breakfast Presentation (Sponsorship Opportunity Available) or Morning Coffee
MODELING AND MINING TARGETS
8:45 Chairperson’s Opening Remarks
I-Ming Wang, Ph.D., Associate Scientific Director, Research Solutions and Bioinformatics, Informatics and Analysis, Merck Research Laboratory
8:50 Systems Biology Approach for Identification of New Targets and Biomarkers
I-Ming Wang, Ph.D., Associate Scientific Director, Research Solutions and Bioinformatics, Informatics and Analysis, Merck Research Laboratory
A representative gene signature was identified by an integrated analysis of expression data in twelve rodent inflammatory models/tissues. This “inflammatome” signature is highly enriched in known drug target genes and is significantly overlapped with macrophage-enriched metabolic networks (MEMN) reported previously. A large proportion of genes in this signature are tightly connected in several tissue-specific Bayesian networks built from multiple mouse F2 crosses and human tissue cohorts; furthermore, these tissue networks are very significantly overlapped. This indicates that variable expression in this set of co-regulated genes is the main driver of many disease states. Disease-specific gene sets with the potential of being utilized as biomarkers were also identified with the approach we applied. The identification of this “inflammatome” gene signature extends the coverage of MEMN beyond adipose and liver in the metabolic disease to multiple diseases involving various affected tissues.
9:20 Optimizing Therapeutic Index (TI) by Exploring Co-Dependencies of Target and Therapeutic Properties
Conventional drug-discovery informatics workflows employ combinations of mechanistic/probabilistic in-silico methods to rank lists of targets; therapeutics are then developed for “optimal” targets. I describe a systems pharmacology approach that instead integrates systematic in-silico therapeutic perturbation with models of target/disease biology to identify conditions for optimal TI; non-intuitively optimal TI is sometimes achieved by pairing sub-optimal targets with therapeutics having appropriate properties.
9:50 Leveraging Mathematical Models to Understand Population Variability in Response to Cardiac Drugs
Eric Sobie, Ph.D., Associate Professor, Pharmacology & Systems Therapeutics, Icahn School of Medicine, Mount Sinai School of Medicine
Mathematical models of heart cells and tissues are sufficiently advanced that the models can predict mechanisms underlying pro-arrhythmic or anti-arrhythmic effects of drugs. At present, however, these models are not adequate for understanding variability across a population, i.e., why a drug may be effective in one patient but ineffective in another patient. I will describe novel computational approaches my laboratory has developed to quantify and predict differences between individuals in response to cardiac drugs.
10:20 Coffee Break in the Exhibit Hall and Poster Competition Winners Announced
10:45 Plenary Keynote Panel Chairperson’s Remarks
Kevin Davies, Ph.D., Editor-in-Chief, Bio-IT World
10:50 Plenary Keynote Panel Introduction
Yury Rozenman, Head of BT for Life Sciences, BT Global Services
» PLENARY KEYNOTE PANEL
11:05 The Life Sciences CIO Panel
Panelists:
Remy Evard, CIO, Novartis Institutes for BioMedical Research
Martin Leach, Ph.D., Vice President, R&D IT, Biogen Idec
Andrea T. Norris, Director, Center for Information Technology (CIT) and Chief Information Officer, NIH
Gunaretnam (Guna) Rajagopal, Ph.D., VP & CIO – R&D IT, Research, Bioinformatics & External Innovation, Janssen Pharmaceuticals
Cris Ross, Chief Information Officer, Mayo Clinic
12:15 pm Luncheon in the Exhibit Hall with Poster Viewing
MODELING MOLECULAR AND PATHOPHYSIOLOGICAL DATA
1:55 Chairperson’s Remarks
Jake Chen, Ph.D., Associate Professor, Indiana University School of Informatics & Purdue University Department of Computer Science; Director, Indiana Center for Systems Biology and Personalized Medicine
2:00 Predicting Adverse Side Effects of Drugs Using Systems Pharmacology
Jake Chen, Ph.D., Associate Professor, Indiana University School of Informatics & Purdue University Department of Computer Science; Director, Indiana Center for Systems Biology and Personalized Medicine
A new way of studying drug toxicity is to incorporate biomolecular annotation and network data with clinical observations of drug targets upon drug perturbations. I will describe the development of a novel computational modeling framework, with which we demonstrated by far the highest drug toxicity prediction accuracies reported to date. Adoption of this framework may have profound practical implications for drug discovery.
2:30 Holistic Integration of Molecular and Physiological Data and Its Application in Personalized Healthcare
David de Graaf, Ph.D. President and CEO, Selventa
There are multiple industry-wide challenges in aggregating molecular and pathophysiological data for systems pharmacology to transform the process of drug discovery and development. One of the ways to address these challenges is to utilize a common computable biological expression language (BEL) that can provide a comprehensive knowledge network for new discoveries. An application of BEL and its use in identifying clinically relevant predictive biomarkers for patient stratification will be presented.
3:00 The Role of Informatics in ADME Pharmacogenetics
Boyd Steere, Ph.D., Senior Research Scientist, Lilly Research Laboratories, IT Research Informatics, Eli Lilly
The leveraging of pharmacogenetics to support decisions in early-phase clinical trial design requires informatics methods to integrate, visualize, and analyze heterogeneous data sets from many different discovery platforms. This presentation describes challenges and solutions in making sense of diverse sets of genetic, protein, and metabolic data in support of ADME pharmacology projects.
3:30 A Systems Pharmacology Approach to Understand and Optimize Functional Selectivity for Non-Selective Drugs
Joshua Apgar, Principal Scientist, Systems Biology, Dept. of Immunology & Inflammation, Boehringer Ingelheim Pharmaceuticals, Inc.
Most commonly, the selectivity of a compound is defined in an in vitro or cellular assay, and it is thought of as principally a function of the binding energy of the drug to its on-target and off-target proteins. However, in vivo functional selectivity is much more complicated, and is affected by systems-level effects such as multiple feedback processes within and between the various on- and off-target pathways. These systems-level processes are often impossible to reconstruct in vitro, as they involve many cell types, tissues, and organ systems throughout the body. We show here that through mathematical modeling we were able to identify, in silico, molecular properties that are critical to driving functional selectivity. The models, although simple, capture the key systems pharmacology needed to understand the on- and off-target effects. Surprisingly, in this case, the key driver of functional selectivity is not the affinity of the drugs but rather the pharmacokinetics, with drugs having a short half-life predicted to be the most functionally selective.
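The abstract does not describe the models themselves. As a rough illustration of how half-life alone can change exposure, the sketch below assumes simple first-order elimination after a single dose and asks how long the concentration stays above a hypothetical off-target threshold. The function names, the threshold, and the numbers are assumptions for illustration, not the speaker's model.

```python
import math

def concentration(c0, half_life_h, t_h):
    """Concentration at time t_h (hours) under first-order elimination from c0."""
    return c0 * 0.5 ** (t_h / half_life_h)

def hours_above(c0, half_life_h, threshold):
    """Hours for which concentration stays above threshold after a single dose."""
    if threshold >= c0:
        return 0.0
    return half_life_h * math.log2(c0 / threshold)

# Hypothetical drugs with the same starting concentration but different half-lives.
for half_life in (2.0, 8.0):
    duration = hours_above(c0=100.0, half_life_h=half_life, threshold=25.0)
    print(f"half-life {half_life} h: {duration} h above the off-target threshold")
```

Under these assumptions the time spent above any fixed threshold scales linearly with half-life, which is one simple way a shorter half-life could limit off-target engagement.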
See the Winners of the following 2013 Awards:
Benjamin Franklin
Best of Show
Best Practices
View Novel Technologies and Solutions in the Expansive Exhibit Hall
And Much More!
KEYNOTE PRESENTERS:
Atul Butte, M.D., Ph.D., Division Chief and Associate Professor, Stanford University School of Medicine; Director, Center for Pediatric Bioinformatics, Lucile Packard Children’s Hospital; Co-founder, Personalis and Numedii
Andrew L. Hopkins, D.Phil, FRSC, FSB, Division of Biological Chemistry and Drug Design, College of Life Sciences, University of Dundee
PLENARY SESSION:
The Life Sciences CIO Panel
From managing big data and cloud computing capabilities to building virtual communities and optimizing drug development, the life sciences CIO has to be a firefighter, evangelist, visionary. In this special plenary roundtable, Bio-IT World invites a select group of CIOs from big pharma, academia and government to discuss the major issues facing today’s biosciences organization and the prospects for future growth and organizational success.
Special guests:
Remy Evard – CIO, Novartis Institutes for BioMedical Research
Martin Leach, Ph.D., Vice President, R&D IT, Biogen Idec
Andrea T. Norris – Director, Center for Information Technology (CIT) and CIO, NIH
Gunaretnam (Guna) Rajagopal, Ph.D., VP & CIO – R&D IT, Research, Bioinformatics & External Innovation, Janssen Pharmaceuticals
Cris Ross – CIO, Mayo Clinic
FEATURED SESSIONS:
Managing Big Data: The Genome Center Perspective
Panelists Include: Matthew Trunnell (Broad Institute)
Alexander (Sasha) Wait Zaranek, (Harvard Medical School/Clinical Future, Inc.)
Guy Coates (The Wellcome Trust Sanger Institute)
Building the IT Architecture of the New York Genome Center
Chris Dwan, Acting Senior Vice President, Information Technology and Research Computing, New York Genome Center
Kevin Shianna, Senior Vice President, Sequencing Operations, New York Genome Center
Jim Harding, CTO, Sabey Corporation
Sanjay Joshi, CTO, Life Sciences, EMC Isilon Storage Division
Robert B. Darnell, M.D., Ph.D., President & Scientific Director, New York Genome Center
Additional Speakers to be Announced
Personalized Medicine: Clinical Aspiration of Microarrays
Reporter, Writer: Stephen J. Williams, Ph.D.
In this month’s Science, Mike May (at http://www.sciencemag.org/site/products/lst_20130215.xhtml) describes some of the challenges and successes in introducing microarray analysis to the clinical setting. Traditionally used for investigational research, microarrays are now being developed, customized, and used in a disease-specific manner for biomarker analysis and for their prognostic and predictive value.
Challenges in data interpretation
In an interview, Seth Crosby, director of the Genome Technology Access Center at Washington University School of Medicine in St. Louis, says that “the biggest challenge” in moving microarrays to the clinical setting is data interpretation. Current technology makes it possible to evaluate the expression of thousands of genes from a patient’s sample; the difficulty, as Crosby describes it, is assigning clinical relevance to the data. For example, Crosby explains that Washington University has validated a panel of 45 oncology genes by next-generation sequencing and is using these genes to develop diagnostic tests that screen patient tumors in order to determine a personalized therapeutic strategy. Crosby noted that it took “hundreds of Ph.D. and M.D. hours” to sift through hundreds of papers to determine which genes were relevant to a specific cancer type. However, he notes that once we better understand which changes in a patient’s genome are related to a specific disease, we will be able to narrow down the list and produce microarrays that are both more economical and more disease-relevant.
Is this aberration pathogenic or not?
Microarrays are becoming an invaluable tool in cytogenetics, as alluded to by Andy Last, executive vice president of the genetic analysis business unit at Affymetrix. Certain diseases, such as Down syndrome, have well-characterized chromosomal alterations, such as additions or deletions of parts of chromosomes or of entire chromosomes. According to Affymetrix, the most common use of microarrays is for determining copy number variation. However, according to James Clough, vice president of clinical and genomic services at Oxford Gene Technology, given the hundreds of syndromes associated with chromosomal rearrangements, the challenge will be to determine whether a small chromosomal aberration has pathologic significance, since microarrays afford much higher diagnostic yield and speed of analysis than traditional microscopic techniques. To address this challenge, Oxford Gene Technology, PerkinElmer, Affymetrix, and Agilent have all custom-designed copy number and SNP (single nucleotide polymorphism) microarrays for specific diseases. For example, PerkinElmer designed OncoChip™ to evaluate copy number variation in more than 1,800 cancer genes. Agilent makes microarrays that evaluate both copy number variation and SNPs, such as its CGH (comparative genomic hybridization) plus SNP microarrays. Patricia Barco, product manager for cytogenetics at Agilent, notes that these arrays can be used in prenatal and postnatal research and in cancer, and “can be customized from more than 28 million probes in our library”.
Custom Tools and Software to Handle the Onslaught of Big Data
There is a need for FDA-approved diagnostic tools based on microarrays. Pathwork Diagnostics has one such tool (the Pathwork Tissue of Origin test), which uses 2,000 transcript markers and a proprietary computational algorithm to determine, from expression analysis, the tissue of origin of a patient’s tumor. Pathwork also provides a fast, custom turn-around analytical service for pathologists who encounter difficult-to-interpret samples. Illumina provides the Infinium HumanCore BeadChip family of microarrays, which can determine genetic variations for purposes of biological tissue banking. This system uses a set of over 300,000 SNP probes plus 240,000 exome-based markers.
Tools have also been developed to validate microarray results. A common validation strategy is the use of quantitative real-time PCR to verify the expression changes seen on the microarray. Life Technologies developed the TaqMan OpenArray Real Time PCR plates, which have 3,072 wells and can be custom-formatted using their library of eight million validated TaqMan assays.
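The validation step described above is often summarized with the comparative Ct (2^-ΔΔCt) calculation. The following is a minimal sketch of that calculation and of a simple direction-of-change check against a microarray log2 ratio. It assumes roughly 100% amplification efficiency, and the function names are hypothetical rather than part of any TaqMan software.

```python
import math


def fold_change_ddct(ct_target_treated: float, ct_ref_treated: float,
                     ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression by the comparative Ct method (2^-ddCt).

    Assumes the target and reference assays both amplify with ~100% efficiency.
    """
    delta_treated = ct_target_treated - ct_ref_treated   # normalize to reference gene
    delta_control = ct_target_control - ct_ref_control
    ddct = delta_treated - delta_control
    return 2.0 ** (-ddct)


def agrees_in_direction(array_log2_ratio: float, qpcr_fold_change: float) -> bool:
    """True when the array and qPCR both report up- or both report down-regulation."""
    return (array_log2_ratio > 0) == (math.log2(qpcr_fold_change) > 0)


# Hypothetical Ct values: the target comes up two cycles earlier after treatment.
fc = fold_change_ddct(22.0, 18.0, 24.0, 18.0)
print(fc, agrees_in_direction(1.5, fc))
```

A lower Ct means more starting template, so a ΔΔCt of -2 corresponds to a four-fold increase.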
Making Sense of the Big Data: Bridging the Knowledge Gap using Bioinformatics
The use of microarrays has spawned industries devoted to developing bioinformatics software that analyzes the massive amounts of data and extracts their clinical significance. For example, companies such as Expression Analysis use their bioinformatics software to provide pathway analysis of microarray data in order to translate the data into biology. Such strategies can also validate the design of microarrays for various diseases.
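One widely used form of pathway analysis is an over-representation test: given a list of differentially expressed genes, ask whether a pathway contains more of them than chance would predict. The sketch below computes the one-sided hypergeometric p-value using only the Python standard library. It illustrates the general technique and is not the method of any particular vendor; the function name and the example counts are assumptions.

```python
from math import comb


def enrichment_p_value(universe: int, pathway_size: int,
                       hits: int, overlap: int) -> float:
    """P(X >= overlap) when `hits` genes are drawn without replacement
    from `universe` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, hits)
    total = comb(universe, hits)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, hits - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes on the array, a 100-gene pathway,
# 400 differentially expressed genes, 10 of which fall in the pathway.
print(enrichment_p_value(20000, 100, 400, 10))
```

In practice the p-values for many pathways are computed together and corrected for multiple testing.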
Foundation Medicine, Inc., a molecular information company, provides cancer genomics test solutions. It offers FoundationOne, an informative genomic profile to identify a patient’s individual molecular alterations and match them with relevant targeted therapies and clinical trials. The company’s product enables physicians to recommend treatment options for patients based on the molecular subtype of their cancer.
Cancer research has rapidly embraced high-throughput technologies, using various microarray, tissue array, and next-generation sequencing platforms. The result has been a rapid increase in cancer data output and data types. Now more than ever, having bioinformatic skills and knowledge of the bioinformatic resources specific to cancer is critical. The CBW will host a 5-day workshop covering the key bioinformatics concepts and tools required to analyze cancer genomic data sets. Participants will gain experience with genomic data visualization tools, which will be applied throughout the workshop while developing the skills required to analyze cancer -omic data for gene expression, genome rearrangement, somatic mutations and copy number variation. The workshop will conclude with pathway analysis of the resulting cancer gene list and integration of clinical data.
Successful Examples of Clinical Ventures Integrating Bioinformatics in Cancer Treatment Decision-Making
The University of Pavia, Italy developed a fully integrated oncology bioinformatics workflow as described on their website and at the ESMO 2012 Congress meeting:
ONCO-I2B2 PROJECT: A BIOINFORMATICS TOOL INTEGRATING -OMICS AND CLINICAL DATA TO SUPPORT TRANSLATIONAL RESEARCH
Abstract: 2530
Congress: ESMO 2012
Type: Abstract
Topic: Translational research
Authors: A. Zambelli, D. Segagni, V. Tibollo, A. Dagliati, A. Malovini, V. Fotia, S. Manera, R. Bellazzi; Pavia/IT
Body
The ONCO-i2b2 project, supported by the University of Pavia and the Fondazione Salvatore Maugeri (FSM), aims at supporting translational research in oncology and exploits the software solutions implemented by the Informatics for Integrating Biology and the Bedside (i2b2) research centre, an initiative funded by the NIH Roadmap National Centres for Biomedical Computing. The ONCO-i2b2 software is designed to integrate the i2b2 infrastructure with the FSM hospital information system and the Bruno Boerci Biobank, in order to provide well-characterized cancer specimens along with an accurate database of patients’ clinical data. The i2b2 infrastructure provides web-based access to all the electronic medical records of cancer patients and allows researchers to analyze the vast amount of biological and clinical information through a user-friendly interface. Data coming from multiple sources are integrated and jointly queried.
At the 2011 AIOM meeting we reported the preliminary experience of the ONCO-i2b2 project; we can now present the up-and-running platform and the extended data set. Currently, more than 4400 specimens are stored, and more than 600 breast cancer patients have given consent for the use of their specimens in clinical research. In addition, more than 5000 histological reports are stored in order to integrate clinical data.
Within the ONCO-i2b2 project it is possible to query and merge data regarding:
• Anonymous patient personal data;
• Diagnosis and therapy ICD9-CM subset from the hospital information system;
• Histological data (tumour SNOMED and TNM codes) and receptor profile testing (Her2, Ki67) from anatomic pathology database;
• Specimen molecular characteristics (DNA, RNA, blood, plasma and cancer tissues) from the Bruno Boerci Biobank management system.
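The idea of jointly querying the sources listed above can be sketched as a merge keyed on an anonymized patient identifier. This is a toy illustration only and does not use the i2b2 software or its query interface: the function names, field names, and records are all hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_sources(*sources: Dict[str, Record]) -> Dict[str, Record]:
    """Combine per-source records that share an anonymized patient ID."""
    merged: Dict[str, Record] = {}
    for source in sources:
        for patient_id, fields in source.items():
            merged.setdefault(patient_id, {}).update(fields)
    return merged


def query(merged: Dict[str, Record], **criteria: Any) -> List[str]:
    """Return the IDs of patients whose merged record matches every criterion."""
    return sorted(
        pid for pid, rec in merged.items()
        if all(rec.get(key) == value for key, value in criteria.items())
    )


# Hypothetical toy records standing in for the three kinds of source.
hospital = {"P001": {"icd9_cm": "174.9"}, "P002": {"icd9_cm": "174.9"}}
pathology = {"P001": {"her2": "positive"}, "P002": {"her2": "negative"}}
biobank = {"P001": {"specimen": "RNA"}}

merged = merge_sources(hospital, pathology, biobank)
print(query(merged, icd9_cm="174.9", her2="positive"))
```

Keying every source on the same anonymized identifier is what lets clinical, pathology, and biobank data be filtered in a single query.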
The research infrastructure will be completed by the development of a new set of components designed to enhance the ability of an i2b2 hive to utilize data generated by NGS technology, providing a mechanism to apply custom genomic annotations. The translational tool created at FSM is a concrete example of how integrating information from heterogeneous sources could bring scientific research closer to understanding the nature of disease itself and to creating novel diagnostics through handy interfaces.
Disclosure
All authors have declared no conflicts of interest.
Cancer Bioinformatics: Recovery Act Investment Report
November 2009
Public Health Burden of Cancer
Cancer is the second leading cause of death in the United States after heart disease. In 2009, it is estimated that nearly 1.5 million new cases of invasive cancer will be diagnosed in this country and more than 560,000 people will die of the disease.
Over the past five years, NCI’s Center for Biomedical Informatics and Information Technology (CBIIT) has led the effort to develop and deploy the cancer Biomedical Informatics Grid® (caBIG) in partnership with the broader cancer community. The caBIG network is designed to enable the integration and exchange of data among researchers in the laboratory and the clinic, simplify collaboration, and realize the potential of information-based (personalized) medicine in improving patient outcomes. caBIG has connected major components of the cancer community, including NCI-designated Cancer Centers, participating institutions of the NCI Community Cancer Centers Program (NCCCP), and numerous large-scale scientific endeavors, as well as basic, translational, and clinical researchers at public and private institutions across the United States and around the world. Beyond cancer research, caBIG capabilities—infrastructure, standards, and tools—provide a prototype for linking other disease communities and catalyzing a new 21st-century biomedical ecosystem that unifies research and care. ARRA funding will allow NCI to accelerate the ongoing development of the Cancer Knowledge Cloud and Oncology Electronic Health Records (EHRs) initiatives, thereby providing for continued job creation in the areas of biomedical informatics development and application as well as healthcare delivery.
The caBIG Cancer Knowledge Cloud: Extending the Research Infrastructure
The Cancer Knowledge Cloud is a virtual biomedical capability that utilizes caBIG tools, infrastructure, and security frameworks to integrate distributed individual and organizational data, software applications, and computational capacity throughout the broad cancer research and treatment community. The Cancer Knowledge Cloud connects, integrates, and facilitates sharing of the diverse primary data generated through basic and clinical research and care delivery to enable personalized medicine. The cloud includes information generated through large-scale research projects such as The Cancer Genome Atlas (TCGA), the cancer Human Biobank (caHUB) tissue acquisition network, the NCI Functional Biology Consortium, the NCI Patient Characterization Center, and the NCI Preclinical Development Pipeline, academic and industry counterparts to these projects, and clinical observations (from entities such as the NCCCP) captured in oncology-extended Electronic Health Records. Through the use of the caBIG Data Sharing and Security Framework, the Cloud will support appropriate sharing of information, supporting in silico hypothesis generation and testing, and enabling a learning healthcare system.
A caBIG-Based Rapid-Learning Healthcare System: Incorporating Oncology-Extended Electronic Healthcare Records (EHRs)
The 21st-century Cancer Knowledge Cloud will connect individuals, organizations, institutions, and their associated information within an information technology-enabled cycle of discovery, development, and clinical care—the paradigm of a rapid-learning healthcare system. This will transform these disconnected sectors into a system that is personalized, preventive, pre-emptive, and patient-participatory. To be realized, this model requires the adoption of standards-based EHRs. Presently, however, no certified oncology-based EHR exists, and fewer than 3 percent of oncologists with outpatient-based practices utilize EHRs. caBIG has recently established a collaboration with the American Society of Clinical Oncology (ASCO) to develop an oncology-specific EHR (caEHR) specification based on open standards already in use in the oncology community that will utilize caBIG standards for interoperability. NCI will implement an open-source version of this specification to validate the specification and to provide a free alternative to sites that choose not to purchase a commercial system. The launch customer for the caEHR will be NCCCP participating sites. NCI will work with appropriate entities to provide a mechanism for certifying that caEHR implementations are consistent with the NCI/ASCO specification.
Barts Cancer Institute has another clinical bioinformatics program to support its clinical efforts:
Clinical Bioinformatics Program in Oncology at Barts Cancer Institute at Barts and the London School of Medicine
Bioinformatics is a new interdisciplinary area involving biological, statistical and computational sciences. Bioinformatics will enable cancer researchers not only to manage, analyze, mine and understand the currently accumulated, valuable, high-throughput data, but also to integrate these in their current research programs. The need for bioinformatics will become ever more important as new technologies increase the already exponential rate at which cancer data are generated.
What we do
We work alongside clinical and basic scientists to support the cancer projects within BCI. This is an ideal partnership between scientific experts, who know the research questions that will be relevant from a cancer biologist or clinician’s perspective, and bioinformatics experts, who know how to develop the proposed methods to provide answers.
We also conduct independent bioinformatics research, focusing on the development of computational and integrative methods, algorithms, databases and tools to tackle the analysis of the high volumes of cancer data.
We also are actively involved in the development of bioinformatics educational courses at BCI. Our courses offer a unique opportunity for biologists to gain a basic understanding in the use of bioinformatics methods to access and harness large complicated high-throughput data and uncover meaningful information that could be used to understand molecular mechanisms and develop novel targeted therapeutics/diagnostic tools.
Developing Criteria for Genomic Profiling in Lung Cancer:
A Report from U.S. Cancer Centers
In a report by Pao et al., a group of clinicians organized a meeting to standardize some protocols for the integration of microarray and genomic data from lung cancer patients into the clinical setting.[1] There has been ample evidence that adenocarcinomas can be classified into “clinically relevant molecular subsets” based on distinct genomic changes. For example, EGFR (epidermal growth factor receptor) exon 19 deletions and exon 21 point mutations predict sensitivity to tyrosine kinase inhibitors (TKIs) like gefitinib, whereas exon 20 insertions predict primary resistance[2].
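The EGFR example can be written down as a small lookup table, which is roughly the kind of rule an integrated informatics system would need to encode. The sketch below contains only the three associations stated in the paragraph above; the names are hypothetical, and it is an illustration of the data structure, not a clinical decision tool.

```python
# (exon, alteration type) -> predicted response to EGFR tyrosine kinase inhibitors.
# Only the three associations mentioned in the text are encoded.
EGFR_TKI_RULES = {
    (19, "deletion"): "sensitive",
    (21, "point_mutation"): "sensitive",
    (20, "insertion"): "primary_resistance",
}


def predict_tki_response(exon: int, alteration: str) -> str:
    """Look up the predicted TKI response; anything not listed is 'unclassified'."""
    return EGFR_TKI_RULES.get((exon, alteration), "unclassified")


for exon, alteration in [(19, "deletion"), (20, "insertion"), (18, "point_mutation")]:
    print(exon, alteration, predict_tki_response(exon, alteration))
```

Returning an explicit "unclassified" value keeps alterations outside the table from being silently treated as either sensitive or resistant.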
However, as the authors note, “mutational profiling has not been widely accepted or adopted into practice in thoracic oncology”.
Therefore, a multi-institutional workshop was held in 2009 among participants from Massachusetts General Hospital (MGH) Cancer Center, Memorial Sloan-Kettering Cancer Center (MSKCC), the Dana-Farber/Brigham and Women’s Cancer Center (DF/BWCC), the M.D. Anderson Cancer Center (MDACC), and the Vanderbilt-Ingram Cancer Center (VICC) to discuss their institutes’ molecular profiling programs, with emphasis on:
• Organization/workflow
• Mutation detection technologies
• Clinical protocols and reporting
• Patient consent
In addition to the aforementioned challenges, the panel discussed further issues for developing improved science-driven criteria for determining targeted therapies including:
1) Including pathologists in criteria development, as pathology departments are usually the main repositories for specimens
2) Developing integrated informatics systems
3) Standardizing new target validation methodology across cancer centers
This is a lovely method and should find wide applicability in many settings, especially for microorganisms and cell lines. However, it is not clear that this approach will be, as implied by the discussion, an efficient mapping method for all multicellular organisms. I have performed similar experiments in Drosophila, focused on meiotic recombination, on a much smaller scale, and found that CRISPR-Cas9 can indeed generate targeted recombination at gRNA target sites. In every case I tested, I found that the recombination event was associated with a deletion at the gRNA site, which is probably unimportant for most mapping efforts, but may be a concern in some specific cases, for example for clinical applications. It would be interesting to know how often mutations occurred at the targeted gRNA site in this study.
The wider issue, however, is whether CRISPR-mediated recombination will be more efficient than other methods of mapping. After careful consideration of all the costs and the time involved in each of the steps for Drosophila, we have decided that targeted meiotic recombination using flanking visible markers will be, in most cases, considerably more efficient than CRISPR-mediated recombination. This is mainly due to the large expense of injecting embryos and the extensive effort and time required to screen injected animals for appropriate events. It is both cheaper and faster to generate markers (with CRISPR) and then perform a large meiotic recombination mapping experiment than it would be to generate the lines required for CRISPR-mediated recombination mapping. It is possible to dramatically reduce costs by, for example, mapping sequentially at finer resolution. But this approach would require much more time than marker-assisted mapping. If someone develops a rapid and cheap method of reliably introducing DNA into Drosophila embryos, then this calculus might change.
However, it is possible to imagine situations where CRISPR-mediated mapping would be preferable, even for Drosophila. For example, some genomic regions display extremely low or highly non-uniform recombination rates. It is possible that CRISPR-mediated mapping could provide a reasonable approach to fine mapping genes in these regions.
The authors also propose the exciting possibility that CRISPR-mediated loss of heterozygosity could be used to map traits in sterile species hybrids. It is not entirely obvious to me how this experiment would proceed and I hope the authors can illuminate me. If we imagine driving a recombination event in the early embryo (with maternal Cas9 from one parent and gRNA from a second parent), then at best we would end up with chimeric individuals carrying mitotic clones. I don’t think one could generate diploid animals where all cells carried the same loss of heterozygosity event. Even if we could, this experiment would require construction of a substantial number of stable transgenic lines expressing gRNAs. Mapping an ~20Mbp chromosome arm to ~10kb would require on the order of two-thousand transgenic lines. Not an undertaking to be taken lightly. It is already possible to perform similar tests (hemizygosity tests) using D. melanogaster deficiency lines in crosses with D. simulans, so perhaps CRISPR-mediated LOH could complement these deficiency screens for fine mapping efforts. But, at the moment, it is not clear to me how to do the experiment.
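The estimate of roughly two thousand transgenic lines follows from dividing the region length by the target resolution. A minimal sketch of that arithmetic, assuming one gRNA line per resolution-sized interval (the function name is hypothetical):

```python
import math


def lines_needed(region_bp: int, resolution_bp: int) -> int:
    """Number of gRNA lines if one line is required per interval of `resolution_bp`."""
    return math.ceil(region_bp / resolution_bp)


# A ~20 Mbp chromosome arm mapped to ~10 kb resolution.
print(lines_needed(20_000_000, 10_000))
```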