
The Use of ChatGPT in the World of BioInformatics and Cancer Research, and the Development of BioGPT by Microsoft

Curator: Stephen J. Williams, Ph.D.

Chatbots are being used in multiple interdisciplinary areas in research and medicine, so it was a natural progression to incorporate artificial intelligence (AI), natural language processing (NLP), and chatbot technology like ChatGPT into bioinformatic analysis.

“In domains like informatics, management and marketing, media and communication science, languages and philosophy, psychology and sociology, engineering, design, and human-computer interaction, the fast-expanding body of chatbot study is clearly interdisciplinary.”

The field of bioinformatics is a natural fit for incorporating this technology.  The curated information below shows some examples of the use of this technology in bioinformatics related to cancer research.

The text in this reference source was actually generated by ChatGPT (https://chat.openai.com/chat)

Source: https://omicstutorials.com/using-chatgpt-in-bioinformatics-and-biomedical-research/

How can ChatGPT be used in bioinformatics research?
ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model that was designed to generate human-like text in a conversational setting. It is not directly related to bioinformatics, which is the field of study that deals with the storage, retrieval, and analysis of biological data, particularly in the context of genetics and genomics. However, ChatGPT could potentially be used in a bioinformatics context as a tool for generating natural language descriptions of genetic or genomic data, or for generating responses to queries about such data.

For example, a researcher could use ChatGPT to generate descriptions of genetic variants or gene expression patterns based on input data. This could be useful for generating summaries of results for reports or papers, or for generating explanations of complex concepts for non-expert audiences. ChatGPT could also be used to generate responses to queries about genetic or genomic data, such as “What is the function of gene X?” or “What are the potential consequences of variant Y?”

It’s worth noting that ChatGPT is just one of many tools and techniques that can be used in bioinformatics, and it is unlikely to be the most suitable or effective option in every situation. There are many specialized tools and resources available for working with biological data, and it is often necessary to use a combination of these tools to fully analyze and understand the data.

Generating descriptions of genetic or genomic data: ChatGPT could be used to generate natural language descriptions of genetic or genomic data based on input data. For example, suppose a researcher has a dataset containing information about gene expression levels in different tissues. The researcher could use ChatGPT to generate a description of the data, such as:
“Gene X is highly expressed in the liver and kidney, with moderate expression in the brain and heart. Gene Y, on the other hand, shows low expression in all tissues except for the lung, where it is highly expressed.”
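
In practice, a prompt like this can be sent to the model programmatically. Below is a minimal, hedged sketch using the OpenAI Python client; the model name, the toy expression calls, and the prompt wording are assumptions for illustration, not part of the original post.

```python
# Illustrative sketch: asking a chat model to describe a small
# gene-expression table in plain language. Model name and toy data
# are assumptions; requires the openai package and an API key.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

expression_calls = {
    "GeneX": {"liver": "high", "kidney": "high", "brain": "moderate", "heart": "moderate"},
    "GeneY": {"lung": "high", "other tissues": "low"},
}

prompt = (
    "Write a short, plain-language summary of these tissue-level "
    f"gene-expression calls for a non-expert reader: {expression_calls}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```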

 

Thereby ChatGPT, at its simplest level, could be used to ask general questions like “What is the function of gene product X?” and give a reasonable response without the scientist having to browse through even highly curated databases like GeneCards, UniProt, or GenBank.  Or even “What are potential interactors of Gene X, validated by yeast two-hybrid?” without going to the curated interactome databases or using expensive software like Genie.

Summarizing results: ChatGPT could be used to generate summaries of results from genetic or genomic studies. For example, a researcher might use ChatGPT to generate a summary of a study that found an association between a particular genetic variant and a particular disease. The summary might look something like this:
“Our study found that individuals with the variant form of gene X are more likely to develop disease Y. Further analysis revealed that this variant is associated with changes in gene expression that may contribute to the development of the disease.”

It’s worth noting that ChatGPT is just one tool that could potentially be used in these types of applications, and it is likely to be most effective when used in combination with other bioinformatics tools and resources. For example, a researcher might use ChatGPT to generate a summary of results, but would also need to use other tools to analyze the data and confirm the findings.

ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model that is designed for open-domain conversation. It is not specifically designed for generating descriptions of genetic variants or gene expression patterns, but it can potentially be used for this purpose if you provide it with a sufficient amount of relevant training data and fine-tune it appropriately.

To use ChatGPT to generate descriptions of genetic variants or gene expression patterns, you would first need to obtain a large dataset of examples of descriptions of genetic variants or gene expression patterns. You could use this dataset to fine-tune the ChatGPT model on the task of generating descriptions of genetic variants or gene expression patterns.

Here’s an example of how you might use ChatGPT to generate a description of a genetic variant:

First, you would need to pre-process your dataset of descriptions of genetic variants to prepare it for use with ChatGPT. This might involve splitting the descriptions into individual sentences or phrases, and encoding them using a suitable natural language processing (NLP) library or tool.

Next, you would need to fine-tune the ChatGPT model on the task of generating descriptions of genetic variants. This could involve using a tool like Hugging Face’s Transformers library to load the ChatGPT model and your pre-processed dataset, and then training the model on the task of generating descriptions of genetic variants using an appropriate optimization algorithm.

Once the model has been fine-tuned, you can use it to generate descriptions of genetic variants by providing it with a prompt or seed text and asking it to generate a response. For example, you might provide the model with the prompt “Generate a description of a genetic variant associated with increased risk of breast cancer,” and ask it to generate a response. The model should then generate a description of a genetic variant that is associated with increased risk of breast cancer.

It’s worth noting that generating high-quality descriptions of genetic variants or gene expression patterns is a challenging task, and it may be difficult to achieve good results using a language model like ChatGPT without a large amount of relevant training data and careful fine-tuning.
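
To make the workflow above concrete: ChatGPT itself is not available for local fine-tuning, so the hedged sketch below uses GPT-2 through Hugging Face’s Transformers library as a stand-in. The training file name and hyperparameters are assumptions for illustration only.

```python
# Hedged fine-tuning sketch with GPT-2 standing in for ChatGPT.
# Assumes variant_descriptions.txt holds one description per line.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("text", data_files={"train": "variant_descriptions.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="variant-gpt2", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the fine-tuned model can then be prompted as described above
```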

 

To train a language model like ChatGPT to extract information about specific genes or diseases from research papers, you would need to follow these steps:

Gather a large dataset of research papers that contain information about the specific genes or diseases you are interested in. This dataset should be diverse and representative of the types of papers you want the model to be able to extract information from.

Preprocess the text data in the research papers by tokenizing the text and creating a vocabulary. You may also want to consider lemmatizing or stemming the text to reduce the dimensionality of the dataset.

Train the language model on the preprocessed text data. You may want to fine-tune a pre-trained model such as ChatGPT on your specific dataset, or you can train a new model from scratch.
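
A minimal sketch of the preprocessing described in the second step might look like the following; the two “papers” are invented stand-ins, and a production pipeline would use a proper NLP library with lemmatization or stemming.

```python
# Toy preprocessing: lowercase, tokenize with a regex, build a vocabulary.
import re
from collections import Counter

papers = [
    "BRCA1 variants are associated with hereditary breast cancer.",
    "Somatic TP53 mutations occur in over half of human tumors.",
]

tokens = [tok for paper in papers
          for tok in re.findall(r"[a-z0-9]+", paper.lower())]
vocabulary = Counter(tokens)  # token -> corpus frequency
print(vocabulary.most_common(5))
```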

ChatGPT could also be useful for sequence analysis

A few examples of sequence analysis tasks for which ChatGPT could be useful include:

  1. Predicting protein structure
  2. Identifying functional regions of a protein
  3. Predicting protein-protein interactions
  4. Identifying protein homologs
  5. Generating protein alignments

All this could be done without having access to UNIX servers or proprietary software, or knowing GCG coding.

ChatGPT in biomedical research
There are several potential ways that ChatGPT or other natural language processing (NLP) models could be applied in biomedical research:

Text summarization: ChatGPT or other NLP models could be used to summarize large amounts of text, such as research papers or clinical notes, in order to extract key information and insights more quickly.

Data extraction: ChatGPT or other NLP models could be used to extract structured data from unstructured text sources, such as research papers or clinical notes. For example, the model could be trained to extract information about specific genes or diseases from research papers, and then used to create a database of this information for further analysis.
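
As a toy stand-in for this idea (a real pipeline would use a trained biomedical named-entity-recognition model rather than a regular expression), candidate gene symbols can be pulled from abstract text as shown below; the abstract sentence is invented.

```python
# Crude gene-symbol extraction: HGNC-style symbols are mostly short
# runs of uppercase letters and digits. Expect false positives.
import re

abstract = ("We found that TP53 and KRAS mutations co-occur with "
            "elevated MYC expression in pancreatic tumors.")

candidate_genes = sorted(set(re.findall(r"\b[A-Z][A-Z0-9]{1,5}\b", abstract)))
print(candidate_genes)  # ['KRAS', 'MYC', 'TP53']
```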

Literature review: ChatGPT or other NLP models could be used to assist with literature review tasks, such as identifying relevant papers, extracting key information from papers, or summarizing the main findings of a group of papers.

Predictive modeling: ChatGPT or other NLP models could be used to build predictive models based on large amounts of text data, such as electronic health records or research papers. For example, the model could be trained to predict the likelihood of a patient developing a particular disease based on their medical history and other factors.

It’s worth noting that while NLP models like ChatGPT have the potential to be useful tools in biomedical research, they are only as good as the data they are trained on, and it is important to carefully evaluate the quality and reliability of any results generated by these models.

 

ChatGPT in text mining of biomedical data
ChatGPT could potentially be used for text mining in the biomedical field in a number of ways. Here are a few examples:

Extracting information from scientific papers: ChatGPT could be trained on a large dataset of scientific papers in the biomedical field, and then used to extract specific pieces of information from these papers, such as the names of compounds, their structures, and their potential uses.

Generating summaries of scientific papers: ChatGPT could be used to generate concise summaries of scientific papers in the biomedical field, highlighting the main findings and implications of the research.

Identifying trends and patterns in scientific literature: ChatGPT could be used to analyze large datasets of scientific papers in the biomedical field and identify trends and patterns in the data, such as emerging areas of research or common themes among different papers.

Generating questions for further research: ChatGPT could be used to suggest questions for further research in the biomedical field based on existing scientific literature, by identifying gaps in current knowledge or areas where further investigation is needed.

Generating hypotheses for scientific experiments: ChatGPT could be used to generate hypotheses for scientific experiments in the biomedical field based on existing scientific literature and data, by identifying potential relationships or associations that could be tested in future research.

 

PLEASE WATCH VIDEO

 

In this video, a bioinformatician describes the ways he uses ChatGPT to increase his productivity in writing bioinformatic code and conducting bioinformatic analyses.

He describes a series of uses of ChatGPT in his day-to-day work as a bioinformatician:

  1. Using ChatGPT as a search engine: He finds more useful and relevant search results than a standard Google or Yahoo search.  This saves time, as one does not have to pore through multiple pages to find information.  However, a caveat is that ChatGPT does NOT return sources, as highlighted in previous postings on this page.  This feature of ChatGPT is probably why Microsoft invested heavily in OpenAI in order to incorporate ChatGPT into its Bing search engine, as well as Office Suite programs

 

  2. ChatGPT to help with coding projects: Bioinformaticians will spend multiple hours searching for and altering openly available code in order to run certain functions, like determining the G/C content of DNA (although much UNIX-based code has already been established for these purposes). One can use ChatGPT to find such code and then assist in debugging it for any flaws (see the sketch after this list)

 

  3. ChatGPT to document and add coding comments: When writing code, it is useful to add comments periodically to help other users determine how the code works and how the program flows, including returned variables
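
For reference, the G/C-content task mentioned in item 2 is the kind of small, well-commented utility ChatGPT can draft on request; this version is a sketch, not the code from the video.

```python
def gc_content(sequence: str) -> float:
    """Return the G/C fraction of a DNA sequence (0.0 to 1.0)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq)

print(gc_content("ATGCGCGTTA"))  # 0.5
```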

 

One of the comments was interesting and directed one to use BioGPT instead of ChatGPT:

 

@tzvi7989

1 month ago (edited)

0:54 oh dear. You cannot use chatgpt like that in Bioinformatics as it is rn without double checking the info from it. You should be using biogpt instead for paper summarisation. ChatGPT goes for human-like responses over precise information recal. It is quite good for debugging though and automating boring awkward scripts

So what is BIOGPT?

BioGPT https://github.com/microsoft/BioGPT

 

The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.

The abstract from the paper is the following:

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.

Tips:

  • BioGPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
  • BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
  • The model can take the past_key_values (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.

This model was contributed by kamalkraj. The original code can be found here.
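
Generation with the published checkpoint takes only a few lines with the Transformers library (version 4.25 or later); the prompt and generation settings below are illustrative.

```python
# Minimal BioGPT generation sketch using the Hugging Face checkpoint.
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("Bicalutamide is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50,
                         num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```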

 

This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a GitHub repository developed by Microsoft Research and released under the MIT license. It is implemented in Python.

License

BioGPT is MIT-licensed. The license applies to the pre-trained models as well.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

As of right now this does not appear to be fully open access; a sign-up is required.

We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both through URL download as well as through the Hugging Face 🤗 Hub.

The available checkpoints (each downloadable via URL and/or the 🤗 Hub):

  • BioGPT: pre-trained BioGPT model checkpoint
  • BioGPT-Large: pre-trained BioGPT-Large model checkpoint
  • BioGPT-QA-PubMedQA-BioGPT: BioGPT fine-tuned for question answering on PubMedQA
  • BioGPT-QA-PubMedQA-BioGPT-Large: BioGPT-Large fine-tuned for question answering on PubMedQA
  • BioGPT-RE-BC5CDR: BioGPT fine-tuned for relation extraction on BC5CDR
  • BioGPT-RE-DDI: BioGPT fine-tuned for relation extraction on DDI
  • BioGPT-RE-DTI: BioGPT fine-tuned for relation extraction on KD-DTI
  • BioGPT-DC-HoC: BioGPT fine-tuned for document classification on HoC

 

As of this date (6/12/2023), there are three papers referencing BioGPT on PubMed (abstracts and links below).

 

BioGPT: generative pre-trained transformer for biomedical text generation and mining.

Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu TY. Brief Bioinform. 2022 Nov 19;23(6):bbac409. doi: 10.1093/bib/bbac409. PMID: 36156661

We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. …Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fl …

Here they show the advantages of BioGPT over the general-domain GPT-2 in examples of biomedical text generation and question answering:

Table 7. Examples of generated text

Input: Bicalutamide

GPT-2: Bicalutamide, a synthetic bicalutamide derivative, was found to increase the cellular proliferation of C. elegans cells. This effect was not observed with the other synthetic bicalutamide derivatives. This is in line with our previous work which showed that bicalutamide increases the proliferation of C. elegans cells but not in the presence of synthetic bicalutamide.

BioGPT: Bicalutamide (Casodex) is an androgen receptor (AR) antagonist approved for the treatment of metastatic castration-resistant prostate cancer (mCRPC) in patients who have progressed on or are ineligible for docetaxel chemotherapy, as well as for the treatment of early-stage prostate cancer in men who have not progressed on or are ineligible for docetaxel chemotherapy, as well as for the treatment of metastatic castration-sensitive prostate cancer (mCSPC) in men who have not progressed on or are ineligible for docetaxel chemotherapy.

Input: Janus kinase 3 (JAK-3)

GPT-2: Janus kinase 3 (JAK-3) mediates the conversion of glucose to glycogen in skeletal muscle. The increase in glucose uptake by skeletal muscle is believed to be associated with a reduction in muscle protein breakdown and with the release of ketone bodies, which contribute to the metabolism of glucose.

BioGPT: Janus kinase 3 (JAK-3) is a member of the Janus kinase (JAK) family of non-receptor tyrosine kinases and plays an important role in the regulation of cell proliferation, differentiation, survival, migration and angiogenesis.

Input: Apricitabine

GPT-2: Apricitabine, duloxetine and citalopram) were investigated in patients with a history of treatment-resistant depression (n = 10). We found that treatment with the three drugs significantly decreased depressive symptoms, but not depressive symptoms over the next 12 weeks. We conclude that antidepressant drugs that reduce depressive symptoms and reduce depressive symptoms over time may be beneficial in patients with recurrent depression.

BioGPT: Apricitabine is an oral prodrug of 5-aza-2 ’-deoxycytidine (5-aza-CdR), a DNA methyltransferase (DNMT) inhibitor, which has been approved by the US Food and Drug Administration (FDA) for the treatment of myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) in combination with low-dose cytarabine (Ara-C) and granulocyte colony-stimulating factor (G-CSF) for patients with intermediate-2 or high-risk MDS or AML.

Input: Xylazine

GPT-2: Xylazine, the ‘active ingredient’ in ‘bath salts’ (see above) has been linked to numerous deaths, including those of former University of California, Berkeley, professor Andrew Anglin, who posted a video of himself having a bath filled with it.

BioGPT: Xylazine is an alpha 2-adrenoceptor agonist which has been used as a sedative and analgesic in veterinary medicine for many years, but its effects on the cardiovascular system have not been extensively studied in the dog, and its effects on the central nervous system (CNS) have not been well characterized in the dog, despite the fact that xylazine has been widely used as a sedative and analgesic in veterinary medicine for more than 30 years.

Input: Psoralen

GPT-2: Psoralen, the city of the spiritualist, and had been given to him by God, who wished to show him the way of salvation, and to draw him back to the true life of Christ, and to bring him into communion with his fellow-man.

BioGPT: Psoralen photochemotherapy (PUVA) is a well-established treatment for psoriasis, but its use is limited by the risk of skin cancer, particularly squamous cell carcinoma (SCC) of the head and neck (H & N), which is the most common site of skin cancer in the United States (US)

Input: CP-673451

GPT-2: CP-673451) and 4-5-6-3-1-2-3-4-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50…

BioGPT: CP-673451 is a potent, selective, and orally active inhibitor of human neutrophil elastase (HNE) and human cathepsin G (CatG) with in vitro and in vivo anti-inflammatory activity in a variety of animal models of inflammation and in a model of acute lung injury (ALI) in the rat induced by intratracheal instillation of lipopolysaccharide (LPS) and tumor necrosis factor-alpha (TNF-alpha), a model of acute lung injury (ALI) in which neutrophils play an important role.

Input: BIIB-021

GPT-2: BIIB-021A-2-1.pdf. Wright, C. (2002). ‘The War on Terror: A Legal Theory of Terrorist Attacks,’ in L.E.R. Wright, S.R. Warrick and J.H. Rothman (Eds.), The War on Terror: Essays in Honor of Michael T. Klare (New York: Oxford University Press), 143-57.

BioGPT: BIIB-021 is a novel, orally active, non-peptide bradykinin B2 receptor antagonist with potent and long-lasting anti-inflammatory activity in animal models of acute and chronic inflammation and in a rat model of adjuvant-induced arthritis (AIA), an animal model of rheumatoid arthritis (RA) and in a rat model of collagen-induced arthritis (CIA), an animal model of collagen-induced arthritis (CIA), in which arthritis is induced by immunization with bovine type II collagen (CII).

Note how BioGPT is more descriptive and accurate!

EGFI: drug-drug interaction extraction and generation with fusion of enriched entity and sentence information.

Huang L, Lin J, Li X, Song L, Zheng Z, Wong KC. Brief Bioinform. 2022 Jan 17;23(1):bbab451. doi: 10.1093/bib/bbab451. PMID: 34791012

The rapid growth in literature accumulates diverse and yet comprehensive biomedical knowledge hidden to be mined such as drug interactions. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT which has been comprehensively pretrained on biomedical corpus. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model BioGPT-2 where the generation sentences are selected based on filtering rules.

Results: We evaluated the classification part on ‘DDIs 2013’ dataset and ‘DTIs’ dataset, achieving the F1 scores of 0.842 and 0.720 respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified with the existing ground truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships.

Availability: Source code are publicly available at https://github.com/Layne-Huang/EGFI.

 

GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information.

Jin Q, Yang Y, Chen Q, Lu Z. ArXiv. 2023 May 16:arXiv:2304.09667v3. Preprint. PMID: 37131884. Free PMC article.

While large language models (LLMs) have been successfully applied to various tasks, they still face challenges with hallucinations. Augmenting LLMs with domain-specific tools such as database utilities can facilitate easier and more precise access to specialized knowledge. In this paper, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, largely surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: (1) API demonstrations have good cross-task generalizability and are more useful than documentations for in-context learning; (2) GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; (3) Different types of errors are enriched in different tasks, providing valuable insights for future improvements.

PLEASE WATCH THE FOLLOWING VIDEOS ON BIOGPT

This one entitled

Microsoft’s BioGPT Shows Promise as the Best Biomedical NLP

 

gives a good general description of this new Microsoft project (MIT-licensed) and its usefulness in scanning 15 million abstracts on PubMed while returning ChatGPT-like answers.

 

Please note one of the comments which is VERY IMPORTANT


@rufus9322

2 months ago

bioGPT is difficult for non-developers to use, and Microsoft researchers seem to default that all users are proficient in Python and ML.

 

Much like Microsoft Azure, it seems BioGPT is meant for developers with advanced programming skills.  It seems odd, then, to be paying programmers high salaries when one or two key opinion leaders from the medical field might suffice, but Microsoft will likely figure this out.

 

ALSO VIEW VIDEO

 

 

This is a talk from Microsoft on BioGPT

 

Other Relevant Articles on Natural Language Processing in BioInformatics, Healthcare and ChatGPT for Medicine on this Open Access Scientific Journal Include

Medicine with GPT-4 & ChatGPT
Explanation on “Results of Medical Text Analysis with Natural Language Processing (NLP) presented in LPBI Group’s NEW GENRE Edition: NLP” on Genomics content, standalone volume in Series B and NLP on Cancer content as Part B New Genre Volume 1 in Series C

Proposal for New e-Book Architecture: Bi-Lingual eTOCs, English & Spanish with NLP and Deep Learning results of Medical Text Analysis – Phase 1: six volumes

From High-Throughput Assay to Systems Biology: New Tools for Drug Discovery

Machine Learning (ML) in cancer prognosis prediction helps the researcher to identify multiple known as well as candidate cancer diver genes

 

20 articles in Natural Language Processing

142 articles in BioIT: BioInformatics

111 articles in BioIT: BioInformatics, NGS, Clinical & Translational, Pharmaceutical R&D Informatics, Clinical Genomics, Cancer Informatics

 


Live Notes, Real Time Conference Coverage 2020 AACR Virtual Meeting April 27, 2020 Minisymposium on AACR Project Genie & Bioinformatics 4:00 PM – 6:00 PM

SESSION VMS.MD01.01 – Advancing Cancer Research through an International Cancer Registry: AACR Project GENIE Use Cases
 
Reporter: Stephen J. Williams, PhD

April 27, 2020, 4:00 PM – 6:00 PM
Virtual Meeting: All Session Times Are U.S. EDT

Session Type: Virtual Minisymposium
Track(s): Bioinformatics and Systems Biology
17 Presentations
4:00 PM – 6:00 PM
– Chairperson Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY

4:00 PM – 4:01 PM
– Introduction Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY

Precision medicine requires an end-to-end learning healthcare system, wherein the treatment decisions for patients are informed by the prior experiences of similar patients. Oncology is currently leading the way in precision medicine because the genomic and other molecular characteristics of patients and their tumors are routinely collected at scale. A major challenge to realizing the promise of precision medicine is that no single institution is able to sequence and treat sufficient numbers of patients to improve clinical-decision making independently. To overcome this challenge, the AACR launched Project GENIE (Genomics Evidence Neoplasia Information Exchange).

AACR Project GENIE is a publicly accessible international cancer registry of real-world data assembled through data sharing between 19 of the leading cancer centers in the world. Through the efforts of strategic partners Sage Bionetworks (https://sagebionetworks.org) and cBioPortal (www.cbioportal.org), the registry aggregates, harmonizes, and links clinical-grade, next-generation cancer genomic sequencing data with clinical outcomes obtained during routine medical practice from cancer patients treated at these institutions. The consortium and its activities are driven by openness, transparency, and inclusion, ensuring that the project output remains accessible to the global cancer research community for the benefit of all patients.

AACR Project GENIE fulfills an unmet need in oncology by providing the statistical power necessary to improve clinical decision-making, particularly in the case of rare cancers and rare variants in common cancers. Additionally, the registry can power novel clinical and translational research.

Because we collect data from nearly every patient sequenced at participating institutions and have committed to sharing only clinical-grade data, the GENIE registry contains enough high-quality data to power decision making on rare cancers or rare variants in common cancers. We see the GENIE data providing another knowledge turn in the virtuous cycle of research, accelerating the pace of drug discovery, improving the clinical trial design, and ultimately benefiting cancer patients globally.

 

The first set of cancer genomic data aggregated through AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) was available to the global community in January 2017.  The seventh data set, GENIE 7.0-public, was released in January 2020 adding more than 9,000 records to the database. The combined data set now includes nearly 80,000 de-identified genomic records collected from patients who were treated at each of the consortium’s participating institutions, making it among the largest fully public cancer genomic data sets released to date.  These data will be released to the public every six months. The public release of the eighth data set, GENIE 8.0-public, will take place in July 2020.

The combined data set now includes data for over 80 major cancer types, including data from greater than 12,500 patients with lung cancer, nearly 11,000 patients with breast cancer, and nearly 8,000 patients with colorectal cancer.

For more details about the data, analyses, and summaries of the data attributes from this release, GENIE 7.0-public, consult the data guide.

Users can access the data directly via cbioportal, or download the data directly from Sage Bionetworks. Users will need to create an account for either site and agree to the terms of access.

For frequently asked questions, visit our FAQ page.

  • In fall of 2019, AACR announced the Bio Collaborative, which collected pan-cancer data in conjunction with, and with the collaboration and support of, a host of big pharma and biotech companies
  • They have a goal to expand to more than 6 cancer types and more than 50,000 records, including smoking habits, lifestyle data, etc.
  • They have started with NSCLC and have done mutational analyses on these
  • Included is tumor mutational burden; using cBioPortal, users are able to explore the genomic data even further
  • Treatment data is included as well
  • They need to collect highly CURATED data with a PRISM backbone to get more than outcome data, such as progression data
  • They might look to incorporate digital pathology, but they are not there yet; this will need good artificial intelligence systems

 

4:01 PM – 4:15 PM
– Invited Speaker Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY

4:15 PM – 4:20 PM
– Discussion

4:20 PM – 4:30 PM
1092 – A systematic analysis of BRAF mutations and their sensitivity to different BRAF inhibitors: Zohar Barbash, Dikla Haham, Liat Hafzadi, Ron Zipor, Shaul Barth, Arie Aizenman, Lior Zimmerman, Gabi Tarcic. Novellusdx, Jerusalem, Israel

Abstract: The MAPK-ERK signaling cascade is among the most frequently mutated pathways in human cancer, with the BRAF V600 mutation being the most common alteration. FDA-approved BRAF inhibitors as well as combination therapies of BRAF and MEK inhibitors are available and provide survival benefits to patients with a BRAF V600 mutation in several indications. Yet non-V600 BRAF mutations are found in many cancers and are even more prevalent than V600 mutations in certain tumor types. As the use of NGS profiling in precision oncology is becoming more common, novel alterations in BRAF are being uncovered. This has led to the classification of BRAF mutations, which is dependent on their biochemical properties and affects their sensitivity to inhibitors. Therefore, annotation of these novel variants is crucial for assigning correct treatment. Using a high throughput method for functional annotation of MAPK activity, we profiled 151 different BRAF mutations identified in the AACR Project GENIE dataset, and their response to 4 different BRAF inhibitors: vemurafenib and 3 different exploratory 2nd generation inhibitors. The system is based on rapid synthesis of the mutations and expression of the mutated protein together with fluorescently labeled reporters in a cell-based assay. Our results show that from the 151 different BRAF mutations, ~25% were found to activate the MAPK pathway. All of the class 1 and 2 mutations tested were found to be active, providing positive validation for the method. Additionally, many novel activating mutations were identified, some outside of the known domains. When testing the response of the active mutations to different classes of BRAF inhibitors, we show that while vemurafenib efficiently inhibited V600 mutations, other types of mutations and specifically BRAF fusions were not inhibited by this drug. Alternatively, the second-generation experimental inhibitors were effective against both V600 as well as non-V600 mutations. Using this large-scale approach to characterize BRAF mutations, we were able to functionally annotate the largest number of BRAF mutations to date. Our results show that the number of activating variants is large and that they possess differential sensitivity to different types of direct inhibitors. This data can serve as a basis for rational drug design as well as more accurate treatment options for patients.

  • Molecular profiling is becoming imperative for successful targeted therapies
  • There are 500 unique mutations in BRAF, so a bioinformatic pipeline is needed; start with NGS panels, then cluster according to different subtypes or class-specific patterns
  • Certain mutations, like V600E, have distinct clustering in tumor types
  • 25% of mutations occur with other mutations; mutations may not be functional; they used a high-throughput system to analyze other V600 BRAF mutations to determine if they are functional
  • Active yet uncharacterized BRAF mutations are seen in a major proportion of human tumors
  • Using genomic drug data, they found that many inhibitors like vemurafenib are specific to a particular mutation, but other inhibitors that are not specific to a cleft can inhibit other BRAF mutants
  • 40% of 135 mutants were functionally active
  • Use of functional profiling instead of just genomic profiling
  • Q&A: they have already used this platform and analysis for RTKs and other genes successfully
  • Q&A: how do you deal with co-occurring mutations? The platform is able to analyze RTKs plus signaling proteins

4:30 PM – 4:35 PM
– Discussion

4:35 PM – 4:45 PM
1093 – Calibration Tool for Genomic Aggregates (CTGA): A deep learning framework for calibrating somatic mutation profiling data from conventional gene panel data. Jordan Anaya, Craig Cummings, Jocelyn Lee, Alexander Baras. Johns Hopkins Sidney Kimmel Comprehensive Cancer Center, MD, Genentech, Inc., CA, AACR, Philadelphia, PA

Abstract: It has been suggested that aggregate genomic measures such as mutational burden can be associated with response to immunotherapy. Arguably, the gold standard for deriving such aggregate genomic measures (AGMs) would be from exome level sequencing. While many clinical trials run exome level sequencing, the vast majority of routine genomic testing performed today, as seen in AACR Project GENIE, is targeted / gene-panel based sequencing.
Despite the smaller size of these gene panels focused on clinically targetable alterations, it has been shown they can estimate, to some degree, exomic mutational burden; usually by normalizing mutation count by the relevant size of the panels. These smaller gene panels exhibit significant variability both in terms of accuracy relative to exomic measures and in comparison to other gene panels. While many genes are common to the panels in AACR Project GENIE, hundreds are not. These differences in extent of coverage and genomic loci examined can result in biases that may negatively impact panel to panel comparability.
To address these issues we developed a deep learning framework to model exomic AGMs, such as mutational burden, from gene panel data as seen in AACR Project GENIE. This framework can leverage any available sample and variant level information, in which variants are featurized to effectively re-weight their importance when estimating a given AGM, such as mutational burden, through the use of multiple instance learning techniques in this form of weakly supervised data.
Using TCGA data in conjunction with AACR Project GENIE gene panel definitions, as a proof of concept, we first applied this framework to learn expected variant features such as codons and genomic position from mutational data (greater than 99.9% accuracy observed). Having established the validity of the approach, we then applied this framework to somatic mutation profiling data in which we show that data from gene panels can be calibrated to exomic TMB and thereby improve panel to panel compatibility. We observed approximately 25% improvements in mean squared error and R-squared metrics when using our framework over conventional approaches to estimate TMB from gene panel data across the 9 tumors types examined (spanning melanoma, lung cancer, colon cancer, and others). This work highlights the application of sophisticated machine learning approaches towards the development of needed calibration techniques across seemingly disparate gene panel assays used clinically today.
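
For orientation, the “conventional approach” the abstract contrasts itself with (normalizing mutation count by the panel’s footprint) fits in a few lines; the numbers below are invented.

```python
# Back-of-envelope panel TMB: somatic mutation count per megabase
# of sequenced panel territory. Values are hypothetical.
panel_size_bp = 1_200_000        # targeted panel footprint
nonsynonymous_mutations = 14     # somatic calls passing filters

tmb = nonsynonymous_mutations / (panel_size_bp / 1_000_000)
print(f"Panel TMB ~ {tmb:.1f} mutations/Mb")
```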

 

4:45 PM – 4:50 PM
– Discussion

4:50 PM – 5:00 PM
1094 – Genetic determinants of EGFR-driven lung cancer growth and therapeutic response in vivo. Giorgia Foggetti, Chuan Li, Hongchen Cai, Wen-Yang Lin, Deborah Ayeni, Katherine Hastings, Laura Andrejka, Dylan Maghini, Robert Homer, Dmitri A. Petrov, Monte M. Winslow, Katerina Politi. Yale School of Medicine, New Haven, CT, Stanford University School of Medicine, Stanford, CA, Stanford University School of Medicine, Stanford, CA, Yale School of Medicine, New Haven, CT, Stanford University School of Medicine, Stanford, CA, Yale School of Medicine, New Haven, CT

5:00 PM – 5:05 PM
– Discussion

5:05 PM – 5:15 PM
1095 – Comprehensive pan-cancer analyses of RAS genomic diversity. Robert Scharpf, Gregory Riely, Mark Awad, Michele Lenoue-Newton, Biagio Ricciuti, Julia Rudolph, Leon Raskin, Andrew Park, Jocelyn Lee, Christine Lovly, Valsamo Anagnostou. Johns Hopkins Sidney Kimmel Comprehensive Cancer Center, Baltimore, MD, Memorial Sloan Kettering Cancer Center, New York, NY, Dana-Farber Cancer Institute, Boston, MA, Vanderbilt-Ingram Cancer Center, Nashville, TN, Amgen, Inc., Thousand Oaks, CA, AACR, Philadelphia, PA

5:15 PM – 5:20 PM
– Discussion

5:20 PM – 5:30 PM
1096 – Harmonization standards from the Variant Interpretation for Cancer Consortium. Alex H. Wagner, Reece K. Hart, Larry Babb, Robert R. Freimuth, Adam Coffman, Yonghao Liang, Beth Pitel, Angshumoy Roy, Matthew Brush, Jennifer Lee, Anna Lu, Thomas Coard, Shruti Rao, Deborah Ritter, Brian Walsh, Susan Mockus, Peter Horak, Ian King, Dmitriy Sonkin, Subha Madhavan, Gordana Raca, Debyani Chakravarty, Malachi Griffith, Obi L. Griffith. Washington University School of Medicine, Saint Louis, MO, Reece Hart Consulting, CA, Broad Institute, Boston, MA, Mayo Clinic, Rochester, MN, Washington University School of Medicine, Saint Louis, MO, Washington University School of Medicine, Saint Louis, MO, Baylor College of Medicine, Houston, TX, Oregon Health and Science University, Portland, OR, National Cancer Institute, Bethesda, MD, Georgetown University, Washington, DC, The Jackson Laboratory for Genomic Medicine, Farmington, CT, National Center for Tumor Diseases, Heidelberg, Germany, University of Toronto, Toronto, ON, Canada, University of Southern California, Los Angeles, CA, Memorial Sloan Kettering Cancer Center, New York, NY

Abstract: The use of clinical gene sequencing is now commonplace, and genome analysts and molecular pathologists are often tasked with the labor-intensive process of interpreting the clinical significance of large numbers of tumor variants. Numerous independent knowledge bases have been constructed to alleviate this manual burden, however these knowledgebases are non-interoperable. As a result, the analyst is left with a difficult tradeoff: for each knowledgebase used the analyst must understand the nuances particular to that resource and integrate its evidence accordingly when generating the clinical report, but for each knowledgebase omitted there is increased potential for missed findings of clinical significance. The Variant Interpretation for Cancer Consortium (VICC; cancervariants.org) was formed as a driver project of the Global Alliance for Genomics and Health (GA4GH; ga4gh.org) to address this concern. VICC members include representatives from several major somatic interpretation knowledgebases including CIViC, OncoKB, Jax-CKB, the Weill Cornell PMKB, the IRB-Barcelona Cancer Biomarkers Database, and others. Previously, the VICC built and reported on a harmonized meta-knowledgebase of 19,551 biomarker associations of harmonized variants, diseases, drugs, and evidence across the constituent resources. In that study, we analyzed the frequency with which the tumor samples from the AACR Project GENIE cohort would match to harmonized associations. Variant matches increased dramatically from 57% to 86% when broader matching to regions describing categorical variants were allowed. Unlike precise sequence variants with specified alternate alleles, categorical variants describe a collection of potential variants with a common feature, such as “V600” (non-valine alleles at the 600 residue), “Exon 20 mutations” (all non-silent mutations in exon 20), or “Gain-of-function” (hypermorphic alterations that activate or amplify gene activity). However, matching observed sequence variants to categorical variants is challenging, as the latter are typically only described as unstructured text. Here we describe the expressive and computational GA4GH Variation Representation specification (vr-spec.readthedocs.io), which we co-developed as members of the GA4GH Genomic Knowledge Standards work stream. This specification provides a schema for common, precise forms of variation (e.g. SNVs and Indels) and the method for computing identifiers from these objects. We highlight key aspects of the specification and our work to apply it to the characterization of categorical variation, showcasing the variant terminology and classification tools developed by the VICC to support this effort. These standards and tools are free, open-source, and extensible, overcoming barriers to standardized variant knowledge sharing and search.

https://cancervariants.org/
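
To illustrate the categorical-variant matching problem the abstract describes (this toy rule is an illustration only, not the GA4GH Variation Representation specification itself):

```python
# Match observed protein changes against the categorical rule
# "BRAF V600" = any non-valine substitution at BRAF residue 600.
import re

def matches_braf_v600(variant: str) -> bool:
    m = re.fullmatch(r"BRAF V600([A-Z])", variant)
    return bool(m) and m.group(1) != "V"

for v in ["BRAF V600E", "BRAF V600K", "BRAF K601E"]:
    print(v, "->", matches_braf_v600(v))  # True, True, False
```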

  • They store information from different databases by curating, classifying, and then harmonizing it into values
  • They harmonize each variant across the knowledgebases, at any level of evidence
  • 29% of patient variants matched when compared across many knowledgebase databases, versus only 13% when using individual databases
  • They are also trying to curate the database so a variant will have one code instead of various RefSeq codes or protein codes
  • VICC is an open consortium

 

 

5:30 PM – 5:35 PM
– Discussion

5:35 PM – 5:45 PM
1097 – FGFR2 in-frame indels: A novel targetable alteration in intrahepatic cholangiocarcinoma. Yvonne Y. Li, James M. Cleary, Srivatsan Raghavan, Liam F. Spurr, Qibiao Wu, Lei Shi, Lauren K. Brais, Maureen Loftus, Lipika Goyal, Anuj K. Patel, Atul B. Shinagare, Thomas E. Clancy, Geoffrey Shapiro, Ethan Cerami, William R. Sellers, William C. Hahn, Matthew Meyerson, Nabeel Bardeesy, Andrew D. Cherniack, Brian M. Wolpin. Dana-Farber Cancer Institute, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Massachusetts General Hospital, Boston, MA, Brigham and Women’s Hospital, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Broad Institute of MIT and Harvard, Cambridge, MA, Massachusetts General Hospital, Boston, MA

5:45 PM – 5:50 PM
– Discussion

5:50 PM – 6:00 PM
– Closing Remarks. Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY

 

Follow on Twitter at:

@pharma_BI

@AACR

@CureCancerNow

@pharmanews

@BiotechWorld

@HopkinsMedicine

#AACR20

#AACR2020

#curecancernow

#pharmanews


Bioinformatic Tools for RNASeq: A Curation

Curator: Stephen J. Williams, Ph.D. 

 

 

Note:  This will be an ongoing curation as new information and tools become available.

RNASeq is a powerful tool for the analysis of transcriptome profiles and has been used to determine the transcriptional changes occurring upon stimuli such as drug treatment, or to detect transcript differences between biological sample cohorts, such as tumor versus normal tissue.  Unlike its genomic companions, whole genome and whole exome sequencing, which analyze the primary sequence of the genomic DNA, RNASeq analyzes the mRNA transcripts, thereby more closely resembling the ultimately translated proteome. In addition, RNASeq and transcriptome profiling can determine whether splicing variants occur, as well as detect non-exomic species such as miRNA and lncRNA, all of which have shown pertinence in the etiology of many diseases, including cancer.

However, RNASeq, like other omic technologies, generates enormous data sets, which require multiple types of bioinformatic tools in order to correctly analyze the sequence reads, and to visualize and interpret the output data.  This post represents a curation by the RNA-Seq blog of such tools useful for RNASeq studies, and lists and reviews published literature using these curated tools.
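
As a small taste of the kind of processing these tools perform, the sketch below computes counts-per-million (CPM), one of the simplest RNASeq normalizations; the count matrix is invented.

```python
# Counts-per-million normalization of a toy read-count matrix.
import pandas as pd

counts = pd.DataFrame(
    {"tumor_1": [500, 20, 80], "normal_1": [250, 40, 10]},
    index=["GeneA", "GeneB", "GeneC"])

cpm = counts / counts.sum(axis=0) * 1_000_000  # scale each sample
print(cpm.round(1))
```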

 

From the RNA-Seq Blog

List of RNA-Seq bioinformatics tools

Posted by: RNA-Seq Blog in Data Analysis, Web Tools, September 16, 2015

from: https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools

A review of some of the literature using some of the aforementioned curated tools is presented below:

 

A.   Tools Useful for Single Cell RNA-Seq Analysis

 

B.  Tools for RNA-Seq Analysis of the Spliceosome

 

C.  Tools Useful for RNA-Seq read assembly visualization

 

Other articles on RNA and Transcriptomics in this Open Access Journal Include:

NIH to Award Up to $12M to Fund DNA, RNA Sequencing Research: single-cell genomics, sample preparation, transcriptomics and epigenomics, and genome-wide functional analysis.

Single-cell Genomics: Directions in Computational and Systems Biology – Contributions of Prof. Aviv Regev @Broad Institute of MIT and Harvard, Cochair, the Human Cell Atlas Organizing Committee with Sarah Teichmann of the Wellcome Trust Sanger Institute

Complex rearrangements and oncogene amplification revealed by long-read DNA and RNA sequencing of a breast cancer cell line

Single-cell RNA-seq helps in finding intra-tumoral heterogeneity in pancreatic cancer

First challenge to make use of the new NCI Cloud Pilots – Somatic Mutation Challenge – RNA: Best algorithms for detecting all of the abnormal RNA molecules in a cancer cell

Evolution of the Human Cell Genome Biology Field of Gene Expression, Gene Regulation, Gene Regulatory Networks and Application of Machine Learning Algorithms in Large-Scale Biological Data Analysis

 


BioInformatic Resources at the Environmental Protection Agency: Tools and Webinars on Toxicity Prediction, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)


Curator: Stephen J. Williams, Ph.D.

New GenRA Module in EPA’s CompTox Dashboard Will Help Predict Potential Chemical Toxicity

Published September 25, 2018

As part of its ongoing computational toxicology research, EPA is developing faster and improved approaches to evaluate chemicals for potential health effects.  One commonly applied approach is known as chemical read-across. Read-across uses information about how a chemical with known data behaves to make a prediction about the behavior of another chemical that is “similar” but does not have as much data. Current read-across, while cost-effective, relies on a subjective assessment, which leads to varying predictions and justifications depending on who undertakes and evaluates the assessment.

To reduce uncertainties and develop a more objective approach, EPA researchers have developed an automated read-across tool called Generalized Read-Across (GenRA), and added it to the newest version of the EPA Computational Toxicology Dashboard. The goal of GenRA is to encode as many expert considerations used within current read-across approaches as possible and combine these with data-driven approaches to transition read-across towards a more systematic and data-based method of making predictions.

EPA chemist Dr. Grace Patlewicz says it was this uncertainty that motivated the development of GenRA. “You don’t actually know if you’ve been successful at using read-across to help predict chemical toxicity because it’s a judgement call based on one person versus the next. That subjectivity is something we were trying to move away from.” Patlewicz says.

Since toxicologists and risk assessors are already familiar with read-across, EPA researchers saw value in creating a tool that was aligned with the current read-across workflow but which addressed uncertainty using data analysis methods, in what they call a “harmonized-hybrid workflow.”

In its current form, GenRA lets users find analogues, or chemicals that are similar to their target chemical, based on chemical structural similarity. The user can then select which analogues they want to carry forward into the GenRA prediction by exploring the consistency and concordance of the underlying experimental data for those analogues. Next, the tool predicts toxicity effects of specific repeated dose studies. Then, a plot with these outcomes is generated based on a similarity-weighted activity of the analogue chemicals the user selected. Finally, the user is presented with a data matrix view showing whether a chemical is predicted to be toxic (yes or no) for a chosen set of toxicity endpoints, with a quantitative measure of uncertainty.
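
The “similarity-weighted activity” idea at the heart of this workflow can be sketched in a few lines. The code below illustrates the concept only, not EPA’s GenRA implementation; the chemicals, SMILES strings, and toxicity outcomes are invented, and the rdkit package is assumed.

```python
# Similarity-weighted read-across sketch using RDKit fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Hypothetical analogues with known binary toxicity calls (1 = toxic).
analogues = {"CCO": 0, "CCCO": 0, "OCC(O)CO": 1}
target_smiles = "CCCCO"  # the data-poor target chemical

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

target_fp = fingerprint(target_smiles)
total_sim, weighted = 0.0, 0.0
for smiles, outcome in analogues.items():
    sim = DataStructs.TanimotoSimilarity(target_fp, fingerprint(smiles))
    total_sim += sim
    weighted += sim * outcome

print(f"Similarity-weighted toxicity score: {weighted / total_sim:.2f}")
```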

The team is also comparing chemicals based on other similarity contexts, such as physicochemical characteristics or metabolic similarity, as well as extending the approach to make quantitative predictions of toxicity.

Patlewicz thinks incorporating other contexts and similarity measures will refine GenRA to make better toxicity predictions, fulfilling the goal of creating a read-across method capable of assessing thousands of chemicals that currently lack toxicity data.

“That’s the direction that we’re going in,” Patlewicz says. “Recognizing where we are and trying to move towards something a little bit more objective, showing how aspects of the current read-across workflow could be refined.”

Learn more at: https://comptox.epa.gov

 

A listing of EPA Tools for Air Quality Assessment

Tools

  • Atmospheric Model Evaluation Tool (AMET)
    AMET helps in the evaluation of meteorological and air quality simulations.
  • Benchmark Dose Software (BMDS)
    EPA developed the Benchmark Dose Software (BMDS) as a tool to help estimate dose or exposure of a chemical or chemical mixture associated with a given response level. The methodology is used by EPA risk assessors and is fast becoming the world’s standard for dose-response analysis for risk assessments, including air pollution risk assessments. (A toy illustration of the benchmark-dose idea appears after this list.)
  • BenMAP
    BenMAP is a Windows-based computer program that uses a Geographic Information System (GIS)-based to estimate the health impacts and economic benefits occurring when populations experience changes in air quality.
  • Community-Focused Exposure and Risk Screening Tool (C-FERST)
    C-FERST is an online tool developed by EPA in collaboration with stakeholders to provide access to resources that can be used with communities to help identify and learn more about their environmental health issues and explore exposure and risk reduction options.
  • Community Health Vulnerability Index
    EPA scientists developed a Community Health Vulnerability Index that can be used to help identify communities at higher health risk from wildfire smoke. Breathing smoke from a nearby wildfire is a health threat, especially for people with lung or heart disease, diabetes and high blood pressure as well as older adults, and those living in communities with poverty, unemployment and other indicators of social stress. Health officials can use the tool, in combination with air quality models, to focus public health strategies on vulnerable populations living in areas where air quality is impaired, either by wildfire smoke or other sources of pollution. The work was published in Environmental Science & Technology.
  • Critical Loads Mapper Tool
    The Critical Loads Mapper Tool can be used to help protect terrestrial and aquatic ecosystems from atmospheric deposition of nitrogen and sulfur, two pollutants emitted from fossil fuel burning and agricultural emissions. The interactive tool provides easy access to information on deposition levels through time; critical loads, which identify thresholds when pollutants have reached harmful levels; and exceedances of these thresholds.
  • EnviroAtlas
    EnviroAtlas provides interactive tools and resources for exploring the benefits people receive from nature or “ecosystem goods and services”. Ecosystem goods and services are critically important to human health and well-being, but they are often overlooked due to lack of information. Using EnviroAtlas, many types of users can access, view, and analyze diverse information to better understand the potential impacts of various decisions.
  • EPA Air Sensor Toolbox for Citizen Scientists
    EPA’s Air Sensor Toolbox for Citizen Scientists provides information and guidance on new low-cost compact technologies for measuring air quality. Citizens are interested in learning more about local air quality where they live, work and play. EPA’s Toolbox includes information about: Sampling methodologies; Calibration and validation approaches; Measurement methods options; Data interpretation guidelines; Education and outreach; and Low cost sensor performance information.
  • ExpoFIRST
    The Exposure Factors Interactive Resource for Scenarios Tool (ExpoFIRST) brings data from EPA’s Exposure Factors Handbook: 2011 Edition (EFH) to an interactive tool that maximizes flexibility and transparency for exposure assessors. ExpoFIRST represents a significant advance for regional, state, and local scientists in performing and documenting calculations for community and site-specific exposure assessments, including air pollution exposure assessments.
  • EXPOsure toolbox (ExpoBox)
    This is a toolbox created to assist individuals from within government, industry, academia, and the general public with assessing exposure, including exposure to air contaminants, fate and transport processes of air pollutants and their potential exposure concentrations. It is a compendium of exposure assessment tools that links to guidance documents, databases, models, reference materials, and other related resources.
  • Federal Reference & Federal Equivalency Methods
    EPA scientists develop and evaluate Federal Reference Methods and Federal Equivalency Methods for accurately and reliably measuring six primary air pollutants in outdoor air. These methods are used by states and other organizations to assess implementation actions needed to attain National Ambient Air Quality Standards.
  • Fertilizer Emission Scenario Tool for CMAQ (FEST-C)
    FEST-C facilitates the definition and simulation of new cropland farm management system scenarios or editing of existing scenarios to drive Environmental Policy Integrated Climate model (EPIC) simulations.  For the standard 12km continental Community Multi-Scale Air Quality model (CMAQ) domain, this amounts to about 250,000 simulations for the U.S. alone. It also produces gridded daily EPIC weather input files from existing hourly Meteorology-Chemistry Interface Processor (MCIP) files, transforms EPIC output files to CMAQ-ready input files and links directly to Visual Environment for Rich Data Interpretation (VERDI) for spatial visualization of input and output files. The December 2012 release will perform all these functions for any CMAQ grid scale or domain.
  • Instruction Guide and Macro Analysis Tool for Community-led Air Monitoring 
    EPA has developed two tools for evaluating the performance of low-cost sensors and interpreting the data they collect to help citizen scientists, communities, and professionals learn about local air quality.
  • Integrated Climate and Land use Scenarios (ICLUS)
    Climate change and land-use change are global drivers of environmental change. Impact assessments frequently show that interactions between climate and land-use changes can create serious challenges for aquatic ecosystems, water quality, and air quality. Population projections to 2100 were used to model the distribution of new housing across the landscape. In addition, housing density was used to estimate changes in impervious surface cover.  A final report, datasets, the ICLUS+ Web Viewer and ArcGIS tools are available.
  • Indoor Semi-Volatile Organic Compound (i-SVOC)
    i-SVOC Version 1.0 is a general-purpose software application for dynamic modeling of the emission, transport, sorption, and distribution of semi-volatile organic compounds (SVOCs) in indoor environments. i-SVOC supports a variety of uses, including exposure assessment and the evaluation of mitigation options. SVOCs are a diverse group of organic chemicals that can be found in:
    • Pesticides;
    • Ingredients in cleaning agents and personal care products;
    • Additives to vinyl flooring, furniture, clothing, cookware, food packaging, and electronics.
    Many are also present in indoor air, where they tend to bind to interior surfaces and particulate matter (dust).
  • Municipal Solid Waste Decision Support Tool (MSW DST)
    This tool is designed to aid solid waste planners in evaluating the cost and environmental aspects of integrated municipal solid waste management strategies. The tool is the result of collaboration between EPA and RTI International and its partners.
  • Optical Noise-Reduction Averaging (ONA) Program Improves Black Carbon Particle Measurements Using Aethalometers
    ONA is a program that reduces noise in real-time black carbon data obtained using Aethalometers. Aethalometers optically measure the concentration of light-absorbing or “black” particles that accumulate on a filter as air flows through it. These particles are produced by incomplete combustion of fossil fuels, biofuels, and biomass. Under polluted conditions, they appear as smoke or haze.
  • RETIGO tool
    Real Time Geospatial Data Viewer (RETIGO) is a free, web-based tool that shows air quality data that are collected while in motion (walking, biking or in a vehicle). The tool helps users overcome technical barriers to exploring air quality data. After collecting measurements, citizen scientists and other users can import their own data and explore the data on a map.
  • Remote Sensing Information Gateway (RSIG)
    RSIG offers a new way for users to get the multi-terabyte, environmental datasets they want via an interactive, Web browser-based application. A file download and parsing process that now takes months will be reduced via RSIG to minutes.
  • Simulation Tool Kit for Indoor Air Quality and Inhalation Exposure (IAQX)
    IAQX version 1.1 is an indoor air quality (IAQ) simulation software package that complements and supplements existing IAQ simulation programs. IAQX is for advanced users who have experience with exposure estimation, pollution control, risk assessment, and risk management. There are many sources of indoor air pollution, such as building materials, furnishings, and chemical cleaners. Since most people spend a large portion of their time indoors, it is important to be able to estimate exposure to these pollutants. IAQX helps users analyze the impact of pollutant sources and sinks, ventilation, and air cleaners. It performs conventional IAQ simulations to calculate the pollutant concentration and/or personal exposure as a function of time. It can also estimate adequate ventilation rates based on user-provided air quality criteria, a unique feature useful for product stewardship and risk management.
  • Spatial Allocator
    The Spatial Allocator provides tools that could be used by the air quality modeling community to perform commonly needed spatial tasks without requiring the use of a commercial Geographic Information System (GIS).
  • Traceability Protocol for Assay and Certification of Gaseous Calibration Standards
    This is used to certify calibration gases for ambient and continuous emission monitors. It specifies methods for assaying gases and establishing traceability to National Institute of Standards and Technology (NIST) reference standards. Traceability is required under EPA ambient and continuous emission monitoring regulations.
  • Watershed Deposition Mapping Tool (WDT)
    WDT provides an easy-to-use tool for mapping deposition estimates from CMAQ to watersheds, providing the linkage of air and water needed for TMDL (Total Maximum Daily Load) and related nonpoint-source watershed analyses.
  • Visual Environment for Rich Data Interpretation (VERDI)
    VERDI is a flexible, modular, Java-based program for visualizing multivariate gridded meteorology, emissions, and air quality modeling data created by environmental modeling systems such as CMAQ and the Weather Research and Forecasting (WRF) model.
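Referring back to the Benchmark Dose Software (BMDS) entry above, the calculation it automates can be illustrated with a toy example: fit a dose-response curve and invert it at a chosen benchmark response. This sketch assumes a log-logistic model and uses invented data; BMDS itself offers many model families, model averaging, and confidence limits (the BMDL) that this snippet does not attempt.

```python
# Toy benchmark-dose (BMD) calculation: fit a log-logistic dose-response
# model to invented dichotomous data, then solve for the dose giving a
# 10% extra risk over background. Not BMDS code; illustration only.
import numpy as np
from scipy.optimize import curve_fit

dose = np.array([0.0, 10.0, 30.0, 100.0, 300.0])
frac = np.array([0.02, 0.05, 0.15, 0.40, 0.80])  # fraction responding (toy)

def loglogistic(d, bg, ec50, hill):
    """Background response plus a logistic increase in log-dose."""
    return bg + (1.0 - bg) / (1.0 + (ec50 / np.maximum(d, 1e-9)) ** hill)

(bg, ec50, hill), _ = curve_fit(loglogistic, dose, frac, p0=[0.02, 100.0, 1.0])

bmr = 0.10                          # benchmark response, as extra risk
target = bg + (1.0 - bg) * bmr      # response level reached at the BMD
bmd = ec50 / ((1.0 - bg) / (target - bg) - 1.0) ** (1.0 / hill)
print(f"BMD10 ~ {bmd:.1f} (same units as dose)")
```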

 

Databases

  • Air Quality Data for the CDC National Environmental Public Health Tracking Network 
    EPA’s Exposure Research scientists are collaborating with the Centers for Disease Control and Prevention (CDC) on a CDC initiative to build a National Environmental Public Health Tracking (EPHT) network. Working with state, local, and federal air pollution and health agencies, the EPHT program is facilitating the collection, integration, analysis, interpretation, and dissemination of data from environmental hazard monitoring and from human exposure and health effects surveillance. These data provide scientific information to develop surveillance indicators and to investigate possible relationships between environmental exposures, chronic disease, and other diseases, which can lead to interventions to reduce the burden of these illnesses. An important part of the initiative is air quality modeling estimates and air quality monitoring data, combined through Bayesian modeling, that can be linked with health outcome data.
  • EPAUS9R – An Energy Systems Database for use with the Market Allocation (MARKAL) Model
    The EPAUS9r is a regional database representation of the United States energy system. The database uses the MARKAL model. MARKAL is an energy system optimization model used by local and federal governments, national and international communities and academia. EPAUS9r represents energy supply, technology, and demand throughout the major sectors of the U.S. energy system.
  • Fused Air Quality Surfaces Using Downscaling
    This database provides access to the most recent O3 and PM2.5 surface datasets produced using downscaling.
  • Health & Environmental Research Online (HERO)
    HERO provides access to scientific literature used to support EPA’s integrated science assessments, including the Integrated Science Assessments (ISAs) that feed into the National Ambient Air Quality Standards (NAAQS) reviews.
  • SPECIATE 4.5 Database
    SPECIATE is a repository of volatile organic gas and particulate matter (PM) speciation profiles of air pollution sources.

A listing of EPA Tools and Databases for Water Contaminant Exposure Assessment

Exposure and Toxicity

  • EPA ExpoBox (A Toolbox for Exposure Assessors)
    This toolbox assists individuals from within government, industry, academia, and the general public with assessing exposure from multiple media, including water and sediment. It is a compendium of exposure assessment tools that links to guidance documents, databases, models, reference materials, and other related resources.

Chemical and Product Categories (CPCat) Database
CPCat is a database containing information mapping more than 43,000 chemicals to a set of terms categorizing their usage or function. The comprehensive list of chemicals with associated categories of chemical and product use was compiled from publicly available sources. Unique use category taxonomies from each source are mapped onto a single common set of approximately 800 terms. Users can search for chemicals by chemical name, Chemical Abstracts Service (CAS) Registry Number, or by CPCat terms associated with chemicals.

A listing of EPA Tools and Databases for Chemical Toxicity Prediction & Assessment

  • Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS)
    SeqAPASS is a fast, online screening tool that allows researchers and regulators to extrapolate toxicity information across species. For some species, such as humans, mice, rats, and zebrafish, the EPA has a large amount of data regarding their toxicological susceptibility to various chemicals. However, toxicity data for numerous other plants and animals are very limited. SeqAPASS extrapolates from these data-rich model organisms to thousands of other non-target species to evaluate their potential chemical susceptibility.
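The core idea, ranking species by how closely their version of a toxicity target resembles that of a susceptible model organism, can be sketched with a simple pairwise alignment. The sequences below are invented fragments, and SeqAPASS itself adds domain- and residue-level comparisons that this sketch omits; it assumes Biopython is installed.

```python
# Toy cross-species susceptibility screen: score each species' ortholog of
# a target protein against a susceptible model organism's sequence.
# Sequences are invented; a real analysis would use full-length proteins.
from Bio import Align

aligner = Align.PairwiseAligner()  # defaults: match=1, mismatch=0, gap=0
aligner.mode = "global"

model_seq = "MKTLLVLAVCLAAGSQA"  # hypothetical model-organism target

orthologs = {
    "Danio rerio": "MKTLLVLAVCLTAGSQA",
    "Xenopus laevis": "MKSLLVLGVCLAAGAQA",
    "Apis mellifera": "MRTIIVLSVALGAGNKA",
}

for species, seq in orthologs.items():
    matches = aligner.score(model_seq, seq)  # matched residues in best alignment
    identity = matches / max(len(model_seq), len(seq))
    print(f"{species:16s} identity ~ {identity:.0%}")
```

Higher identity to the data-rich model suggests the protein target is conserved and the species may share its susceptibility; low identity flags a species where extrapolation is weaker.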

 

A listing of EPA Webinar and Literature on Bioinformatic Tools and Projects

Comparative Bioinformatics Applications for Developmental Toxicology

Discusses how the US EPA/NCCT is trying to solve the problem of too many chemicals, too high cost, and too much biological uncertainty. Discusses the solution the ToxCast Program is proposing: a data-rich system to screen, classify, and rank chemicals for further evaluation.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=186844

CHEMOINFORMATIC AND BIOINFORMATIC CHALLENGES AT THE US ENVIRONMENTAL PROTECTION AGENCY.

This presentation will provide an overview of both the scientific program and the regulatory activities related to computational toxicology.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=154013

How Can We Use Bioinformatics to Predict Which Agents Will Cause Birth Defects?

The availability of genomic sequences from a growing number of human and model organisms has provided an explosion of data, information, and knowledge regarding biological systems and disease processes. High-throughput technologies such as DNA and protein microarray biochips are now standard tools for probing the cellular state and determining important cellular behaviors at the genomic/proteomic levels. While these newer technologies are beginning to provide important information on cellular reactions to toxicant exposure (toxicogenomics), a major challenge that remains is the formulation of a strategy to integrate transcript, protein, metabolite, and toxicity data. This integration will require new concepts and tools in bioinformatics. The U.S. National Library of Medicine’s PubMed site includes 19 million citations and abstracts and continues to grow. The BDSM team is now working on assembling the literature’s unstructured data into a structured database and linking it to BDSM within a system that can then be used for testing and generating new hypotheses. This effort will generate databases of entities (such as genes, proteins, metabolites, and gene ontology processes) linked to PubMed identifiers/abstracts and provide information on the relationships between them. The end result will be an online/standalone tool that will help researchers focus on the papers most relevant to their query and uncover hidden connections and obvious information gaps.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=227345

ADVANCED PROTEOMICS AND BIOINFORMATICS TOOLS IN TOXICOLOGY RESEARCH: OVERCOMING CHALLENGES TO PROVIDE SIGNIFICANT RESULTS

This presentation specifically addresses the advantages and limitations of state-of-the-art gel, protein array, and peptide-based labeling proteomic approaches to assess the effects of a suite of model T4 inhibitors on the thyroid axis of Xenopus laevis.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NHEERL&dirEntryId=152823

Bioinformatic Integration of in vivo Data and Literature-based Gene Associations for Prioritization of Adverse Outcome Pathway Development

Adverse outcome pathways (AOPs) describe a sequence of events, beginning with a molecular initiating event (MIE), proceeding via key events (KEs), and culminating in an adverse outcome (AO). A challenge for use of AOPs in a safety evaluation context has been identification of MIEs and KEs relevant for AOs observed in regulatory toxicity studies. In this work, we implemented a bioinformatic approach that leverages mechanistic information in the literature and the AOs measured in regulatory toxicity studies to prioritize putative MIEs and/or early KEs for AOP development relevant to chemical safety evaluation. The US Environmental Protection Agency Toxicity Reference Database (ToxRefDB, v2.0) contains effect information for >1000 chemicals curated from >5000 studies or summaries from sources including data evaluation records from the US EPA Office of Pesticide Programs, the National Toxicology Program (NTP), peer-reviewed literature, and pharmaceutical preclinical studies. To increase ToxRefDB interoperability, endpoint and effect information were cross-referenced with codes from the Unified Medical Language System, which enabled mapping of in vivo pathological effects from ToxRefDB to PubMed (via Medical Subject Headings, or MeSH). This enabled linkage to any resource that is also connected to PubMed or indexed with MeSH. A publicly available bioinformatic tool, the Entity-MeSH Co-occurrence Network (EMCON), uses multiple data sources and a measure of mutual information to identify the genes most related to a MeSH term. Using EMCON, gene sets were generated for endpoints of toxicological relevance in ToxRefDB, linking putative KEs and/or MIEs. The Comparative Toxicogenomics Database was used to further filter important associations. As a proof of concept, thyroid-related effects and their highly associated genes were examined and demonstrated relevant MIEs and early KEs for AOPs to describe thyroid-related AOs. The ToxRefDB-to-gene mapping for thyroid resulted in >50 unique gene-to-chemical relationships. Integrated use of EMCON and ToxRefDB data provides a basis for rapid and robust putative AOP development, as well as a novel means to generate mechanistic hypotheses for specific chemicals. This abstract does not necessarily reflect U.S. EPA policy. Abstract and poster for the 2019 Society of Toxicology annual meeting, March 2019.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=344452
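
The co-occurrence scoring that EMCON performs can be illustrated with a toy pointwise mutual information calculation over document counts. All counts below are invented, and EMCON's actual statistic and data sources may differ; the point is only that normalizing by each gene's overall literature frequency keeps ubiquitous genes from dominating.

```python
# Toy gene-to-MeSH association score via pointwise mutual information (PMI)
# over hypothetical PubMed-style document counts.
import math

n_docs = 100_000                 # total abstracts indexed (invented)
mesh_docs = 500                  # abstracts tagged with the MeSH term of interest
gene_docs = {"TPO": 800, "TSHR": 600, "GAPDH": 20_000}  # per-gene counts
co_docs = {"TPO": 120, "TSHR": 90, "GAPDH": 110}        # gene AND MeSH term

def pmi(co, gene, mesh, n):
    """log2 of observed co-occurrence over that expected by chance."""
    return math.log2((co / n) / ((gene / n) * (mesh / n)))

for g in gene_docs:
    print(f"{g:6s} PMI = {pmi(co_docs[g], gene_docs[g], mesh_docs, n_docs):.2f}")
# GAPDH co-occurs often in absolute terms but scores near zero once its
# overall frequency in the literature is accounted for.
```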


A Web-Hosted R Workflow to Simplify and Automate the Analysis of 16S NGS Data

Next-Generation Sequencing (NGS) produces large data sets that include tens of thousands of sequence reads per sample. For analysis of bacterial diversity, 16S NGS sequences are typically analyzed in a workflow containing best-of-breed bioinformatics packages that may leverage multiple programming languages (e.g., Python, R, Java, etc.). The process to transform raw NGS data to usable operational taxonomic units (OTUs) can be tedious due to the number of quality control (QC) steps used in QIIME and other software packages for sample processing. Therefore, the purpose of this work was to simplify the analysis of 16S NGS data from a large number of samples by integrating QC, demultiplexing, and QIIME (Quantitative Insights Into Microbial Ecology) analysis in an accessible R project. User command-line operations for each of the pipeline steps were automated into a workflow. In addition, the R server allows multi-user access to the automated pipeline via separate user accounts while providing access to the same large set of underlying data. We demonstrate the applicability of this pipeline automation using 16S NGS data from approximately 100 stormwater runoff samples collected in a mixed-land-use watershed in northeast Georgia. OTU tables were generated for each sample and the relative taxonomic abundances were compared for different periods over storm hydrographs to determine how the microbial ecology of a stream changes with the rise and fall of stream stage. Our approach simplifies the pipeline analysis of multiple 16S NGS samples by automating multiple preprocessing, QC, analysis, and post-processing command-line steps that are called by a sequence of R scripts. Presented at ASM 2015: Rapid NGS Bioinformatic Pipelines for Enhanced Molecular Epidemiologic Investigation of Pathogens.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NERL&dirEntryId=309890
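
The automation pattern the abstract describes, fixed command-line steps chained so a user never types them by hand, reduces to a small driver script. The sketch below substitutes echo placeholders for the real QC, demultiplexing, and QIIME commands, which are not reproduced here.

```python
# Minimal pipeline driver: run a fixed sequence of shell steps in order,
# halting on the first failure. Commands are placeholders, not the
# paper's actual QC/demultiplexing/QIIME invocations.
import subprocess
import sys

steps = [
    "echo 'step 1: quality control / read trimming'",
    "echo 'step 2: demultiplexing by sample barcode'",
    "echo 'step 3: QIIME OTU picking and taxonomy assignment'",
]

for i, cmd in enumerate(steps, start=1):
    print(f"[{i}/{len(steps)}] {cmd}")
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        sys.exit(f"step {i} failed with exit code {result.returncode}")
print("pipeline finished")
```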

DEVELOPING COMPUTATIONAL TOOLS NECESSARY FOR APPLYING TOXICOGENOMICS TO RISK ASSESSMENT AND REGULATORY DECISION MAKING.

Genomics, proteomics, and metabolomics can provide useful weight-of-evidence data along the source-to-outcome continuum when appropriate bioinformatic and computational methods are applied toward integrating molecular, chemical, and toxicological information.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=156264

The Human Toxome Project

The Human Toxome project, funded as an NIH Transformative Research grant 2011-2016, is focused on developing the concepts and the means for deducing, validating, and sharing molecular Pathways of Toxicity (PoT). Using the test case of estrogenic endocrine disruption, the responses of MCF-7 human breast cancer cells are being phenotyped by transcriptomics and mass-spectroscopy-based metabolomics. The bioinformatics tools for PoT deduction represent a core deliverable. A number of challenges for quality and standardization of cell systems, omics technologies, and bioinformatics are being addressed. In parallel, concepts for annotation, validation, and sharing of PoT information, as well as their link to adverse outcomes, are being developed. A reasonably comprehensive public database of PoT, the Human Toxome Knowledge-base, could become a point of reference for toxicological research and regulatory test strategies.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NCCT&dirEntryId=309453

High-Resolution Metabolomics for Environmental Chemical Surveillance and Bioeffect Monitoring

High-Resolution Metabolomics for Environmental Chemical Surveillance and Bioeffect Monitoring (Presented by: Dean Jones, PhD, Department of Medicine, Emory University) (2/28/2013)

https://www.epa.gov/chemical-research/high-resolution-metabolomics-environmental-chemical-surveillance-and-bioeffect

Identification of Absorption, Distribution, Metabolism, and Excretion (ADME) Genes Relevant to Steatosis Using a Gene Expression Approach

Absorption, distribution, metabolism, and excretion (ADME) impact chemical concentration and activation of molecular initiating events of Adverse Outcome Pathways (AOPs) in cellular, tissue, and organ-level targets. In order to better describe ADME parameters and how they modulate potential hazards posed by chemical exposure, our goal is to investigate the relationship between AOPs and ADME-related genes and functional information. Given the scope of this task, we began using hepatic steatosis as a case study. To identify ADME genes related to steatosis, we used the publicly available toxicogenomics database, Open TG-GATEs™. This database contains standardized rodent chemical exposure data from 170 chemicals (mostly drugs), along with differential gene expression data and corresponding associated pathological changes. We examined the chemical exposure microarray data set gathered from 9 chemical exposure treatments resulting in pathologically confirmed (minimal, moderate, and severe) incidences of hepatic steatosis. From this data set, we used differential expression analyses to identify gene changes resulting from the chemical exposures leading to hepatic steatosis. We then selected differentially expressed genes (DEGs) related to ADME by filtering all genes based on their ADME functional identities. These DEGs include enzymes such as cytochrome P450, UDP-glucuronosyltransferase, and flavin-containing monooxygenase, and transporter genes such as the solute carrier and ATP-binding cassette transporter families. A total of 61 genes were upregulated and 68 genes were downregulated in all treatments, while 25 genes were upregulated in some treatments and downregulated in others. This work highlights the application of bioinformatics in linking AOPs with gene modulation, specifically in relation to ADME and exposures to chemicals. Specifically, we delineate a method to identify genes that are related to ADME and can impact target-tissue dose in response to chemical exposures. The computational method outlined in this work is applicable to any adverse outcome pathway and provides a linkage between chemical exposure, target-tissue dose, and adverse outcomes. Application of this method will allow for the rapid screening of chemicals for their impact on ADME-related genes using available gene databases in the literature. This abstract does not necessarily reflect U.S. EPA policy.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NHEERL&dirEntryId=341273
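
The gene-filtering step described in the abstract amounts to intersecting a differential-expression table with an ADME gene list. Here is a minimal pandas sketch with invented genes and statistics, not the study's actual data or thresholds.

```python
# Toy ADME filter over differential-expression results.
import pandas as pd

degs = pd.DataFrame({
    "gene":   ["Cyp2e1", "Ugt1a1", "Actb", "Slc10a1", "Tp53"],
    "log2fc": [2.1, -1.8, 0.2, -2.4, 1.1],
    "padj":   [0.001, 0.004, 0.60, 0.0005, 0.03],
})

adme_genes = {"Cyp2e1", "Ugt1a1", "Slc10a1", "Abcb1"}  # illustrative subset

hits = degs[(degs["padj"] < 0.05) & (degs["gene"].isin(adme_genes))]
n_up = (hits["log2fc"] > 0).sum()
n_down = (hits["log2fc"] < 0).sum()
print(f"{n_up} ADME gene(s) upregulated, {n_down} downregulated")
print(hits)
```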

Development of Environmental Fate and Metabolic Simulators

Presented at the Bioinformatics Open Source Conference (BOSC), Detroit, MI, June 23-24, 2005.

https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=NERL&dirEntryId=257172

 

Useful Webinars on EPA Computational Tools and Informatics

 

Computational Toxicology Communities of Practice

Computational Toxicology Research

EPA’s Computational Toxicology Communities of Practice is composed of hundreds of stakeholders from over 50 public and private sector organizations (ranging from EPA, other federal agencies, industry, academic institutions, professional societies, nongovernmental organizations, environmental non-profit groups, state environmental agencies and more) who have an interest in using advances in computational toxicology and exposure science to evaluate the safety of chemicals.

The Communities of Practice is open to the public. Monthly webinars are held at EPA’s RTP campus, on the fourth Thursday of the month (occasionally rescheduled in November and December to accommodate holiday schedules), from 11am-Noon EST/EDT. Remote participation is available. For more information or to be added to the meeting email list, contact: Monica Linnenbrink (linnenbrink.monica@epa.gov).

Related Links

Past Webinar Presentations

Presentation | Presented By | Date
OPEn structure-activity Relationship App (OPERA) (PowerPoint, Video) | Dr. Kamel Mansouri, Lead Computational Chemist contractor for Integrated Laboratory Systems in the National Institute of Environmental Health Sciences | 2019/4/25
CompTox Chemicals Dashboard and InVitroDB V3 (Video) | Dr. Antony Williams, Chemist, and Dr. Katie Paul-Friedman, Toxicologist, EPA National Center for Computational Toxicology (NCCT) | 2019/3/28
The Systematic Empirical Evaluation of Models (SEEM) framework (Video) | Dr. John Wambaugh, Physical Scientist, EPA NCCT | 2019/2/28
ToxValDB: A comprehensive database of quantitative in vivo study results from over 25,000 chemicals (Video) | Dr. Richard Judson, Research Chemist, EPA NCCT | 2018/12/20
Sequence Alignment to Predict Across Species Susceptibility (seqAPASS) (Video) | Dr. Carlie LaLone, Bioinformaticist, EPA National Health and Environmental Effects Research Laboratory (NHEERL) | 2018/11/29
Chemicals and Products Database (Video) | Dr. Kathie Dionisio, Environmental Health Scientist, EPA National Exposure Research Laboratory (NERL) | 2018/10/25
CompTox Chemicals Dashboard V3 (Video) | Dr. Antony Williams, Chemist, EPA NCCT | 2018/09/27
Generalised Read-Across (GenRA) (Video) | Dr. Grace Patlewicz, Chemist, EPA NCCT | 2018/08/23
EPA’s ToxCast Owner’s Manual (Video) | Monica Linnenbrink, Strategic Outreach and Communication lead, EPA NCCT | 2018/07/26
EPA’s Non-Targeted Analysis Collaborative Trial (ENTACT) (Video) | Elin Ulrich, Research Chemist, Public Health Chemistry Branch, EPA NERL | 2018/06/28
ECOTOX Knowledgebase: New Tools and Data Visualizations (Video) | Colleen Elonen, Translational Toxicology Branch, and Dr. Jennifer Olker, Systems Toxicology Branch, Mid-Continent Ecology Division, EPA NHEERL | 2018/05/24
Investigating Chemical-Microbiota Interactions in Zebrafish (Video) | Tamara Tal, Biologist, Systems Biology Branch, Integrated Systems Toxicology Division, EPA NHEERL | 2018/04/26
The CompTox Chemistry Dashboard v2.6: Delivering Improved Access to Data and Real Time Predictions (Video) | Tony Williams, Computational Chemist, EPA NCCT | 2018/03/29
mRNA Transfection Retrofits Cell-Based Assays with Xenobiotic Metabolism (Video; audio starts at 10:17) | Steve Simmons, Research Toxicologist, EPA NCCT | 2018/02/22
Development and Distribution of ToxCast and Tox21 High-Throughput Chemical Screening Assay Method Description (Video) | Stacie Flood, National Student Services Contractor, EPA NCCT | 2018/01/25
High-throughput H295R steroidogenesis assay: utility as an alternative and a statistical approach to characterize effects on steroidogenesis (Video) | Derik Haggard, ORISE Postdoctoral Fellow, EPA NCCT | 2017/12/14
Systematic Review for Chemical Assessments: Core Elements and Considerations for Rapid Response (Video) | Kris Thayer, Director, Integrated Risk Information System (IRIS) Division, EPA National Center for Environmental Assessment (NCEA) | 2017/11/16
High Throughput Transcriptomics (HTTr) Concentration-Response Screening in MCF7 Cells (Video) | Joshua Harrill, Toxicologist, EPA NCCT | 2017/10/26
Learning Boolean Networks from ToxCast High-Content Imaging Data | Todor Antonijevic, ORISE Postdoc, EPA NCCT | 2017/09/28
Suspect Screening of Chemicals in Consumer Products | Katherine Phillips, Research Chemist, Human Exposure and Dose Modeling Branch, Computational Exposure Division, EPA NERL | 2017/08/31
The EPA CompTox Chemistry Dashboard: A Centralized Hub for Integrating Data for the Environmental Sciences (Video) | Antony Williams, Chemist, EPA NCCT | 2017/07/27
Navigating Through the Minefield of Read-Across Tools and Frameworks: An Update on Generalized Read-Across (GenRA) (Video) | |

 


A Nonlinear Methodology to Explain Complexity of the Genome and Bioinformatic Information, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)

A Nonlinear Methodology to Explain Complexity of the Genome and Bioinformatic Information

Reporter: Stephen J. Williams, Ph.D.

Multifractal bioinformatics: A proposal to the nonlinear interpretation of genome

The following is an open access article by Pedro Moreno on a methodology to analyze genetic information across species, and in particular the evolutionary trends of complex genomes, by a nonlinear analytic approach utilizing fractal geometry, coined “Nonlinear Bioinformatics”.  This fractal approach stems from the complex nature of higher eukaryotic genomes, including mosaicism, multiple interspersed genomic elements such as intronic and noncoding regions, and mobile elements such as transposable elements.  Although seemingly random, these elements have a repetitive nature. Such complexity of DNA regulation, structure, and genomic variation is thought to be best understood by developing algorithms based on fractal analysis, which can model the regionalized and repetitive variability and structure within complex genomes by elucidating the individual components that contribute to the overall complex structure. A “linear” or “reductionist” approach that looks only at individual coding regions does not take into consideration the aforementioned factors leading to genetic complexity and diversity.

Indeed, many other attempts to describe the complexities of DNA as a fractal geometric pattern have been made.  In a paper by Carlo Cattani, “Fractals and Hidden Symmetries in DNA“, Cattani uses fractal analysis to construct a simple geometric pattern of the influenza A virus by modeling the primary sequence of this viral DNA, namely the bases A, G, C, and T. The main conclusions,

fractal shapes and symmetries in DNA sequences and DNA walks have been shown and compared with random and deterministic complex series. DNA sequences are structured in such a way that there exists some fractal behavior which can be observed both on the correlation matrix and on the DNA walks. Wavelet analysis confirms by a symmetrical clustering of wavelet coefficients the existence of scale symmetries.

suggest that, at least, the viral influenza genome structure could be analyzed into its basic components by fractal geometry.
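
A DNA walk of the kind used in such analyses is easy to construct: map each base to a step up or down, accumulate the sum, then check how the displacement grows with window size. The snippet below uses a random stand-in sequence, so it shows the uncorrelated baseline (scaling exponent near 0.5) that real, long-range correlated genomic regions deviate from.

```python
# DNA walk sketch: purines (A, G) step +1, pyrimidines (C, T) step -1.
# A random stand-in sequence is used; real sequences would be read from FASTA.
import math
import random

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))

walk = [0]
for base in seq:
    walk.append(walk[-1] + (1 if base in "AG" else -1))

# Root-mean-square displacement over windows of size n grows like n**0.5
# for an uncorrelated sequence; fractal, correlated DNA deviates from this.
for n in (10, 100, 1000):
    sq = [(walk[i + n] - walk[i]) ** 2 for i in range(0, len(walk) - n, n)]
    print(f"n={n:5d}  RMS displacement = {math.sqrt(sum(sq) / len(sq)):.1f}")
```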
This approach has also been used to model the complex nature of cancer, as discussed in a 2011 Seminars in Oncology paper:
Abstract: Cancer is a highly complex disease due to the disruption of tissue architecture. Thus, tissues, and not individual cells, are the proper level of observation for the study of carcinogenesis. This paradigm shift from a reductionist approach to a systems biology approach is long overdue. Indeed, cell phenotypes are emergent modes arising through collective non-linear interactions among different cellular and microenvironmental components, generally described by “phase space diagrams”, where stable states (attractors) are embedded into a landscape model. Within this framework, cell states and cell transitions are generally conceived as mainly specified by gene-regulatory networks. However, the system’s dynamics is not reducible to the integrated functioning of the genome-proteome network alone; the epithelia-stroma interacting system must be taken into consideration in order to give a more comprehensive picture. Given that cell shape represents the spatial geometric configuration acquired as a result of the integrated set of cellular and environmental cues, we posit that fractal-shape parameters represent “omics” descriptors of the epithelium-stroma system. Within this framework, function appears to follow form, and not the other way around.

As the authors conclude:

“Transitions from one phenotype to another are reminiscent of phase transitions observed in physical systems. The description of such transitions could be obtained by a set of morphological, quantitative parameters, like fractal measures. These parameters provide reliable information about system complexity.”

Gene expression also displays a fractal nature. In a Frontiers in Physiology paper by Mahboobeh Ghorbani, Edmond A. Jonckheere, and Paul Bogdan, “Gene Expression Is Not Random: Scaling, Long-Range Cross-Dependence, and Fractal Characteristics of Gene Regulatory Networks“,

the authors describe how gene expression time series display fractal and long-range dependence characteristics.

Abstract: Gene expression is a vital process through which cells react to the environment and express functional behavior. Understanding the dynamics of gene expression could prove crucial in unraveling the physical complexities involved in this process. Specifically, understanding the coherent complex structure of transcriptional dynamics is the goal of numerous computational studies aiming to study and finally control cellular processes. Here, we report the scaling properties of gene expression time series in Escherichia coli and Saccharomyces cerevisiae. Unlike previous studies, which report the fractal and long-range dependency of DNA structure, we investigate the individual gene expression dynamics as well as the cross-dependency between them in the context of gene regulatory networks. Our results demonstrate that the gene expression time series display fractal and long-range dependence characteristics. In addition, the dynamics between genes and linked transcription factors in gene regulatory networks are also fractal and long-range cross-correlated. The cross-correlation exponents in gene regulatory networks are not unique. The distribution of the cross-correlation exponents of gene regulatory networks for several types of cells can be interpreted as a measure of the complexity of their functional behavior.
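The long-range dependence reported in this abstract is usually summarized by a Hurst exponent H (H > 0.5 indicating persistent, long-range-correlated dynamics). The sketch below is a generic rescaled-range (R/S) estimator of H applied to a synthetic series, not the authors' actual pipeline; for white noise it should return H near 0.5, whereas long-range-dependent expression series would give H > 0.5.

```python
import numpy as np

def hurst_rs(x: np.ndarray, window_sizes=(8, 16, 32, 64, 128)) -> float:
    """Estimate the Hurst exponent by rescaled-range (R/S) analysis."""
    rs_means = []
    for n in window_sizes:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):   # non-overlapping windows of length n
            w = x[start:start + n]
            dev = np.cumsum(w - w.mean())           # cumulative deviation from the window mean
            r = dev.max() - dev.min()               # range of the cumulative deviation
            s = w.std()
            if s > 0:
                rs_vals.append(r / s)
        rs_means.append(np.mean(rs_vals))
    # R/S grows like n**H, so H is the slope of log(R/S) against log(n)
    H, _ = np.polyfit(np.log(window_sizes), np.log(rs_means), 1)
    return H

rng = np.random.default_rng(0)
print(f"H = {hurst_rs(rng.normal(size=1024)):.2f}")  # white noise: expect H near 0.5
```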
Given that a multitude of complex biomolecular networks and biomolecules can be described by fractal patterns, the development of bioinformatic algorithms based on fractal analysis would enhance our understanding of the interdependence and cross-functionality of these multiple biological networks, particularly in disease and drug resistance.  The article below by Pedro Moreno describes the development of such bioinformatic algorithms.

Pedro A. Moreno
Escuela de Ingeniería de Sistemas y Computación, Facultad de Ingeniería, Universidad del Valle, Cali, Colombia
E-mail: pedro.moreno@correounivalle.edu.co

Thematic area: Systems engineering
Received: September 19, 2012
Accepted: December 16, 2013

Abstract

The first draft of the human genome (HG) sequence was published in 2001 by two competing consortia. Since then, several structural and functional characteristics of the HG organization have been revealed. Today, more than 2,000 HGs have been sequenced, and these findings are impacting strongly on academia and public health. Despite all this, a major bottleneck, called genome interpretation, persists: the lack of a theory that explains the complex puzzle of coding and non-coding features that compose the HG as a whole. Ten years after the HG was sequenced, two recent studies, discussed here within the multifractal formalism, allow proposing a nonlinear theory that helps interpret the structural and functional variation of the genetic information of genomes. The present review article discusses this new approach, called “Multifractal bioinformatics”.

Keywords: Omics sciences, bioinformatics, human genome, multifractal analysis.


1. Introduction

Omic Sciences and Bioinformatics

In order to study genomes, their life properties, and the pathological consequences of their impairment, the Human Genome Project (HGP) was created in 1990. Since then, about 500 Gbp (EMBL) represented in thousands of prokaryotic genomes and tens of different eukaryotic genomes have been sequenced (NCBI, 1000 Genomes, ENCODE). Today, genomics is defined as the set of sciences and technologies dedicated to the comprehensive study of the structure, function and origin of genomes. Several types of genomics have arisen as a result of the expansion and implementation of genomics to the study of the Central Dogma of Molecular Biology (CDMB), Figure 1 (above). The catalog of different types of genomics uses the suffix “-omics”, meaning “set of”, to name the new massive approaches of the omics sciences (Moreno et al, 2009). Given the large amount of genomic information available in the databases and the urgency of its actual interpretation, the balance has begun to lean heavily toward the bioinformatics infrastructure requirements of research laboratories, Figure 1 (below).

Bioinformatics, or computational biology, is defined as the application of computer and information technology to the analysis of biological data (Mount, 2004). It is an interdisciplinary science that requires the use of computing, applied mathematics, statistics, computer science, artificial intelligence, biophysics, biochemistry, genetics, and molecular biology. Bioinformatics was born from the need to understand the sequences of nucleotide or amino acid symbols that make up DNA and proteins, respectively. These analyses are made possible by the development of powerful algorithms that predict and reveal an infinity of structural and functional features in genomic sequences, such as gene location, discovery of homologies between macromolecule databases (BLAST), algorithms for phylogenetic analysis, for regulatory analysis, or for the prediction of protein folding, among others. This great development has created a multiplicity of approaches giving rise to new types of bioinformatics, such as the Multifractal Bioinformatics (MFB) proposed here.

1.1 Multifractal Bioinformatics and Theoretical Background

MFB is a proposal to analyze the information content of genomes and their life properties in a non-linear way. It is part of a specialized sub-discipline called “nonlinear bioinformatics”, which uses a number of related techniques for the study of nonlinearity (fractal geometry, Hurst exponents, power laws, wavelets, among others) applied to the study of biological problems (http://pharmaceuticalintelligence.com/tag/fractal-geometry/). Its application requires detailed knowledge of the structure of the genome to be analyzed and an appropriate knowledge of multifractal analysis.

1.2 From the Worm Genome toward Human Genome

To explore a complex genome such as the HG, it is useful to first implement multifractal analysis (MFA) in a simpler genome in order to show its practical utility. For example, the genome of the small nematode Caenorhabditis elegans is an excellent model from which many lessons can be extrapolated to complex organisms. Thus, if the MFA explains some of the structural properties of that genome, it is expected that the same analysis will reveal similar properties in the HG.

The C. elegans nuclear genome is composed of about 100 Mbp, with six chromosomes distributed into five autosomes and one sex chromosome. The molecular structure of the genome is particularly homogeneous along the chromosome sequences, due to the presence of several regular features, including large contents of genes and introns of similar sizes. The C. elegans genome also has a regional organization of the chromosomes, mainly because the majority of the repeated sequences are located in the chromosome arms, Figure 2 (left) (C. elegans Sequencing Consortium, 1998). Given these regular and irregular features, the MFA could be an appropriate approach to analyze such distributions.

Meanwhile, the HG sequencing revealed a surprising mosaicism in coding (genes) and noncoding (repetitive DNA) sequences, Figure 2 (right) (Venter et al., 2001). This structure of 6 Gbp is divided into 23 pairs of chromosomes (in diploid cells), and these highly regionalized sequences introduce complex patterns of regularity and irregularity to understanding the gene structure, the composition of repetitive DNA sequences, and their role in the study and application of the life sciences. The coding regions of the genome are estimated at ~25,000 genes, which constitute 1.4% of the HG. These genes are immersed in a giant sea of various types of non-coding sequences which compose 98.6% of the HG (popularly misnamed “junk DNA”). The non-coding regions are characterized by many types of repeated DNA sequences: 10.6% consists of Alu sequences, a type of SINE (short interspersed element) preferentially located towards the genes. LINEs, MIR, MER, LTR, DNA transposons and introns are other types of non-coding sequences which together form about 86% of the genome. Some of these sequences overlap with each other, as with CpG islands, which complicates the analysis of the genomic landscape. This standard genomic landscape was recently clarified: the latest studies show that 80.4% of the HG is functional, owing to the discovery of more than five million “switches” that operate and regulate gene activity, re-evaluating the concept of “junk DNA” (The ENCODE Project Consortium, 2012).

Given that all these genomic variations both in worm and human produce regionalized genomic landscapes it is proposed that Fractal Geometry (FG) would allow measuring how the genetic information content is fragmented. In this paper the methodology and the nonlinear descriptive models for each of these genomes will be reviewed.

1.3 The MFA and its Application to Genome Studies

Most problems in physics are implicitly non-linear in nature, generating phenomena such as chaos. Chaos theory deals with (non-linear) dynamic systems that are very sensitive to initial conditions yet nonetheless deterministic; that is, their behavior can be completely determined by knowing the initial conditions (Peitgen et al, 1992). In turn, FG is an appropriate tool to study chaotic dynamic systems (CDS). In other words, FG and chaos are closely related, because the space region toward which a chaotic orbit tends asymptotically has a fractal structure (a strange attractor). Therefore, FG allows studying the framework on which CDS are defined (Moon, 1992). And this is how the genome structure and function are expected to be organized.

The MFA is an extension of FG related to (Shannon) information theory, disciplines that have been very useful for studying the information content of a sequence of symbols. Mandelbrot established FG in the 1980s as a geometry capable of measuring the irregularity of nature by calculating the fractal dimension (D), an exponent derived from a power law (Mandelbrot, 1982). The value of D gives a measure of the level of fragmentation, or the information content, of a complex phenomenon, because D measures the scaling degree of the fragmented self-similarity of the system. Thus, FG looks for self-similar properties in structures and processes at different scales of resolution, and these self-similarities are organized following scaling or power laws.

Sometimes one exponent is not sufficient to characterize a complex phenomenon, and more exponents are required. The multifractal formalism allows this; it applies when many subgroups of fractals with different scaling properties coexist simultaneously, described by a large number of exponents or fractal dimensions. As a result, when a multifractal singularity spectrum is generated, the scaling behavior of the frequency of symbols in a sequence can be quantified (Vélez et al, 2010).

The MFA has been implemented to study the spatial heterogeneity of theoretical and experimental fractal patterns in different disciplines. In post-genomic times, the MFA has been used to study multiple biological problems (Vélez et al, 2010). Nonetheless, very little attention has been given to the use of the MFA to characterize the structural genetic information content of genomes obtained from the images of the Chaos Game Representation (CGR). The first studies at this level were recently applied to the analysis of the C. elegans genome (Vélez et al, 2010) and the human genome (Moreno et al, 2011). The MFA methodology applied to the study of these genomes is developed below.

2. Methodology

The Multifractal Formalism from the CGR

2.1 Data Acquisition and Molecular Parameters

Databases for the C. elegans genome and the 36.2 Hs_refseq HG version were downloaded from the NCBI FTP server. Then, several strategies were designed to fragment the genomic DNA sequences into different length ranges. For example, the C. elegans genome was divided into 18 fragments, Figure 2 (left), and the human genome into 9,379 fragments. According to their annotation systems, the contents of molecular parameters of coding sequences (genes, exons and introns), noncoding sequences (repetitive DNA, Alu, LINEs, MIR, MER, LTR, promoters, etc.) and coding/non-coding DNA (TTAGGC, AAAAT, AAATT, TTTTC, TTTTT, CpG islands, etc.) were counted for each sequence.

2.2 Construction of the CGR

2.3 Fractal Measurement by the Box-Counting Method

Subsequently, the CGR, a recursive algorithm (Jeffrey, 1990; Restrepo et al, 2009), is applied to each selected DNA sequence, Figure 3 (above, left), from which an image is obtained that is then quantified by the box-counting algorithm. For example, Figure 3 (above, left) shows a CGR image for a human DNA sequence of 80,000 bp in length. Here, dark regions represent sub-quadrants with a high number of points (or nucleotides), and clear regions, sections with a low number of points. The calculation of D for the Koch curve by the box-counting method is illustrated by a progression of changes in grid size, with its Cartesian graph, in Table 1.
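The CGR recursion itself is compact enough to sketch in a few lines. The version below is a minimal illustration assuming one common corner assignment (A, C, G, T at the four corners of the unit square; conventions vary), with an arbitrary example sequence.

```python
import numpy as np

# One common corner assignment for the chaos game (conventions vary)
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(sequence: str) -> np.ndarray:
    """Chaos Game Representation: each base moves the point halfway toward its corner."""
    point = np.array([0.5, 0.5])
    pts = []
    for base in sequence.upper():
        if base in CORNERS:                    # skip ambiguous bases such as N
            point = (point + np.array(CORNERS[base])) / 2.0
            pts.append(point.copy())
    return np.array(pts)

pts = cgr_points("ATGGCGTACGCTTGA")            # arbitrary example sequence
print(pts[:5])
```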

The CGR image for a given DNA sequence is quantified by a standard fractal analysis. A fractal is a fragmented geometric figure whose parts are approximate copies of the figure at full scale; that is, the figure has self-similarity. The D is basically a scaling rule that the figure obeys. Generally, a power law is given by the following expression:

N(E) = K · E^D   (1)

where N(E) is the number of parts required for covering the figure when a scaling factor E is applied. The power law permits calculating the fractal dimension as:

D = ln N(E) / ln E   (2)

The box-counting algorithm obtains D by covering the figure with disjoint boxes of size ɛ = 1/E and counting the number of boxes required. Figure 4 (above, left) shows the multifractal measure at moment q = 1.
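Equations (1) and (2) translate directly into a box-counting estimator. The following minimal sketch grids a 2-D point cloud (such as the points of a CGR image) at several scaling factors E, counts the occupied boxes N(E), and fits D as the slope of ln N(E) versus ln E; the uniformly random demo points are synthetic and should give D close to 2.

```python
import numpy as np

def box_count_dimension(points: np.ndarray, scales=(2, 4, 8, 16, 32, 64)) -> float:
    """Estimate D = ln N(E) / ln E (equation (2)) via a linear fit over several scales."""
    counts = []
    for E in scales:                                   # E boxes per side, box size eps = 1/E
        boxes = np.clip(np.floor(points * E).astype(int), 0, E - 1)
        counts.append(len({tuple(b) for b in boxes}))  # N(E): number of occupied boxes
    D, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return D

rng = np.random.default_rng(1)
demo = rng.random((20000, 2))                          # points filling the unit square
print(f"D = {box_count_dimension(demo):.2f}")          # expect a value close to 2
```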

2.4 Multifractal Measurement

When the box-counting algorithm is generalized to the multifractal case according to the method of moments q, equation (3) is obtained (Gutiérrez et al, 1998; Yu et al, 2001):

Σ_i (M_i / M)^q ∝ ɛ^((q−1)·Dq)   (3)

where M_i is the number of points falling in the i-th grid box, M is the total number of points, and ɛ is the box size. Thus, the MFA is used when multiple scaling rules are applied. Figure 4 (above, right) shows the calculation of the multifractal measures at different moments q (the partition function). Here, the linear regressions must have a coefficient of determination equal or close to 1. From each linear regression a Dq is obtained, generating a spectrum of generalized fractal dimensions Dq for all integer q, Figure 4 (below, left). So, the multifractal spectrum is obtained as the limit:

Dq = lim(ɛ→0) [1 / (q−1)] · ln Σ_i (M_i / M)^q / ln ɛ

Varying the integer q allows emphasizing different regions and discriminating their fractal behavior: positive q values emphasize the dense regions, where a high Dq is synonymous with richness of structure and properties, while negative q values emphasize the scarce regions. In real-world applications, the limit Dq is readily approximated from the data using a linear fit: taking logarithms in equation (3) yields

ln Σ_i M_i^q = (q−1) · Dq · ln(ɛ) + constant,

which shows that ln Σ_i M_i^q for fixed q is a linear function of ln(ɛ); Dq can therefore be evaluated as the slope of ln Σ_i M_i^q against (q−1)·ln(ɛ). The methodologies and approaches for the box-counting method and the MFA are detailed in Moreno et al, 2000; Yu et al, 2001; and Moreno, 2005. For a rigorous mathematical development of the MFA from images, consult the Multifractal system article on Wikipedia.
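A minimal sketch of the method of moments in equation (3) is given below: for each box size ɛ it computes the box probabilities p_i = M_i/M, forms the partition sums, and estimates Dq from the slope of a linear fit, handling q = 1 (the information dimension) as the usual limiting case. This is an illustrative implementation, not the authors' code.

```python
import numpy as np

def generalized_dimensions(points: np.ndarray, qs, scales=(4, 8, 16, 32, 64)) -> dict:
    """Estimate generalized fractal dimensions D_q by the method of moments."""
    log_eps = []
    log_Z = {q: [] for q in qs}
    for E in scales:
        # Assign each point to one of the E x E boxes of size eps = 1/E
        idx = np.clip(np.floor(points * E).astype(int), 0, E - 1)
        _, counts = np.unique(idx[:, 0] * E + idx[:, 1], return_counts=True)
        p = counts / counts.sum()                        # box probabilities p_i = M_i / M
        log_eps.append(np.log(1.0 / E))
        for q in qs:
            if q == 1:
                log_Z[q].append(np.sum(p * np.log(p)))   # information-dimension limit
            else:
                log_Z[q].append(np.log(np.sum(p ** q)))  # ln of the partition sum
    Dq = {}
    for q in qs:
        slope, _ = np.polyfit(log_eps, log_Z[q], 1)      # slope of ln Z versus ln(eps)
        Dq[q] = slope if q == 1 else slope / (q - 1)     # (q-1)*D_q is the slope for q != 1
    return Dq

rng = np.random.default_rng(2)
demo = rng.random((50000, 2))                            # uniform cloud: all D_q should be near 2
print({q: round(d, 2) for q, d in generalized_dimensions(demo, range(-5, 6)).items()})
```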

2.5 Measurement of Information Content

Subsequently, from the spectrum of generalized dimensions Dq, the degree of multifractality ΔDq (MD) is calculated as the difference between the maximum and minimum values of Dq: ΔDq = Dq_max − Dq_min (Ivanov et al, 1999). When ΔDq is high, the multifractal spectrum is rich in information and highly aperiodic; when ΔDq is small, the resulting dimension spectrum is poor in information and highly periodic. It is expected, then, that aperiodicity in the genome would be related to highly polymorphic aperiodic genomic structures, and periodicity to highly repetitive and not very polymorphic genomic structures. The correlation exponent τ(q) = (q − 1)·Dq, Figure 4 (below, right), can also be obtained from the multifractal dimension Dq. The generalized dimension also provides significant specific information: D(q = 0) is equal to the capacity dimension, which in this analysis is given by the box count; D(q = 1) is equal to the information dimension; and D(q = 2) to the correlation dimension. Based on these multifractal parameters, many of the structural genomic properties can be quantified, related, and interpreted.
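Continuing the previous sketch (and reusing its generalized_dimensions function together with a CGR point cloud pts built from a long input sequence), ΔDq and τ(q) follow directly; a narrower q range than the (−20, 20) used in the paper keeps the moment sums numerically stable.

```python
# Continuing the sketch above; pts should come from the CGR of a long sequence.
qs = list(range(-10, 11))                        # narrower than the paper's (-20, 20) for stability
Dq = generalized_dimensions(pts, qs)
delta_Dq = max(Dq.values()) - min(Dq.values())   # degree of multifractality: Dq_max - Dq_min
tau = {q: (q - 1) * Dq[q] for q in qs}           # correlation exponent tau(q) = (q - 1) * Dq
print(f"dDq = {delta_Dq:.3f}, D0 = {Dq[0]:.2f}, D1 = {Dq[1]:.2f}, D2 = {Dq[2]:.2f}")
```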

2.6 Multifractal Parameters and Statistical and Discrimination Analyses

Once the multifractal parameters are calculated (Dq for q = −20 to 20, ΔDq, τ(q), etc.), correlations with the molecular parameters are sought. These relations are established by plotting the number of genome molecular parameters versus the MD by discriminant analysis, with Cartesian graphs in 2-D, Figure 5 (below, left), and 3-D, and by combining multifractal and molecular parameters. Finally, simple linear regression analysis, multivariate analysis, and analyses by ranges and clustering are performed to establish statistical significance.

3. Results and Discussion

3.1 Non-linear Descriptive Model for the C. elegans Genome

Analyzing the C. elegans genome with the multifractal formalism revealed what the symmetry and asymmetry of the genome's nucleotide composition had suggested. Thus, the multifractal scaling of the C. elegans genome is of interest because it indicates that the molecular structure of the chromosome may be organized as a system operating far from equilibrium following nonlinear laws (Ivanov et al, 1999; Burgos and Moreno-Tovar, 1996). This can be discussed from two points of view:

1) When comparing C. elegans chromosomes with each other, the X chromosome showed the lowest multifractality, Figure 5 (above). This means that the X chromosome is operating close to equilibrium, which results in increased genetic instability. The instability of the X could thus selectively contribute to the molecular mechanism that determines sex (XX or X0) during meiosis; that is, the X chromosome would operate closer to equilibrium in order to maintain this particular sexual dimorphism.

2) When comparing different chromosome regions of the C. elegans genome, changes in multifractality were found in relation to the regional organization (at the center and arms) exhibited by the chromosomes, Figure 5 (below, left). These behaviors are associated with changes in the content of repetitive DNA, Figure 5 (below, right). The results indicated that the chromosome arms are even more complex than previously anticipated. Thus, TTAGGC telomere sequences would be operating far from equilibrium to protect the genetic information encoded by the entire chromosome.

All these biological arguments may explain why the C. elegans genome is organized in a nonlinear way. These findings provide insight to quantify and understand the organization of the non-linear structure of the C. elegans genome, which may be extended to other genomes, including the HG (Vélez et al, 2010).

3.2 Nonlinear Descriptive Model for the Human Genome

Once the multifractal approach was validated in the C. elegans genome, the HG was analyzed exhaustively. This allowed proposing a nonlinear model for the HG structure, which is discussed from three points of view.

1) It was found that the high multifractality of the HG depends strongly on the content of Alu sequences and, to a lesser extent, on the content of CpG islands. These contents would be located primarily in highly aperiodic regions, thus taking the chromosome far from equilibrium and giving it greater genetic stability, protection and attraction of mutations, Figure 6 (A-C). Thus, hundreds of regions in the HG may have high genetic stability, and the most important genetic information of the HG, the genes, would be safeguarded from environmental fluctuations. Other repeated elements (LINEs, MIR, MER, LTRs) showed no significant relationship, Figure 6 (D). Consequently, the human multifractal map developed in Moreno et al, 2011 constitutes a good tool to identify regions rich in genetic information and genomic stability.

2) The multifractal context seems to be a significant requirement for the structural and functional organization of thousands of genes and gene families. Thus, a high multifractal (aperiodic) context appears to be a "genomic attractor" for many genes (KOGs, KEGGs), Figure 6 (E), and for some gene families, Figure 6 (F), involved in genetic and deterministic processes, in order to maintain deterministic regulatory control in the genome, although most HG sequences may be subject to a complex epigenetic control.

3) The classification of human chromosomes and the analysis of chromosome regions may have some medical implications (Moreno et al, 2002; Moreno et al, 2009). The structure of low nonlinearity exhibited by some chromosomes (or chromosome regions) may involve an environmental predisposition, making them potential targets for structural or numerical chromosomal alterations, Figure 6 (G). Additionally, sex chromosomes should have low multifractality to maintain sexual dimorphism and, probably, X chromosome inactivation.

All these fractal and biological arguments could explain why Alu elements are shaping the HG in a nonlinear manner (Moreno et al, 2011). Finally, the multifractal modeling of the HG serves as a theoretical framework to examine new discoveries made by the ENCODE project and new approaches to human epigenomes. That is, the non-linear organization of the HG might help explain why most of the HG is expected to be functional.

4. Conclusions

All these results show that the multifractal formalism is appropriate to quantify and evaluate the genetic information content of genomes and to relate it to the known molecular anatomy of the genome and some of its expected properties. Thus, MFB allows interpreting in a logical manner the structural nature and variation of the genome.

MFB also helps us understand why a number of chromosomal diseases are likely to occur in the genome, thus opening a new perspective toward personalized medicine in the study and interpretation of the HG and its diseases.

The entire genome contains nonlinear information that organizes it and supposedly makes it function, leading to the conclusion that virtually 100% of the HG is functional. Bioinformatics in general is enriched with a novel approach (MFB) that makes it possible to quantify the genetic information content of any DNA sequence, with practical applications to different disciplines in biology, medicine and agriculture. This novel breakthrough in computational genomic analysis and disease contributes to defining biology as a "hard" science.

MFB opens a door to develop a research program towards the establishment of an integrative discipline that contributes to "breaking" the code of human life (http://pharmaceuticalintelligence.com/page/3/).

5. Acknowledgements

Thanks to the directives of the EISC, the Universidad del Valle, and the School of Engineering for providing an academic, scientific and administrative space for conducting this research. Likewise, thanks to the co-authors (professors and students) who participated in the implementation of excerpts from some of the works cited here. Finally, thanks to Colciencias for biotechnology project grant # 1103-12-16765.


6. References

Blanco, S., & Moreno, P.A. (2007). Representación del juego del caos para el análisis de secuencias de ADN y proteínas mediante el análisis multifractal (método “box-counting”). In The Second International Seminar on Genomics and Proteomics, Bioinformatics and Systems Biology (pp. 17-25). Popayán, Colombia.

Burgos, J.D., & Moreno-Tovar, P. (1996). Zipf scaling behavior in the immune system. BioSystems, 39, 227-232.

C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012-2018.

Gutiérrez, J.M., Iglesias, A., Rodríguez, M.A., Burgos, J.D., & Moreno, P.A. (1998). Analyzing the multifractal structure of DNA nucleotide sequences. In M. Barbie & S. Chillemi (Eds.), Chaos and Noise in Biology and Medicine (chap. 4). Hackensack (NJ): World Scientific Publishing Co.

Ivanov, P.Ch., Nunes Amaral, L.A., Goldberger, A.L., Havlin, S., Rosenblum, M.G., Struzik, Z.R., & Stanley, H.E. (1999). Multifractality in human heartbeat dynamics. Nature, 399, 461-465.

Jeffrey, H.J. (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18, 2163-2175.

Mandelbrot, B. (1982). La geometría fractal de la naturaleza. Barcelona, España: Tusquets Editores.

Moon, F.C. (1992). Chaotic and fractal dynamics. New York: John Wiley.

Moreno, P.A. (2005). Large scale and small scale bioinformatics studies on the Caenorhabditis elegans genome. Doctoral thesis. Department of Biology and Biochemistry, University of Houston, Houston, USA.

Moreno, P.A., Burgos, J.D., Vélez, P.E., Gutiérrez, J.M., et al. (2000). Multifractal analysis of complete genomes. In Proceedings of the 12th International Genome Sequencing and Analysis Conference (pp. 80-81). Miami Beach (FL).

Moreno, P.A., Rodríguez, J.G., Vélez, P.E., Cubillos, J.R., & Del Portillo, P. (2002). La genómica aplicada en salud humana. Colombia Ciencia y Tecnología. Colciencias, 20, 14-21.

Moreno, P.A., Vélez, P.E., & Burgos, J.D. (2009). Biología molecular, genómica y post-genómica. Pioneros, principios y tecnologías. Popayán, Colombia: Editorial Universidad del Cauca.

Moreno, P.A., Vélez, P.E., Martínez, E., Garreta, L., Díaz, D., Amador, S., Gutiérrez, J.M., et al. (2011). The human genome: a multifractal analysis. BMC Genomics, 12, 506.

Mount, D.W. (2004). Bioinformatics: Sequence and genome analysis. New York: Cold Spring Harbor Laboratory Press.

Peitgen, H.O., Jürgens, H., & Saupe, D. (1992). Chaos and Fractals: New Frontiers of Science. New York: Springer-Verlag.

Restrepo, S., Pinzón, A., Rodríguez, L.M., Sierra, R., Grajales, A., Bernal, A., Barreto, E., et al. (2009). Computational biology in Colombia. PLoS Computational Biology, 5(10), e1000535.

The ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57-74.

Vélez, P.E., Garreta, L.E., Martínez, E., Díaz, N., Amador, S., Gutiérrez, J.M., Tischer, I., & Moreno, P.A. (2010). The Caenorhabditis elegans genome: a multifractal analysis. Genetics and Molecular Research, 9, 949-965.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., et al. (2001). The sequence of the human genome. Science, 291, 1304-1351.

Yu, Z.G., Anh, V., & Lau, K.S. (2001). Measure representation and multifractal analysis of complete genomes. Physical Review E, 64, 031903.

Other articles on Bioinformatics on this Open Access Journal include:

Bioinformatics Tool Review: Genome Variant Analysis Tools

2017 Agenda – BioInformatics: Track 6: BioIT World Conference & Expo ’17, May 23-25, 2017, Seaport World Trade Center, Boston, MA

Better bioinformatics

Broad Institute, Google Genomics combine bioinformatics and computing expertise

Autophagy-Modulating Proteins and Small Molecules Candidate Targets for Cancer Therapy: Commentary of Bioinformatics Approaches

CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics

Read Full Post »

Bioinformatics Tool Review: Genome Variant Analysis Tools, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)

Bioinformatics Tool Review: Genome Variant Analysis Tools

Curator: Stephen J. Williams, Ph.D.

Updated 02/07/2021

Updated 11/15/2018

The following post will be an ongoing curation of reviews of gene variant bioinformatic software.

The Ensembl Variant Effect Predictor.

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F.

Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4.

Author information:

1. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. wm2@ebi.ac.uk.
2. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
3. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. fiona@ebi.ac.uk.

Abstract

The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

Rare diseases can be difficult to diagnose due to low incidence and incomplete penetrance of implicated alleles; however, variant analysis of whole genome sequencing can identify the underlying genetic events responsible for the disease (Nature, 2015).  A large cohort is nevertheless required for many WGS association studies in order to produce enough statistical power for interpretation (see post and here).  To this effect, major sequencing projects have been initiated worldwide.

A more thorough curation of sequencing projects can be seen in the following post:

Icelandic Population Genomic Study Results by deCODE Genetics come to Fruition: Curation of Current genomic studies

Although sequencing costs have been dramatically reduced over the years, the cost of determining the functional consequences of such variants remains high, as thorough basic research studies must be conducted to validate the interpretation of variant data with respect to the underlying disease; only a small fraction of variants from a genome sequencing project will affect a functional protein.  Correct annotation of sequences and variants, and identification of the correct corresponding reference genes or transcripts in GENCODE or RefSeq, respectively, offer compelling challenges to the proper identification of sequenced variants as potential functional variants.

To this effect, the authors developed the Ensembl Variant Effect Predictor (VEP), which is a software suite that performs annotations and analysis of most types of genomic variation in coding and non-coding regions of the genome.

Summary of Features

  • Annotation: VEP can annotate two broad categories of genomic variants
    • Sequence variants with specific and defined changes: indels, base substitutions, SNVs, tandem repeats
    • Larger structural variants > 50 nucleotides
  • Species and assembly/genomic database support: VEP can analyze data from any species with an assembled genome sequence and an annotated gene set. VEP supports chromosome assemblies (such as the latest GRCh38), FASTA input, and transcripts from RefSeq as well as user-derived sequences
  • Transcript Annotation: VEP includes a wide variety of gene and transcript related information including NCBI Gene ID, Gene Symbol, Transcript ID, NCBI RefSeq ID, exon/intron information, and cross reference to other databases such as UniProt
  • Protein Annotation: Protein-related fields include Protein ID, RefSeq ID, SwissProt, UniParc ID, reference codons and amino acids, SIFT pathogenicity score, protein domains
  • Noncoding Annotation: VEP reports variants in noncoding regions including genomic regulatory regions, intronic regions, and transcription factor binding motifs. Data from ENCODE, BLUEPRINT, and the NIH Epigenomics Roadmap are used for primary annotation.  Perl plugins are also available to link other databases that annotate noncoding sequence features.
  • Frequency, phenotype, and citation annotation: VEP searches Ensembl databases containing a large amount of germline variant information and checks variants against the dbSNP single nucleotide polymorphism database. VEP integrates with mutational databases such as COSMIC, the Human Gene Mutation Database, and structural and copy number variants from Database of Genomic Variants.  Allele Frequencies are reported from 1000 Genomes and NHLBI and integrates with PubMed for literature annotation.  Phenotype information is from OMIM, Orphanet, GWAS and clinical information of variants from ClinVar.
  • Flexible Input and Output Formats: VEP supports the standard next-generation sequencing input format, the Variant Call Format (VCF), and can also process variant identifiers from other database formats.  Output formats are tab-delimited and give the user choices in the presentation of results (HTML or text based)
  • Choice of user interface
    • Online tool (VEP Web): simple point-and-click; incorporates Instant VEP functionality and copy-and-paste features. Results can be stored online in cloud storage on Ensembl.
    • VEP script: VEP is available as a downloadable Perl script (see below for link) and can process large amounts of data rapidly. This interface is powerfully flexible, with the ability to integrate multiple plugins available from Ensembl and GitHub.  The ability to alter the Perl code and add plugins allows any feature of VEP to be modified.
    • VEP REST API: provides robust computational access from any programming language and returns basic variant annotation; it can make use of external plugins. A minimal example query is sketched after this list.
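As a concrete illustration of the REST interface, the sketch below queries the public Ensembl REST server for a variant given in HGVS notation. The endpoint path, headers, and the most_severe_consequence response field follow Ensembl's public REST documentation as of this writing, and the example transcript-level variant is illustrative; treat the exact URL and fields as assumptions to verify against rest.ensembl.org.

```python
# A minimal sketch of a VEP REST query (endpoint per the rest.ensembl.org docs;
# the example HGVS variant is illustrative only).
import requests

SERVER = "https://rest.ensembl.org"

def vep_annotate(hgvs: str, species: str = "human") -> list:
    """Fetch VEP consequence annotations for a variant given in HGVS notation."""
    resp = requests.get(
        f"{SERVER}/vep/{species}/hgvs/{hgvs}",
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for record in vep_annotate("ENST00000366667:c.803C>T"):
        # 'most_severe_consequence' is one of the documented response fields
        print(record.get("most_severe_consequence"))
```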
Watch a video of the VEP instructional webinar: https://youtu.be/7Fs7MHfXjWk

Watch a video of the VEP Web training on how to analyze your sequence in VEP.

Availability of data and materials

The dataset supporting the conclusions of this article is available from Illumina’s Platinum Genomes [93] and using the Ensembl release 75 gene set. Pre-built data sets are available for all Ensembl and Ensembl Genomes species [94]. They can also be downloaded automatically during set up whilst installing the VEP.

References

Large-scale discovery of novel genetic causes of developmental disorders.

Deciphering Developmental Disorders Study.

Nature. 2015 Mar 12;519(7542):223-8. doi: 10.1038/nature14135. PMID: 25533962

Updated 11/15/2018

Research Points to Caution in Use of Variant Effect Prediction Bioinformatic Tools

Although we have the ability to use high throughput sequencing to identify allelic variants occurring in rare disease, correlation of these variants with the underlying disease is often difficult due to a few concerns:

  • For rare sporadic diseases, classical gene/variant association studies have proven difficult to perform (Meyts et al. 2016)
  • As Whole Exome Sequencing (WES) returns a considerable number of variants, how to differentiate the normal allelic variation found in the human population from disease-causing pathogenic alleles
  • For rare diseases, pathogenic allele frequencies are generally low

Therefore, for these rare pathogenic alleles, the use of bioinformatics tools in order to predict the resulting changes in gene function may provide insight into disease etiology when validation of these allelic changes might be experimentally difficult.

In a 2017 Genes & Immunity paper, Line Lykke Andersen and Rune Hartmann tested the reliability of various bioinformatic tools in predicting the functional consequence of variants of six different genes involved in interferon induction and of sixteen allelic variants of the IFNLR1 gene.  These variants were found in cohorts of patients presenting with herpes simplex encephalitis (HSE). Most of the adult population is seropositive for Herpes Simplex Virus (HSV); however, a minor fraction (1 in 250,000 individuals per year) of HSV-infected individuals will develop HSE (Hjalmarsson et al., 2007).  It has been suggested that HSE occurs in individuals with rare primary immunodeficiencies caused by gene defects affecting innate immunity through reduced production of interferons (IFN) (Zhang et al., Lim et al.).

References

Meyts I, Bosch B, Bolze A, Boisson B, Itan Y, Belkadi A, et al. Exome and genome sequencing for inborn errors of immunity. J Allergy Clin Immunol. 2016;138:957–69.

Hjalmarsson A, Blomqvist P, Skoldenberg B. Herpes simplex encephalitis in Sweden, 1990-2001: incidence, morbidity, and mortality. Clin Infect Dis. 2007;45:875–80.

Zhang SY, Jouanguy E, Ugolini S, Smahi A, Elain G, Romero P, et al. TLR3 deficiency in patients with herpes simplex encephalitis. Science. 2007;317:1522–7.

Lim HK, Seppanen M, Hautala T, Ciancanelli MJ, Itan Y, Lafaille FG, et al. TLR3 deficiency in herpes simplex encephalitis: high allelic heterogeneity and recurrence risk. Neurology. 2014;83:1888–97.

Genes Immun. 2017 Dec 4. doi: 10.1038/s41435-017-0002-z.

Frequently used bioinformatics tools overestimate the damaging effect of allelic variants.

Andersen LL, Terczyńska-Dyla E, Mørk N, Scavenius C, Enghild JJ, Höning K, Hornung V, Christiansen M, Mogensen TH, Hartmann R.

Abstract

We selected two sets of naturally occurring human missense allelic variants within innate immune genes. The first set represented eleven non-synonymous variants in six different genes involved in interferon (IFN) induction, present in a cohort of patients suffering from herpes simplex encephalitis (HSE) and the second set represented sixteen allelic variants of the IFNLR1 gene. We recreated the variants in vitro and tested their effect on protein function in a HEK293T cell based assay. We then used an array of 14 available bioinformatics tools to predict the effect of these variants upon protein function. To our surprise two of the most commonly used tools, CADD and SIFT, produced a high rate of false positives, whereas SNPs&GO exhibited the lowest rate of false positives in our test. As the problem in our test in general was false positive variants, inclusion of mutation significance cutoff (MSC) did not improve accuracy.

Methodology

  1. Identification of rare variants
  2. The genomes of nineteen Dutch patients with a history of HSE were sequenced by WES, and candidate HSE-causing variants were identified by filtering for single nucleotide polymorphisms (SNPs) that had a frequency below 1% in the NHLBI Exome Sequencing Project Exome Variant Server and the 1000 Genomes Project and were present within 204 genes involved in the immune response to HSV.
  3. Identified variants (204) were manually evaluated for involvement in IFN induction based on IDBase and KEGG pathway database analysis.
  4. In-silico predictions: variants were classified by the in silico variant pathogenicity prediction programs SIFT, Mutation Assessor, FATHMM, PROVEAN, SNAP2, PolyPhen2, PhD-SNP, SNPs&GO, FATHMM-MKL, MutationTaster2, PredictSNP, Condel, Meta-SNP, and CADD. Each program returned prediction scores measuring the likelihood of a variant being either ‘deleterious’ or ‘neutral’. Prediction accuracy was measured as

ACC = (true positive+true negative)/(true positive+true negative+false positive+false negative)

  5. Validation of prediction software/tools

In order to validate the predictive value of the software, HEK293T cells deficient in IRF3, MAVS, or IKKε/TBK1 were cotransfected with the nine variants of the aforementioned genes and a luciferase reporter under control of the IFN-β promoter, and luciferase activity was measured as an indicator of IFN signaling function.  Western blotting was performed to confirm expression of the constructs.

Results

Table 2. Summary of the bioinformatic predictions

                     HSE variants                 IFNLR1 variants              Overall
Tool                 TN  TP  FN  FP  Total  ACC   TN  TP  FN  FP  Total  ACC   ACC

Uniform cutoff
SIFT                  4   1   0   4    9   0.56    8   1   0   7   16   0.56   0.56
Mutation Assessor     6   1   0   2    9   0.78    9   1   0   6   16   0.63   0.68
FATHMM                7   1   0   1    9   0.89    –   –   –   –    –    –     0.89
PROVEAN               8   1   0   0    9   1.00   11   1   0   4   16   0.75   0.84
SNAP2                 5   1   0   3    9   0.67    8   0   1   7   16   0.50   0.56
PolyPhen2             6   1   0   2    9   0.78   12   1   0   3   16   0.81   0.80
PhD-SNP               7   1   0   1    9   0.89   11   1   0   4   16   0.75   0.80
SNPs&GO               8   1   0   0    9   1.00   14   1   0   1   16   0.94   0.96
FATHMM-MKL            4   1   0   4    9   0.56   13   0   1   2   16   0.81   0.72
MutationTaster2       4   0   1   4    9   0.44   14   0   1   1   16   0.88   0.72
PredictSNP            6   1   0   2    9   0.78   11   1   0   4   16   0.75   0.76
Condel                6   1   0   2    9   0.78    –   –   –   –    –    –     0.78
Meta-SNP              8   1   0   0    9   1.00   11   1   0   4   16   0.75   0.84
CADD                  2   1   0   6    9   0.33    8   0   1   7   16   0.50   0.44

MSC 95% cutoff
SIFT                  5   1   0   3    9   0.67    8   1   0   8   16   0.50   0.56
PolyPhen2             6   1   0   2    9   0.78   13   1   0   3   16   0.81   0.80
CADD                  4   1   0   4    9   0.56    7   0   1   9   16   0.44   0.48

Note: TN: true negative, TP: true positive, FN: false negative, FP: false positive, ACC: accuracy

Functional testing (data obtained from the reporter construct experiments) was considered the correct outcome.

Three prediction tools (PROVEAN, SNPs&GO, and Meta-SNP) correctly predicted the effect of all nine HSE variants tested.
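The ACC values in Table 2 can be recomputed directly from the confusion counts; as a check, the snippet below reproduces the SNPs&GO row, whose per-set and overall accuracies match the table.

```python
def acc(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# SNPs&GO, uniform cutoff: HSE set (TN=8, TP=1, FN=0, FP=0), IFNLR1 set (TN=14, TP=1, FN=0, FP=1)
print(acc(tp=1, tn=8, fp=0, fn=0))                    # 1.0  -> matches the HSE ACC
print(acc(tp=1 + 1, tn=8 + 14, fp=0 + 1, fn=0 + 0))   # 0.96 -> matches the overall ACC
```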

Updated 02/07/2021

InMeRF: prediction of pathogenicity of missense variants by individual modeling for each amino acid substitution
Jun-Ichi Takeda, Kentaro Nanatsue, Ryosuke Yamagishi, Mikako Ito, Nobuhiko Haga, Hiromi Hirata, Tomoo Ogi, Kinji Ohno. NAR Genomics and Bioinformatics. 2020 May 26;2(2):lqaa038. doi: 10.1093/nargab/lqaa038. eCollection 2020 Jun.

Abstract

In predicting the pathogenicity of a nonsynonymous single-nucleotide variant (nsSNV), a radical change in amino acid properties is prone to be classified as being pathogenic. However, not all such nsSNVs are associated with human diseases. We generated random forest (RF) models individually for each amino acid substitution to differentiate pathogenic nsSNVs in the Human Gene Mutation Database and common nsSNVs in dbSNP. We named a set of our models ‘Individual Meta RF’ (InMeRF). Ten-fold cross-validation of InMeRF showed that the areas under the curves (AUCs) of receiver operating characteristic (ROC) and precision-recall curves were on average 0.941 and 0.957, respectively. To compare InMeRF with seven other tools, the eight tools were generated using the same training dataset, and were compared using the same three testing datasets. ROC-AUCs of InMeRF were ranked first in the eight tools. We applied InMeRF to 155 pathogenic and 125 common nsSNVs in seven major genes causing congenital myasthenic syndromes, as well as in VANGL1 causing spina bifida, and found that the sensitivity and specificity of InMeRF were 0.942 and 0.848, respectively. We made the InMeRF web service, and also made genome-wide InMeRF scores available online (https://www.med.nagoya-u.ac.jp/neurogenetics/InMeRF/).

Source: https://pubmed.ncbi.nlm.nih.gov/33543123/
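The core InMeRF idea, training a separate random-forest model per amino-acid substitution rather than one global classifier, can be sketched with scikit-learn as follows. The feature vectors, labels, and substitution keys here are synthetic placeholders; the actual models are trained on HGMD pathogenic versus dbSNP common nsSNVs with curated features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
models = {}
for substitution in ["R->H", "G->E"]:                # one independent model per substitution
    X = rng.normal(size=(200, 10))                   # placeholder conservation/structure features
    y = rng.integers(0, 2, size=200)                 # 1 = pathogenic (HGMD), 0 = common (dbSNP)
    models[substitution] = RandomForestClassifier(
        n_estimators=100, random_state=0
    ).fit(X, y)

variant = rng.normal(size=(1, 10))                   # features of a new R->H variant
print(models["R->H"].predict_proba(variant))         # [P(benign), P(pathogenic)]
```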

ADDRESS: A database of disease-associated human variants incorporating protein structure and folding stabilities
Jaie Woodard, Chengxin Zhang, Yang Zhang. J Mol Biol. 2021 Feb 1;166840. doi: 10.1016/j.jmb.2021.166840.

Abstract

Numerous human diseases are caused by mutations in genomic sequences. Since amino acid changes affect protein function through mechanisms often predictable from protein structure, the integration of structural and sequence data enables us to estimate with greater accuracy whether and how a given mutation will lead to disease. Publicly available annotated databases enable hypothesis assessment and benchmarking of prediction tools. However, the results are often presented as summary statistics or black box predictors, without providing full descriptive information. We developed a new semi-manually curated human variant database presenting information on the protein contact-map, sequence-to-structure mapping, amino acid identity change, and stability prediction for the popular UniProt database. We found that the profiles of pathogenic and benign missense polymorphisms can be effectively deduced using decision trees and comparative analyses based on the presented dataset. The database is made publicly available through https://zhanglab.ccmb.med.umich.edu/ADDRESS.

Source: https://pubmed.ncbi.nlm.nih.gov/33539887/

PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

Abstract

Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enables routine scans for deletions in large-scale sequencing studies.

Source: https://pubmed.ncbi.nlm.nih.gov/33526789/

Other articles related to Genomics and Bioinformatics on this online Open Access Journal Include:

Finding the Genetic Links in Common Disease: Caveats of Whole Genome Sequencing Studies

Large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes

US Personalized Cancer Genome Sequencing Market Outlook 2018 –

Icelandic Population Genomic Study Results by deCODE Genetics come to Fruition: Curation of Current genomic studies

Read Full Post »
