Bioinformatic Tools for Cancer Mutational Analysis: COSMIC and Beyond

Curator: Stephen J. Williams, Ph.D.

Updated 7/26/2019

Updated 04/27/2019

Signatures of Mutational Processes in Human Cancer (from COSMIC)

From The COSMIC Database


The genomic landscape of cancer. The COSMIC database has a fully curated and annotated database of recurrent genetic mutations found in various cancers (data taken from cancer sequencing projects). For an interactive map please go to the COSMIC database here: http://cancer.sanger.ac.uk/cosmic



Somatic mutations are present in all cells of the human body and occur throughout life. They are the consequence of multiple mutational processes, including the intrinsic slight infidelity of the DNA replication machinery, exogenous or endogenous mutagen exposures, enzymatic modification of DNA and defective DNA repair. Different mutational processes generate unique combinations of mutation types, termed “Mutational Signatures”.

In the past few years, large-scale analyses have revealed many mutational signatures across the spectrum of human cancer types [Nik-Zainal S. et al., Cell (2012); Alexandrov L.B. et al., Cell Reports (2013); Alexandrov L.B. et al., Nature (2013); Helleday T. et al., Nat Rev Genet (2014); Alexandrov L.B. and Stratton M.R., Curr Opin Genet Dev (2014)]. However, as the number of mutational signatures grows, the need for a curated census of signatures has become apparent. Here, we deliver such a resource by providing the profiles of, and additional information about, known mutational signatures.

The current set of mutational signatures is based on an analysis of 10,952 exomes and 1,048 whole-genomes across 40 distinct types of human cancer. These analyses are based on curated data that were generated by The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and a large set of freely available somatic mutations published in peer-reviewed journals. Complete details about the data sources will be provided in future releases of COSMIC.

The profile of each signature is displayed using the six substitution subtypes: C>A, C>G, C>T, T>A, T>C, and T>G (all substitutions are referred to by the pyrimidine of the mutated Watson–Crick base pair). Further, each of the substitutions is examined by incorporating information on the bases immediately 5’ and 3’ to each mutated base generating 96 possible mutation types (6 types of substitution ∗ 4 types of 5’ base ∗ 4 types of 3’ base). Mutational signatures are displayed and reported based on the observed trinucleotide frequency of the human genome, i.e., representing the relative proportions of mutations generated by each signature based on the actual trinucleotide frequencies of the reference human genome version GRCh37. Note that only validated mutational signatures have been included in the curated census of mutational signatures.
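To make the 96-channel classification concrete, here is a minimal Python sketch that enumerates the mutation types (6 substitutions × 4 five-prime bases × 4 three-prime bases) and re-expresses a purine-referenced substitution by the pyrimidine of the base pair. The function names and the `A[C>T]G` label format are assumptions chosen for illustration; they are not taken from COSMIC's own files.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# The six substitution subtypes, each named by the pyrimidine of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def mutation_types():
    """Return all 96 labels in the (assumed) form 5'[ref>alt]3', e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


def to_pyrimidine_context(five, ref, alt, three):
    """Express a substitution and its flanking bases by the pyrimidine strand.

    If the reference base is a purine (A or G), the trinucleotide is
    reverse-complemented so that the mutated base is reported as C or T.
    """
    if ref in "CT":
        return f"{five}[{ref}>{alt}]{three}"

    def comp(base):
        return base.translate(COMPLEMENT)

    # Reverse complement: the 3' neighbour becomes the 5' neighbour and vice versa.
    return f"{comp(three)}[{comp(ref)}>{comp(alt)}]{comp(five)}"


if __name__ == "__main__":
    labels = mutation_types()
    print(len(labels))                               # number of channels
    print(to_pyrimidine_context("A", "G", "A", "C"))  # G>A in AGC context
```

A G>A change in an AGC context is the same event as a C>T change in a GCT context on the opposite strand, which is why only six substitution subtypes are needed.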

Additional information is provided for each signature, including the cancer types in which the signature has been found, proposed aetiology for the mutational processes underlying the signature, other mutational features that are associated with each signature and information that may be relevant for better understanding of a particular mutational signature.

The set of signatures will be updated in the future. This will include incorporating additional mutation types (e.g., indels, structural rearrangements, and localized hypermutation such as kataegis) and cancer samples. With more cancer genome sequences and the additional statistical power this will bring, new signatures may be found, the profiles of current signatures may be further refined, signatures may split into component signatures and signatures may be found in cancer types in which they are currently not detected.

See their COSMIC tutorial page here for instructional videos

Updated News: COSMIC v75 – 24th November 2015

COSMIC v75 includes curations across GRIN2A, fusion pair TCF3-PBX1, and genomic data from 17 systematic screen publications. We are also beginning a reannotation of TCGA exome datasets using Sanger’s Cancer Genome Project analysis pipeline to ensure consistency; four studies are included in this release, to be expanded across the next few releases. The Cancer Gene Census now has a dedicated curator, Dr. Zbyslaw Sondka, who will be focused on expanding the Census, enhancing the evidence underpinning it, and developing improved expert-curated detail describing each gene’s impact in cancer. Finally, as we begin to streamline our ever-growing website, we have combined all information for each gene onto one page and simplified the layout and design to improve navigation.

Mutational signatures across human cancer

Patterns of mutational signatures [Download signatures]

 COSMIC database identifies 30 mutational signatures in human cancer

Please go to the COSMIC site to see a larger .png of the mutational signatures

Signature 1

Cancer types:

Signature 1 has been found in all cancer types and in most cancer samples.

Proposed aetiology:

Signature 1 is the result of an endogenous mutational process initiated by spontaneous deamination of 5-methylcytosine.

Additional mutational features:

Signature 1 is associated with small numbers of small insertions and deletions in most tissue types.


The number of Signature 1 mutations correlates with age of cancer diagnosis.

Signature 2

Cancer types:

Signature 2 has been found in 22 cancer types, but most commonly in cervical and bladder cancers. In most of these 22 cancer types, Signature 2 is present in at least 10% of samples.

Proposed aetiology:

Signature 2 has been attributed to activity of the AID/APOBEC family of cytidine deaminases. On the basis of similarities in the sequence context of cytosine mutations caused by APOBEC enzymes in experimental systems, a role for APOBEC1, APOBEC3A and/or APOBEC3B in human cancer appears more likely than for other members of the family.

Additional mutational features:

Transcriptional strand bias of mutations has been observed in exons, but is not present or is weaker in introns.


Signature 2 is usually found in the same samples as Signature 13. It has been proposed that activation of AID/APOBEC cytidine deaminases is due to viral infection, retrotransposon jumping or to tissue inflammation. Currently, there is limited evidence to support these hypotheses. A germline deletion polymorphism involving APOBEC3A and APOBEC3B is associated with the presence of large numbers of Signature 2 and 13 mutations and with predisposition to breast cancer. Mutations of similar patterns to Signatures 2 and 13 are commonly found in the phenomenon of local hypermutation present in some cancers, known as kataegis, potentially implicating AID/APOBEC enzymes in this process as well.

Signature 3

Cancer types:

Signature 3 has been found in breast, ovarian, and pancreatic cancers.

Proposed aetiology:

Signature 3 is associated with failure of DNA double-strand break-repair by homologous recombination.

Additional mutational features:

Signature 3 associates strongly with elevated numbers of large (longer than 3bp) insertions and deletions with overlapping microhomology at breakpoint junctions.


Signature 3 is strongly associated with germline and somatic BRCA1 and BRCA2 mutations in breast, pancreatic, and ovarian cancers. In pancreatic cancer, responders to platinum therapy usually exhibit Signature 3 mutations.

Signature 4

Cancer types:

Signature 4 has been found in head and neck cancer, liver cancer, lung adenocarcinoma, lung squamous carcinoma, small cell lung carcinoma, and oesophageal cancer.

Proposed aetiology:

Signature 4 is associated with smoking and its profile is similar to the mutational pattern observed in experimental systems exposed to tobacco carcinogens (e.g., benzo[a]pyrene). Signature 4 is likely due to tobacco mutagens.

Additional mutational features:

Signature 4 exhibits transcriptional strand bias for C>A mutations, compatible with the notion that damage to guanine is repaired by transcription-coupled nucleotide excision repair. Signature 4 is also associated with CC>AA dinucleotide substitutions.


Signature 29 is found in cancers associated with tobacco chewing and appears different from Signature 4.

Signature 5

Cancer types:

Signature 5 has been found in all cancer types and most cancer samples.

Proposed aetiology:

The aetiology of Signature 5 is unknown.

Additional mutational features:

Signature 5 exhibits transcriptional strand bias for T>C substitutions at ApTpN context.


Signature 6

Cancer types:

Signature 6 has been found in 17 cancer types and is most common in colorectal and uterine cancers. In most other cancer types, Signature 6 is found in less than 3% of examined samples.

Proposed aetiology:

Signature 6 is associated with defective DNA mismatch repair and is found in microsatellite unstable tumours.

Additional mutational features:

Signature 6 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.


Signature 6 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 15, 20, and 26.

Signature 7

Cancer types:

Signature 7 has been found predominantly in skin cancers and in cancers of the lip categorized as head and neck or oral squamous cancers.

Proposed aetiology:

Based on its prevalence in ultraviolet exposed areas and the similarity of the mutational pattern to that observed in experimental systems exposed to ultraviolet light, Signature 7 is likely due to ultraviolet light exposure.

Additional mutational features:

Signature 7 is associated with large numbers of CC>TT dinucleotide mutations at dipyrimidines. Additionally, Signature 7 exhibits a strong transcriptional strand-bias indicating that mutations occur at pyrimidines (viz., by formation of pyrimidine-pyrimidine photodimers) and these mutations are being repaired by transcription-coupled nucleotide excision repair.


Signature 8

Cancer types:

Signature 8 has been found in breast cancer and medulloblastoma.

Proposed aetiology:

The aetiology of Signature 8 remains unknown.

Additional mutational features:

Signature 8 exhibits weak strand bias for C>A substitutions and is associated with double nucleotide substitutions, notably CC>AA.


Signature 9

Cancer types:

Signature 9 has been found in chronic lymphocytic leukaemias and malignant B-cell lymphomas.

Proposed aetiology:

Signature 9 is characterized by a pattern of mutations that has been attributed to polymerase η, which is implicated in the activity of AID during somatic hypermutation.

Additional mutational features:


Chronic lymphocytic leukaemias that possess immunoglobulin gene hypermutation (IGHV-mutated) have elevated numbers of mutations attributed to Signature 9 compared to those that do not have immunoglobulin gene hypermutation.

Signature 10

Cancer types:

Signature 10 has been found in six cancer types, notably colorectal and uterine cancer, usually generating huge numbers of mutations in small subsets of samples.

Proposed aetiology:

It has been proposed that the mutational process underlying this signature is altered activity of the error-prone polymerase POLE. The presence of large numbers of Signature 10 mutations is associated with recurrent POLE somatic mutations, viz., Pro286Arg and Val411Leu.

Additional mutational features:

Signature 10 exhibits strand bias for C>A mutations at TpCpT context and T>G mutations at TpTpT context.


Signature 10 is associated with some of the most mutated cancer samples. Samples exhibiting this mutational signature have been termed ultra-hypermutators.

Signature 11

Cancer types:

Signature 11 has been found in melanoma and glioblastoma.

Proposed aetiology:

Signature 11 exhibits a mutational pattern resembling that of alkylating agents. Patient histories have revealed an association between treatments with the alkylating agent temozolomide and Signature 11 mutations.

Additional mutational features:

Signature 11 exhibits a strong transcriptional strand-bias for C>T substitutions indicating that mutations occur on guanine and that these mutations are effectively repaired by transcription-coupled nucleotide excision repair.


Signature 12

Cancer types:

Signature 12 has been found in liver cancer.

Proposed aetiology:

The aetiology of Signature 12 remains unknown.

Additional mutational features:

Signature 12 exhibits a strong transcriptional strand-bias for T>C substitutions.


Signature 12 usually contributes a small percentage (<20%) of the mutations observed in a liver cancer sample.

Signature 13

Cancer types:

Signature 13 has been found in 22 cancer types and seems to be commonest in cervical and bladder cancers. In most of these 22 cancer types, Signature 13 is present in at least 10% of samples.

Proposed aetiology:

Signature 13 has been attributed to activity of the AID/APOBEC family of cytidine deaminases converting cytosine to uracil. On the basis of similarities in the sequence context of cytosine mutations caused by APOBEC enzymes in experimental systems, a role for APOBEC1, APOBEC3A and/or APOBEC3B in human cancer appears more likely than for other members of the family. Signature 13 causes predominantly C>G mutations. This may be due to generation of abasic sites after removal of uracil by base excision repair and replication over these abasic sites by REV1.

Additional mutational features:

Transcriptional strand bias of mutations has been observed in exons, but is not present or is weaker in introns.


Signature 2 is usually found in the same samples as Signature 13. It has been proposed that activation of AID/APOBEC cytidine deaminases is due to viral infection, retrotransposon jumping or to tissue inflammation. Currently, there is limited evidence to support these hypotheses. A germline deletion polymorphism involving APOBEC3A and APOBEC3B is associated with the presence of large numbers of Signature 2 and 13 mutations and with predisposition to breast cancer. Mutations of similar patterns to Signatures 2 and 13 are commonly found in the phenomenon of local hypermutation present in some cancers, known as kataegis, potentially implicating AID/APOBEC enzymes in this process as well.

Signature 14

Cancer types:

Signature 14 has been observed in four uterine cancers and a single adult low-grade glioma sample.

Proposed aetiology:

The aetiology of Signature 14 remains unknown.

Additional mutational features:


Signature 14 generates very high numbers of somatic mutations (>200 mutations per MB) in all samples in which it has been observed.

Signature 15

Cancer types:

Signature 15 has been found in several stomach cancers and a single small cell lung carcinoma.

Proposed aetiology:

Signature 15 is associated with defective DNA mismatch repair.

Additional mutational features:

Signature 15 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.


Signature 15 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 20, and 26.

Signature 16

Cancer types:

Signature 16 has been found in liver cancer.

Proposed aetiology:

The aetiology of Signature 16 remains unknown.

Additional mutational features:

Signature 16 exhibits an extremely strong transcriptional strand bias for T>C mutations at ApTpN context, with T>C mutations occurring almost exclusively on the transcribed strand.


Signature 17

Cancer types:

Signature 17 has been found in oesophageal cancer, breast cancer, liver cancer, lung adenocarcinoma, B-cell lymphoma, stomach cancer and melanoma.

Proposed aetiology:

The aetiology of Signature 17 remains unknown.

Additional mutational features:


Signature 18

Cancer types:

Signature 18 has been found commonly in neuroblastoma. Additionally, Signature 18 has also been observed in breast and stomach carcinomas.

Proposed aetiology:

The aetiology of Signature 18 remains unknown.

Additional mutational features:


Signature 19

Cancer types:

Signature 19 has been found only in pilocytic astrocytoma.

Proposed aetiology:

The aetiology of Signature 19 remains unknown.

Additional mutational features:


Signature 20

Cancer types:

Signature 20 has been found in stomach and breast cancers.

Proposed aetiology:

Signature 20 is believed to be associated with defective DNA mismatch repair.

Additional mutational features:

Signature 20 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.


Signature 20 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 15, and 26.

Signature 21

Cancer types:

Signature 21 has been found only in stomach cancer.

Proposed aetiology:

The aetiology of Signature 21 remains unknown.

Additional mutational features:


Signature 21 is found in only four samples, all generated by the same sequencing centre. The mutational pattern of Signature 21 is somewhat similar to that of Signature 26. Additionally, Signature 21 is found only in samples that also have Signatures 15 and 20. As such, Signature 21 is probably also related to microsatellite unstable tumours.

Signature 22

Cancer types:

Signature 22 has been found in urothelial (renal pelvis) carcinoma and liver cancers.

Proposed aetiology:

Signature 22 has been found in cancer samples with known exposures to aristolochic acid. Additionally, the pattern of mutations exhibited by the signature is consistent with the one previously observed in experimental systems exposed to aristolochic acid.

Additional mutational features:

Signature 22 exhibits a very strong transcriptional strand bias for T>A mutations indicating adenine damage that is being repaired by transcription-coupled nucleotide excision repair.


Signature 22 has a very high mutational burden in urothelial carcinoma; however, its mutational burden is much lower in liver cancers.

Signature 23

Cancer types:

Signature 23 has been found only in a single liver cancer sample.

Proposed aetiology:

The aetiology of Signature 23 remains unknown.

Additional mutational features:

Signature 23 exhibits very strong transcriptional strand bias for C>T mutations.


Signature 24

Cancer types:

Signature 24 has been observed in a subset of liver cancers.

Proposed aetiology:

Signature 24 has been found in cancer samples with known exposures to aflatoxin. Additionally, the pattern of mutations exhibited by the signature is consistent with that previously observed in experimental systems exposed to aflatoxin.

Additional mutational features:

Signature 24 exhibits a very strong transcriptional strand bias for C>A mutations indicating guanine damage that is being repaired by transcription-coupled nucleotide excision repair.


Signature 25

Cancer types:

Signature 25 has been observed in Hodgkin lymphomas.

Proposed aetiology:

The aetiology of Signature 25 remains unknown.

Additional mutational features:

Signature 25 exhibits transcriptional strand bias for T>A mutations.


This signature has only been identified in Hodgkin’s cell lines. Data is not available from primary Hodgkin lymphomas.

Signature 26

Cancer types:

Signature 26 has been found in breast cancer, cervical cancer, stomach cancer and uterine carcinoma.

Proposed aetiology:

Signature 26 is believed to be associated with defective DNA mismatch repair.

Additional mutational features:

Signature 26 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.


Signature 26 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 15 and 20.

Signature 27

Cancer types:

Signature 27 has been observed in a subset of kidney clear cell carcinomas.

Proposed aetiology:

The aetiology of Signature 27 remains unknown.

Additional mutational features:

Signature 27 exhibits very strong transcriptional strand bias for T>A mutations. Signature 27 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.


Signature 28

Cancer types:

Signature 28 has been observed in a subset of stomach cancers.

Proposed aetiology:

The aetiology of Signature 28 remains unknown.

Additional mutational features:


Signature 29

Cancer types:

Signature 29 has been observed only in gingivo-buccal oral squamous cell carcinoma.

Proposed aetiology:

Signature 29 has been found in cancer samples from individuals with a tobacco chewing habit.

Additional mutational features:

Signature 29 exhibits transcriptional strand bias for C>A mutations indicating guanine damage that is most likely repaired by transcription-coupled nucleotide excision repair. Signature 29 is also associated with CC>AA dinucleotide substitutions.


The Signature 29 pattern of C>A mutations due to tobacco chewing appears different from the pattern of mutations due to tobacco smoking reflected by Signature 4.

Signature 30

Cancer types:

Signature 30 has been observed in a small subset of breast cancers.

Proposed aetiology:

The aetiology of Signature 30 remains unknown.



Examples in the literature of deposits into or analysis from the COSMIC database

The Genomic Landscapes of Human Breast and Colorectal Cancers, from Wood et al., Science 318 (5853): 1108-1113 (2007)

“analysis of exons representing 20,857 transcripts from 18,191 genes, we conclude that the genomic landscapes of breast and colorectal cancers are composed of a handful of commonly mutated gene “mountains” and a much larger number of gene “hills” that are mutated at low frequency. ”

  • found cellular pathways that were preferentially mutated
  • analyzed a highly curated database (Metacore, GeneGo, Inc.) that includes human protein-protein interactions, signal transduction and metabolic pathways
  • There were 108 pathways that were found to be preferentially mutated in breast tumors. Many of the pathways involved phosphatidylinositol 3-kinase (PI3K) signaling
  • the cancer genome landscape consists of relief features (mutated genes) with heterogeneous heights (determined by CaMP scores). There are a few “mountains” representing individual CAN-genes mutated at high frequency. However, the landscapes contain a much larger number of “hills” representing the CAN-genes that are mutated at relatively low frequency. It is notable that this general genomic landscape (few gene mountains and many gene hills) is a common feature of both breast and colorectal tumors.
  • developed software to analyze multiple mutations and mutation frequencies available from Harvard Bioinformatics at





R Software for Cancer Mutation Analysis (download here)


CancerMutationAnalysis Version 1.0:

R package to reproduce the statistical analyses of the Sjoblom et al article and the associated Technical Comment. This package is built for reproducibility of the original results and not for flexibility. Future versions will be more general and define classes for the data types used. Further details are available in Working Paper 126.

CancerMutationAnalysis Version 2.0:

R package to reproduce the statistical analyses of the Wood et al article. Like its predecessor, this package is still built for reproducibility of the original results and not for flexibility. Further details are available in Working Paper 126.










Update 04/27/2019

Review 2018. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Z. Sondka et al. Nature Reviews Cancer. 2018.

The Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census (CGC) reevaluates the cancer genome landscape periodically and curates the findings into a database of genetic changes occurring in various tumor types. The 2018 CGC describes in detail the effect of 719 cancer-driving genes. The recent expansion includes functional and mechanistic descriptions of how each gene contributes to disease etiology, framed in terms of the cancer hallmarks described by Hanahan and Weinberg. These functional characteristics show the complexity of the cancer mutational landscape and genome and suggest “multiple cancer-related functions for many genes, which are often highly tissue-dependent or tumour stage-dependent.” The 2018 CGC also expands a second tier of genes, lengthening the list of cancer-related genes.

Criteria for curation of genes into CGC (curation process)

  • candidate genes are selected from published literature, conference abstracts, large cancer genome screens deposited in databases, and analysis of the current COSMIC database
  • COSMIC data are analyzed to determine presence of patterns of somatic mutations and frequency of such mutations in cancer
  • literature review to determine the role of the gene in cancer
  • Minimum evidence

– at least two publications from different groups showing increased mutation frequency in at least one type of cancer (PubMed)

– at least two publications from different groups showing experimental evidence of functional involvement in at least one hallmark of cancer in order to classify the mutant gene as oncogene, tumor suppressor, or fusion partner (like BCR-ABL)

  • independent assessment by at least two postdoctoral fellows
  • gene must be classified as either a Tier 1 or Tier 2 CGC gene
  • inclusion in database
  • continued curation efforts


Tier 1 gene: genes which have strong evidence from both mutational and functional analysis as being involved in cancer

Tier 2 gene: genes with mutational patterns typical of cancer drivers but not functionally characterized as well as genes with published mechanistic description of involvement in cancer but without proof of somatic mutations in cancer
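The two tier definitions above can be read as a simple decision rule. The following is a minimal sketch of that rule, assuming each gene has already been reduced to two yes/no judgments; the function name and its boolean inputs are hypothetical, and the real curation process relies on expert review rather than a mechanical test.

```python
from typing import Optional


def cgc_tier(mutational_evidence: bool, functional_evidence: bool) -> Optional[str]:
    """Assign a Cancer Gene Census tier from two (hypothetical) evidence flags.

    mutational_evidence: somatic mutation patterns typical of a cancer driver.
    functional_evidence: published mechanistic or functional involvement in cancer.
    """
    if mutational_evidence and functional_evidence:
        return "Tier 1"   # strong evidence from both kinds of analysis
    if mutational_evidence or functional_evidence:
        return "Tier 2"   # only one kind of evidence so far
    return None           # does not meet either definition


if __name__ == "__main__":
    print(cgc_tier(True, True))
    print(cgc_tier(True, False))
    print(cgc_tier(False, False))
```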

Current Status of Tier 1 and Tier 2 genes in CGC

Tier 1 genes (574 genes): include 79 oncogenes, 140 tumor suppressor genes, 93 fusion partners

Tier 2 genes (719 genes): include 103 oncogenes, 181 tumor suppressors, 134 fusion partners and 31 with unknown function

Updated 7/26/2019

The COSMIC database is undergoing an extensive update and reannotation, in order to ensure standardisation and modernisation across COSMIC data. This will substantially improve the identification of unique variants that may have been described at the genome, transcript and/or protein level. The introduction of a Genomic Identifier, along with complete annotation across multiple, high quality Ensembl transcripts and improved compliance with current HGVS syntax, will enable variant matching both within COSMIC and across other bioinformatic datasets.

As a result of these updates there will be significant changes in the upcoming releases as we work through this process. The first stage of this work was the introduction of improved HGVS syntax compliance in our May release. The majority of the changes will be reflected in COSMIC v90, which will be released in late August or early September, and the remaining changes will be introduced over the next few releases.

The significant changes in v90 include:

  • Updated genes, transcripts and proteins from Ensembl release 93 on both the GRCh37 and GRCh38 assemblies.
  • Full reannotation of COSMIC variants with known genomic coordinates using Ensembl’s Variant Effect Predictor (VEP). This provides accurate and standardised annotation uniformly across all relevant transcripts and genes that include the genomic location of the variant.
  • New stable genomic identifiers (COSV) that indicate the definitive position of the variant on the genome. These unique identifiers allow variants to be mapped between GRCh37 and GRCh38 assemblies and displayed on a selection of transcripts.
  • Updated cross-reference links between COSMIC genes and other widely-used databases such as HGNC, RefSeq, Uniprot and CCDS.
  • Complete standardised representation of COSMIC variants, following the most recent HGVS recommendations, where possible.
  • Remapping of gene fusions on the updated transcripts on both the GRCh37 and GRCh38 assemblies, along with the genomic coordinates for the breakpoint positions.
  • Reduced redundancy of mutations. Duplicate variants have been merged into one representative variant.

Key points for you

COSMIC variants have been annotated on all relevant Ensembl transcripts across both the GRCh37 and GRCh38 assemblies from Ensembl release 93. New genomic identifiers (e.g. COSV56056643) are used, which refer to the variant change at the genomic level rather than gene, transcript or protein level and can thus be used universally. Existing COSM IDs will continue to be supported and will now be referred to as legacy identifiers, e.g. COSM476. The legacy identifiers (COSM) are still searchable. In the case of mutations without genomic coordinates, hence without a COSV identifier, COSM identifiers will continue to be used.
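For pipelines that hold a mix of old and new identifiers, a small helper can tell the two apart. This is a minimal sketch that assumes, based only on the two examples quoted above (COSV56056643 and COSM476), that an identifier is the prefix followed by digits; the function name and return labels are illustrative and not part of any COSMIC tool.

```python
import re

# Assumed formats, inferred from the examples in the announcement only.
COSV_RE = re.compile(r"^COSV\d+$")  # new genomic identifier
COSM_RE = re.compile(r"^COSM\d+$")  # legacy mutation identifier


def classify_cosmic_id(identifier: str) -> str:
    """Return 'genomic', 'legacy', or 'unknown' for a COSMIC-style identifier."""
    ident = identifier.strip()
    if COSV_RE.match(ident):
        return "genomic"
    if COSM_RE.match(ident):
        return "legacy"
    return "unknown"


if __name__ == "__main__":
    for example in ("COSV56056643", "COSM476", "rs12345"):
        print(example, classify_cosmic_id(example))
```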

All relevant Ensembl transcripts in COSMIC (which have been selected based on Ensembl canonical classification and on the quality of the dataset to include only GENCODE basic transcripts) will now have both accession and version numbers, so that the exact transcript is known, ensuring reproducibility. This also provides transparency and clarity as the data are updated.

How these changes will be reflected in the download files

As we are now mapping all variants on all relevant Ensembl transcripts, the number of rows in the majority of variant download files has increased significantly. In the download files, additional columns are provided including the legacy identifier (COSM) and the new genomic identifier (COSV). An internal mutation identifier is also provided to uniquely represent each mutation, on a specific transcript, on a given assembly build. The accession and version number for each transcript are included. File descriptions for each of the download files will be available from the downloads page for clarity. We have included an example of the new columns below.

For example: COSMIC Complete Mutation Data (Targeted screens)

    1. [17:Q] Mutation Id – An internal mutation identifier to uniquely represent each mutation on a specific transcript on a given assembly build.
    2. [18:R] Genomic Mutation Id – Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.
    3. [19:S] Legacy Mutation Id – Legacy mutation identifier (COSM) that will represent existing COSM mutation identifiers.
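The bracketed labels pair a 1-based column number with its spreadsheet letter (17 is Q, 18 is R, 19 is S). The sketch below shows that mapping and pulls the three identifier columns out of a file by position. It assumes a tab-delimited file with a single header row; the helper names and the placeholder identifiers in the usage example are made up for illustration, and the column positions should be checked against the file description for the release you download.

```python
import csv
import io


def column_letter(index: int) -> str:
    """Convert a 1-based column index to its spreadsheet letter(s), e.g. 17 -> 'Q'."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Column positions as listed in the announcement for the targeted screens file.
NEW_COLUMNS = {17: "Mutation Id", 18: "Genomic Mutation Id", 19: "Legacy Mutation Id"}


def extract_ids(tsv_text: str):
    """Yield one dict per data row holding only the three identifier columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    for row in reader:
        yield {name: row[position - 1] for position, name in NEW_COLUMNS.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in for a download file; all values are placeholders.
    header = "\t".join([f"col{i}" for i in range(1, 17)] + list(NEW_COLUMNS.values()))
    row = "\t".join(["x"] * 16 + ["111", "COSV0000001", "COSM0000001"])
    for position in NEW_COLUMNS:
        print(position, column_letter(position))
    print(list(extract_ids(header + "\n" + row + "\n")))
```

Reading by position keeps the sketch independent of exact header spelling, at the cost of breaking if a future release reorders the columns.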

We will shortly have some sample data that can be downloaded in the new table structure, to give you real data to manipulate and integrate, this will be available on the variant updates page.

How this affects you

We are aware that many of the changes we are making will affect integration into your pipelines and analytical platforms. By giving you advance notice of the changes, we hope much of this can be mitigated, and the end result of having clean, standardised data will be well worth any disruption. The variant updates page on the COSMIC website will provide a central point for this information and further technical details of the changes that we are making to COSMIC.

Kind Regards,
Wellcome Sanger Institute
Wellcome Genome Campus,
Hinxton CB10 1SA











Loss of Gene Islands May Promote a Cancer Cell’s Survival, Proliferation and Evolution: A new Hypothesis (and second paper validating model) on Oncogenesis from the Elledge Laboratory

Writer, Curator: Stephen J. Williams, Ph.D.

It is well established that a critical event in the transformation of a cell to the malignant state involves the mutation of hosts of oncogenes and tumor suppressor genes, which in turn leave the cell unable to properly control its proliferation. On a genomic scale, these mutations can result in gene amplifications, loss of heterozygosity (LOH), and epigenetic changes resulting in tumorigenesis. The “two hit hypothesis”, proposed by Dr. Al Knudson of Fox Chase Cancer Center[1], holds that two mutations in the same gene are required for tumorigenesis. It was initially put forward to explain the progression of retinoblastoma in children and indicates a recessive disease.

(Excerpts from a great article explaining the two-hit hypothesis are given at the end of this post.)

And although many tumor genomes display haploinsufficient tumor suppressor genes and fit the two-hit model quite nicely, recent data show that most tumors display recurrent hemizygous deletions within their genomes. Tumors display numerous recurrent hemizygous focal deletions that seem to contain no known tumor suppressor genes. For instance, a recent analysis of over three thousand tumors, including breast, bladder, pancreatic, ovarian and gastric cancers, found an average of more than 10 deletions per tumor and 82 regions of recurrent focal deletion.

It has been proposed that this large number of hemizygous deletions may be a result of:

  • a recessive tumor suppressor gene requiring mutation or silencing of the second allele
  • recurrence of the deletions because they are located in fragile sites (unstable genomic regions)
  • single-copy loss providing a selective advantage regardless of the other allele

Note: some definitions of hemizygosity are given below.  In general at any locus, each parental chromosome can have 3 deletion states:

  1. wild type
  2. large deletion
  3. small deletion

Hemizygous deletions involve only one allele rather than both, unlike the loss seen with a classic tumor suppressor such as TP53.

To see whether a single mutated allele of a tumor suppressor gene may be a causal event for tumorigenesis, Dr. Nicole Solimini and colleagues, from Dr. Stephen Elledge’s lab at Harvard, analyzed the regions of these hemizygous deletions for cancer-related genes and proposed a hypothesis they termed the cancer gene island model[2]. Dr. Solimini and colleagues analyzed whole-genome sequence data for 526 tumors in the COSMIC database, comparing them to a list generated from the Cancer Gene Census for homozygous loss-of-function mutations (mutations which result in a termination codon or frame-shift, producing a premature stop in the protein or an altered sequence leading to a nonfunctional protein).

Results of this analysis revealed:

  1. tumors have a wide range of deletions per tumor (most high-grade epithelial tumors, such as ovarian, bladder, pancreatic, and esophageal adenocarcinomas, had 10-14 deletions per tumor)
  2. tumors exhibited a wide range (2-16) of loss-of-function mutations
  3. ONLY 14 of 82 recurrent deletions contained a known tumor suppressor gene, and these were low frequency events
  4. most recurrent cancer deletions do not contain putative tumor suppressor genes

Therefore, as the authors suggest, an alternative to the two-hit hypothesis may account for the selective growth advantage conferred by these types of deletions. The genes affected by these low frequency hemizygous mutations fall into two general classes:

  1. STOP genes: suppressors of tumor growth and proliferation
  2. GO genes: growth enhancers and oncogenes

Identifying potential STOP genes

To identify the STOP and GO genes the authors performed a primary screen of an shRNA library in telomerase (hTERT) immortalized human mammary epithelial cells using increased PROLIFERATION as a screening endpoint to determine STOP genes and decreased proliferation and lethality (essential genes) to determine possible GO genes. An initial screen identified 3582 possible STOP genes.  Using further screens and higher stringency criteria which focused on:

  • Only genes which increased proliferation in independent triplicate screens
  • Validated by competition assays
  • Were enriched more than four fold in three independent shRNA screens

the authors were able to focus on and validate 878 genes to determine the molecular pathways involved in proliferation.
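The stringency criteria above can be pictured as a simple filter. The following is a minimal sketch under stated assumptions: the gene names and fold-enrichment values are hypothetical, the function name `candidate_stop_genes` is invented for this example, and the competition-assay validation step is not modelled. It is not the authors' actual analysis code.

```python
from typing import Dict, List

FOLD_CUTOFF = 4.0  # "more than four fold", as stated in the text


def candidate_stop_genes(enrichment: Dict[str, List[float]],
                         n_screens: int = 3) -> List[str]:
    """Keep genes enriched above the cutoff in every one of n_screens screens."""
    kept = []
    for gene, folds in enrichment.items():
        # A gene must have a value for each screen and pass the cutoff in all.
        if len(folds) == n_screens and all(f > FOLD_CUTOFF for f in folds):
            kept.append(gene)
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical gene names and fold-enrichment values, for illustration only.
    toy = {
        "GENE_A": [5.2, 6.1, 4.8],
        "GENE_B": [4.5, 3.9, 7.0],  # fails in the second screen
        "GENE_C": [8.0, 9.3],       # missing a replicate
    }
    print(candidate_stop_genes(toy))
```

Requiring every replicate to pass, rather than the average, is the stricter reading of "independent triplicate screens" and is the choice assumed here.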

These genes were involved in cell cycle regulation, apoptosis, and autophagy (which will be discussed in further posts).

To further validate that these putative STOP genes are relevant in human cancer, the list of validated STOP genes found in the screen was compared to the list of loss-of-function mutations in the 526 tumors in the COSMIC database. Surprisingly, the validated STOP gene list was significantly enriched for known and possibly NOVEL tumor suppressor genes, especially those with loss-of-function and deletion mutations, and the STOP genes also clustered in gene deletions in cancer. This not only validated the authors’ model system and method but suggests that hemizygous deletions in multiple STOP genes may contribute to tumorigenesis, as the function of the majority of STOP genes is to restrain tumorigenesis.
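The post does not say which statistical test the authors used to call the STOP gene list "significantly enriched". One common way to ask that kind of question is a hypergeometric over-representation test, sketched below with the standard library only. The function name and the numbers in the usage example are assumptions for illustration and are not figures from the paper.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int,
                         draws: int, observed: int) -> float:
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` belong to the gene set."""
    total = comb(population, draws)
    upper = min(successes, draws)
    # math.comb returns 0 when k > n, so impossible terms drop out naturally.
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 100 genes, 10 in the set of interest, 20 genes deleted,
    # 5 of the deleted genes belonging to the set.
    print(hypergeom_upper_tail(100, 10, 20, 5))
```

A small tail probability would indicate that the overlap is larger than expected by chance under this simple model.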

A few key conclusions from this study offer strength to an alternative view of oncogenesis NAMELY:

  • Loss of multiple STOP genes per deletion optimizes a cancer cell’s proliferative capacity
  • Cancer cells display an insignificant loss of GO genes, minimizing negative impacts on cellular fitness
  • Haploinsufficiency in multiple STOP genes can alter function in a way similar to complete loss of both alleles
  • Cancer evolution may result from selection of hemizygous loss of high number of STOP and low number of GO genes
  • Leads to a CANCER GENE ISLAND model where there is a clonal evolution of transformed cells due to selective pressures

A link to the supplemental data containing STOP and GO genes found in validation screens and KEGG analysis can be found at the following link:


A link to an interview with the authors, originally posted on Harvard’s site can be found here.

Cumulative Haploinsufficiency and Triplosensitivity Drive Aneuploidy Patterns and Shape the Cancer Genome; a new paper from the Elledge group in the journal Cell


A concern of the authors was the extent to which gene silencing could affect their model in tumors. The validation of the model was performed in cancer cell lines and compared to tumor genome sequence in publicly available databases; however, a follow-up paper by the same group shows that haploinsufficiency has a greater impact on the cancer genome than these studies had suggested.

In a follow-up paper by the Elledge group in the journal Cell[3], Theresa Davoli and colleagues, after analyzing 8,200 tumor-normal pairs, show there are many more cancer driver genes than had once been predicted. In addition, the distribution and potency of STOP genes, oncogenes, and essential genes (GO) contribute to the complex picture of aneuploidy seen in many sporadic tumors. The authors proposed that these results, together with their previous findings, show that haploinsufficiency plays a crucial role in shaping the cancer genome.

Hemizygosity and Haploinsufficiency

Below are a few definitions from Wikipedia:

Zygosity is the degree of similarity of the alleles for a trait in an organism.

Most eukaryotes have two matching sets of chromosomes; that is, they are diploid. Diploid organisms have the same loci on each of their two sets of homologous chromosomes, except that the sequences at these loci may differ between the two chromosomes in a matching pair and that a few chromosomes may be mismatched as part of a chromosomal sex-determination system. If both alleles of a diploid organism are the same, the organism is homozygous at that locus. If they are different, the organism is heterozygous at that locus. If one allele is missing, it is hemizygous, and, if both alleles are missing, it is nullizygous.
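The four terms in the definition above can be written as a small classifier. This is a minimal sketch: alleles are represented as plain strings, `None` stands for a missing (deleted) allele, and the function name `zygosity` is chosen only for this example.

```python
from typing import Optional


def zygosity(allele_a: Optional[str], allele_b: Optional[str]) -> str:
    """Classify a diploid locus; None stands for a missing (deleted) allele."""
    present = [a for a in (allele_a, allele_b) if a is not None]
    if len(present) == 0:
        return "nullizygous"   # both alleles missing
    if len(present) == 1:
        return "hemizygous"    # one allele missing
    return "homozygous" if present[0] == present[1] else "heterozygous"


if __name__ == "__main__":
    for pair in (("A", "A"), ("A", "a"), ("A", None), (None, None)):
        print(pair, "->", zygosity(*pair))
```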

Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and the single functional copy does not produce enough of a gene product (typically a protein) to bring about a wild-type condition, leading to an abnormal or diseased state. It is responsible for some but not all autosomal dominant disorders.

Al Knudson and The “Two-Hit Hypothesis” of Cancer

Excerpt from a Scientist article by Eugene Russo about Dr. Knudson’s Two hit Hypothesis;

for full article please follow the link http://www.the-scientist.com/?articles.view/articleNo/19649/title/-Two-Hit–Hypothesis/

The “two-hit” hypothesis was, according to many, among the more significant milestones in that rapid evolution of biomedical science. The theory explains the relationship between the hereditary and nonhereditary, or sporadic, forms of retinoblastoma, a rare cancer affecting one in 20,000 children. Years prior to the age of gene cloning, Knudson’s 1971 paper proposed that individuals will develop cancer of the retina if they either inherit one mutated retinoblastoma (Rb) gene and incur a second mutation (possibly environmentally induced) after conception, or if they incur two mutations or hits after conception.3 If only one Rb gene functions normally, the cancer is suppressed. Knudson dubbed these preventive genes anti-oncogenes; other scientists renamed them tumor suppressors.

When first introduced, the “two-hit” hypothesis garnered more interest from geneticists than from cancer researchers. Cancer researchers thought “even if it’s right, it may not have much significance for the world of cancer,” Knudson recalls. “But I had been taught from the early days that very often we learn fundamental things from unusual cases.” Knudson’s initial motivation for the model: a desire to understand the relationship between nonhereditary forms of cancer and the much rarer hereditary forms. He also hoped to elucidate the mechanism by which common cancers, such as those of the breast, stomach, and colon, become more prevalent with age.

According to the then-accepted somatic mutation theory, the more mutations, the greater the risk of cancer. But this didn’t jibe with Knudson’s own studies on childhood cancers, which suggested that, in the case of cancers such as retinoblastoma, disease onset peaks in early childhood. Knudson set out to determine the smallest number of cancer-inducing events necessary to cause cancer and the role of these events in hereditary vs. nonhereditary cancers. Based on existing data on cancer cases and some mathematical deduction, Knudson came up with the “two-hit” hypothesis.

Not until 1986, when researchers at the Whitehead Institute for Biomedical Research in Cambridge, Mass., cloned the Rb gene, would there be solid evidence to back up Knudson’s pathogenesis paradigm.4 “Even with the cloning of the gene, it wasn’t clear how general it would be,” says Knudson. There are, it turns out, several two-hit lesions, including polyposis, neurofibromatosis, and basal cell carcinoma syndrome. Other cancers show only some correspondence with the two-hit model. In the case of Wilms’ tumor, for example, the model accounts for about 15 percent of the cancer incidence; the remaining cases seem to be more complicated.


His seminal paper on the two-hit hypothesis[1]

A.G. Knudson, “Mutation and cancer: statistical study of retinoblastoma,” Proceedings of the National Academy of Sciences, 68:820-3, 1971.

The two hit hypothesis proposed by A.G. Knudson. A description, with video, of Dr. Knudson’s talk at AACR can be found at the following link (photo credited to A.G. Knudson and Fox Chase Cancer Center): http://www.fccc.edu/research/research-awards/knudson/index.html


1. Knudson AG, Jr.: Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences of the United States of America 1971, 68(4):820-823.

2. Solimini NL, Xu Q, Mermel CH, Liang AC, Schlabach MR, Luo J, Burrows AE, Anselmo AN, Bredemeyer AL, Li MZ et al: Recurrent hemizygous deletions in cancers may optimize proliferative potential. Science 2012, 337(6090):104-109.

3. Davoli T, Xu AW, Mengwasser KE, Sack LM, Yoon JC, Park PJ, Elledge SJ: Cumulative Haploinsufficiency and Triplosensitivity Drive Aneuploidy Patterns and Shape the Cancer Genome. Cell 2013, 155(4):948-962.

Other papers on this site on CANCER and MUTATION include:

Cancer Mutations Across the Landscape

Salivary Gland Cancer – Adenoid Cystic Carcinoma: Mutation Patterns: Exome- and Genome-Sequencing @ Memorial Sloan-Kettering Cancer Center

Whole exome somatic mutations analysis of malignant melanoma contributes to the development of personalized cancer therapy for this disease

Breast Cancer and Mitochondrial Mutations

Winning Over Cancer Progression: New Oncology Drugs to Suppress Passengers Mutations vs. Driver Mutations

Hold on. Mutations in Cancer do good.

Rewriting the Mathematics of Tumor Growth; Teams Use Math Models to Sort Drivers from Passengers

How mobile elements in “Junk” DNA promote cancer. Part 1: Transposon-mediated tumorigenesis.


Cancer Mutations Across the Landscape

Curator: Larry H. Bernstein, MD, FCAP

This is an up-to-date article about the significance of mutations found in 12 major types of cancer.

Mutational landscape and significance across 12 major cancer types

Cyriac Kandoth1*, Michael D. McLellan1*, Fabio Vandin2, Kai Ye1,3, Beifang Niu1, Charles Lu1, et al.

1The Genome Institute, Washington University in St Louis, Missouri 63108, USA. 2Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. 3Department of Genetics, Washington University in St Louis, Missouri 63108, USA. 4Department of Medicine, Washington University in St Louis, Missouri 63108, USA. 5Siteman Cancer Center, Washington University in St Louis, Missouri 63108, USA. 6Department of Mathematics, Washington University in St Louis, Missouri 63108, USA.

Nature 17 Oct 2013; 502. http://dx.doi.org/10.1038/nature12634

The Cancer Genome Atlas (TCGA) has used the latest sequencing and analysis methods to identify somatic variants across thousands of tumours. Here we present data and analytical results for point mutations and small insertions/deletions from 3,281 tumours across 12 tumour types as part of the TCGA Pan-Cancer effort. We illustrate the distributions of mutation frequencies, types and contexts across tumour types, and establish their links to tissues of origin, environmental/carcinogen influences, and DNA repair defects.

Using the integrated data sets, we identified 127 significantly mutated genes from well-known (for example, mitogen-activated protein kinase, phosphatidylinositol-3-OH kinase, Wnt/β-catenin and receptor tyrosine kinase signalling pathways, and cell cycle control) and emerging (for example, histone, histone modification, splicing, metabolism and proteolysis) cellular processes in cancer.

The average number of mutations in these significantly mutated genes varies across tumour types;

  1. most tumours have two to six, indicating that the number of driver mutations required during oncogenesis is relatively small.
  2. Mutations in transcriptional factors/regulators show tissue specificity, whereas
  3. histone modifiers are often mutated across several cancer types.

Clinical association analysis identifies genes having a significant effect on survival, and

  • investigations of mutations with respect to clonal/subclonal architecture delineate their temporal orders during tumorigenesis.

Taken together, these results lay the groundwork for developing new diagnostics and individualizing cancer treatment.


The advancement of DNA sequencing technologies now enables the processing of thousands of tumours of many types for systematic mutation discovery. This expansion of scope, coupled with appreciable progress in algorithms1–5, has led directly to characterization of significant functional mutations, genes and pathways6–18. Cancer encompasses more than 100 related diseases19, making it crucial to understand the commonalities and differences among various types and subtypes. TCGA was founded to address these needs, and its large data sets are providing unprecedented opportunities for systematic, integrated analysis.

We performed a systematic analysis of 3,281 tumours from 12 cancer types to investigate underlying mechanisms of cancer initiation and progression. We describe variable mutation frequencies and contexts and their associations with environmental factors and defects in DNA repair. We identify 127 significantly mutated genes (SMGs) from diverse signalling and enzymatic processes. The finding of a TP53-driven breast, head and neck, and ovarian cancer cluster with a dearth of other mutations in SMGs suggests common therapeutic strategies might be applied for these tumours. We determined interactions among mutations and correlated mutations in BAP1, FBXW7 and TP53 with detrimental phenotypes across several cancer types. The subclonal structure and transcription status of underlying somatic mutations reveal the trajectory of tumour progression in patients with cancer.

Standardization of mutation data

Stringent filters (Methods) were applied to ensure high quality mutation calls for 12 cancer types: breast adenocarcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), uterine corpus endometrial carcinoma (UCEC), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), colon and rectal carcinoma (COAD, READ), bladder urothelial carcinoma (BLCA), kidney renal clear cell carcinoma (KIRC), ovarian serous carcinoma (OV) and acute myeloid leukaemia (LAML; conventionally called AML) (Supplementary Table 1). A total of 617,354 somatic mutations, consisting of

  • 398,750 missense,
  • 145,488 silent,
  • 36,443 nonsense,
  • 9,778 splice site,
  • 7,693 non-coding RNA,
  • 523 non-stop/readthrough,
  • 15,141 frameshift insertions/deletions (indels) and
  • 3,538 inframe indels,

were included for downstream analyses (Supplementary Table 2).
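As a quick arithmetic check, the eight category counts listed above can be summed and compared with the stated total of 617,354. The short sketch below does that and also reports each category's share; the dictionary keys are shortened labels chosen for this example.

```python
# Counts copied from the list above; keys are shortened labels.
MUTATION_COUNTS = {
    "missense": 398_750,
    "silent": 145_488,
    "nonsense": 36_443,
    "splice site": 9_778,
    "non-coding RNA": 7_693,
    "non-stop/readthrough": 523,
    "frameshift indel": 15_141,
    "inframe indel": 3_538,
}

TOTAL = sum(MUTATION_COUNTS.values())


def share(category: str) -> float:
    """Fraction of all listed mutations that fall in `category`."""
    return MUTATION_COUNTS[category] / TOTAL


if __name__ == "__main__":
    print("total:", TOTAL)
    for name in MUTATION_COUNTS:
        print(f"{name}: {100 * share(name):.1f}%")
```

The categories do add up to the reported total, with missense changes making up roughly two thirds of the set.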

Distinct mutation frequencies and sequence context

Figure 1a shows that AML has the lowest median mutation frequency and LUSC the highest (0.28 and 8.15 mutations per megabase (Mb), respectively). Besides AML, all types average over 1 mutation per Mb, substantially higher than in pediatric tumours20. Clustering21 illustrates that

  • mutation frequencies for KIRC, BRCA, OV and AML are normally distributed within a single cluster, whereas
  • other types have several clusters (for example, 5 and 6 clusters in UCEC and COAD/ READ, respectively) (Fig. 1a and Supplementary Table 3a, b).

In UCEC, the largest patient cluster has a frequency of approximately 1.5 mutations per Mb, and

  • the cluster with the highest frequency is more than 150 times greater.

Multiple clusters suggest that factors other than age contribute to development in these tumours14,16. Indeed,

  • there is a significant correlation between high mutation frequency and DNA repair pathway genes (for example, PRKDC, TP53 and MSH6) (Supplementary Table 3c). Notably,
  • PRKDC mutations are associated with high frequency in BLCA, COAD/READ, LUAD and UCEC, whereas
  • TP53 mutations are related with higher frequencies in AML, BLCA, BRCA, HNSC, LUAD, LUSC and UCEC (all P < 0.05).

Mutations in POLQ and POLE associate with high frequencies in multiple cancer types; POLE association in UCEC is consistent with previous observations14.
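The mutation frequencies quoted in this section are expressed per megabase. A minimal sketch of that normalisation follows; the covered-territory sizes and mutation counts in the usage example are hypothetical and are not TCGA values, and the paper's own Methods should be consulted for how covered bases were actually defined.

```python
from statistics import median


def mutations_per_mb(n_mutations: int, covered_bases: int) -> float:
    """Somatic mutation count divided by the covered territory in megabases."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_mutations / (covered_bases / 1_000_000)


if __name__ == "__main__":
    # Hypothetical tumours: (mutation count, covered bases). Not TCGA data.
    cohort = [(12, 30_000_000), (45, 30_000_000), (300, 30_000_000)]
    rates = [mutations_per_mb(n, bases) for n, bases in cohort]
    print("median mutations per Mb:", median(rates))
```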

Comparison of spectra across the 12 types (Fig. 1b and Supplementary Table 3d) reveals that LUSC and LUAD contain increased C>A transversions, a signature of cigarette smoke exposure10. Sequence context analysis across 12 types revealed

  • the largest difference being in C>T transitions and C>G transversions (Fig. 1c).

The frequency of thymine 1-bp (base pair) upstream of C>G transversions is markedly higher in BLCA, BRCA and HNSC than in other cancer types (Extended Data Fig. 1). GBM, AML, COAD/READ and UCEC have similar contexts in that

  • the proportions of guanine 1 base downstream of C>T transitions are between 59% and 67%, substantially higher than the approximately 40% in other cancer types.

Higher frequencies of transition mutations at CpG in gastrointestinal tumours, including colorectal, were previously reported22. We found three additional cancer types (GBM, AML and UCEC) clustered in the C>T mutation at CpG, consistent with previous findings of

  • aberrant DNA methylation in endometrial cancer23 and glioblastoma24.

BLCA has a unique signature for C>T transitions compared to the other types (enriched for TC) (Extended Data Fig. 1).
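The six transition/transversion categories and the flanking-base contexts discussed above rely on a common convention: each substitution is reported on the strand where the reference base is a pyrimidine (C or T). The sketch below shows one way to apply that convention and to flag C>T changes at CpG sites. The function names are invented for this example, and it is a simplified illustration rather than the pipeline used in the paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSITIONS = {"C>T", "T>C"}  # the other four classes are transversions


def to_pyrimidine_strand(upstream, ref, alt, downstream):
    """Re-express a substitution and its flanking bases on the strand where
    the reference base is a pyrimidine (C or T)."""
    bases = [b.upper() for b in (upstream, ref, alt, downstream)]
    if any(b not in COMPLEMENT for b in bases) or bases[1] == bases[2]:
        raise ValueError("expected four A/C/G/T bases with ref != alt")
    upstream, ref, alt, downstream = bases
    if ref in ("A", "G"):
        # Reverse-complement: flanks swap sides as well as being complemented.
        return (COMPLEMENT[downstream], COMPLEMENT[ref],
                COMPLEMENT[alt], COMPLEMENT[upstream])
    return upstream, ref, alt, downstream


def substitution_class(upstream, ref, alt, downstream):
    """Return one of C>A, C>G, C>T, T>A, T>C, T>G."""
    _, ref, alt, _ = to_pyrimidine_strand(upstream, ref, alt, downstream)
    return f"{ref}>{alt}"


def is_transition(sub_class):
    return sub_class in TRANSITIONS


def is_c_to_t_at_cpg(upstream, ref, alt, downstream):
    """True when the change is C>T with a guanine one base downstream."""
    _, ref, alt, down = to_pyrimidine_strand(upstream, ref, alt, downstream)
    return ref == "C" and alt == "T" and down == "G"


if __name__ == "__main__":
    print(substitution_class("A", "G", "T", "C"))  # G>T is reported as C>A
    print(is_c_to_t_at_cpg("C", "G", "A", "A"))    # CpG seen from the other strand
```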

Significantly mutated genes

Genes under positive selection, either in individual or multiple tumour types, tend to display higher mutation frequencies above background. Our statistical analysis3, guided by expression data and curation (Methods), identified 127 such genes (SMGs; Supplementary Table 4). These SMGs are involved in a wide range of cellular processes, broadly classified into 20 categories (Fig. 2), including

  • transcription factors/regulators, histone modifiers, genome integrity, receptor tyrosine kinase signalling, cell cycle, mitogen-activated protein kinases (MAPK) signalling, phosphatidylinositol-3-OH kinase (PI(3)K) signalling, Wnt/β-catenin signalling, histones, ubiquitin-mediated proteolysis, and splicing (Fig. 2).

The identification of MAPK, PI(3)K and Wnt/β-catenin signaling pathways is consistent with classical cancer studies. Notably, newer categories (for example, splicing, transcription regulators, metabolism, proteolysis and histones) emerge as exciting guides for the development of new therapeutic targets. Genes categorized as histone modifiers (Z = 0.57), PI(3)K signalling (Z = 1.03), and genome integrity (Z = 0.66) all relate to more than one cancer type, whereas

  • transcription factor/regulator (Z = 0.40), TGF-β signalling (Z = 0.66), and Wnt/β-catenin signalling (Z = 0.55) genes tend to associate with single types (Methods).

Notably, 3,053 out of 3,281 total samples (93%) across the Pan-Cancer collection had at least one non-synonymous mutation in at least one SMG. The average number of point mutations and small indels in these genes varies across tumour types, with the highest (~6 mutations per tumour) in UCEC, LUAD and LUSC, and the lowest (~2 mutations per tumour) in AML, BRCA, KIRC and OV. This suggests that the numbers of both cancer-related genes (only 127 identified in this study) and cooperating driver mutations required during oncogenesis are small (most cases only had 2–6) (Fig. 3), although large-scale structural rearrangements were not included in this analysis.

Common mutations

The most frequently mutated gene in the Pan-Cancer cohort is TP53 (42% of samples). Its mutations predominate in serous ovarian (95%) and serous endometrial carcinomas (89%) (Fig. 2). TP53 mutations are also associated with basal subtype breast tumours. PIK3CA is the second most commonly mutated gene, occurring frequently (>10%) in most cancer types except OV, KIRC, LUAD and AML. PIK3CA mutations frequented UCEC (52%) and BRCA (33.6%), being specifically enriched in luminal subtype tumours. Tumours lacking PIK3CA mutations often had mutations in PIK3R1, with the highest occurrences in UCEC (31%) and GBM (11%) (Fig. 2).

Many cancer types carried mutations in chromatin re-modelling genes. In particular, histone-lysine N-methyltransferase genes (MLL2 (also known as KMT2D), MLL3 (KMT2C) and MLL4 (KMT2B)) cluster in bladder, lung and endometrial cancers, whereas the lysine (K)-specific demethylase KDM5C is prevalently mutated in KIRC (7%). Mutations in ARID1A are frequent in BLCA, UCEC, LUAD and LUSC, whereas mutations in ARID5B predominate in UCEC (10%) (Fig. 2).

Fig. 1.  Distribution of mutation frequencies across 12 cancer types.


Dashed grey and solid white lines denote average across cancer types and median for each type, respectively. b, Mutation spectrum of six transition (Ti) and transversion (Tv) categories for each cancer type. c, Hierarchically clustered mutation context (defined by the proportion of A, T, C and G nucleotides within ±2bp of variant site) for six mutation categories. Cancer types correspond to colours in a. Colour denotes degree of correlation: yellow (r = 0.75) and red (r = 1).

Fig. 2.  The 127 SMGs from 20 cellular processes in cancer identified in the 12 cancer types and Pan-Cancer are shown, with the highest percentage in each gene among the 12 types (figure not shown)

Fig. 3.  Distribution of mutations in 127 SMGs across Pan-Cancer cohort.


Box plot displays median numbers of non-synonymous mutations, with outliers shown as dots. In total, 3,210 tumours were used for this analysis (hypermutators excluded).

Figure 4 | Unsupervised clustering based on mutation status of SMGs. Tumours having no mutation or more than 500 mutations were excluded. A mutation status matrix was constructed for 2,611 tumours. Major clusters of mutations detected in UCEC, COAD, GBM, AML, KIRC, OV and BRCA were highlighted.
Complete gene list shown in Extended Data Fig. 3.  (not shown)

Fig. 5. Driver initiation and progression mutations and tumour clonal mutation is in the subclone


Survival Analysis

We examined which genes correlate with survival using the Cox proportional hazards model, first analysing individual cancer types using age and gender as covariates; an average of 2 genes (range: 0–4) with mutation frequency ≥ 2% were significant (P ≤ 0.05) in each type (Supplementary Table 10a and Extended Data Fig. 6). KDM6A and ARID1A mutations correlate with better survival in BLCA (P = 0.03, hazard ratio (HR) = 0.36, 95% confidence interval (CI): 0.14–0.92) and UCEC (P = 0.03, HR = 0.11, 95% CI: 0.01–0.84), respectively, but mutations in SETBP1, recently identified with worse prognosis in atypical chronic myeloid leukaemia (aCML)31, have a significant detrimental effect in HNSC (P = 0.006, HR = 3.21, 95% CI: 1.39–7.44). BAP1 strongly correlates with poor survival (P = 0.00079, HR = 2.17, 95% CI: 1.38–3.41) in KIRC. Conversely, BRCA2 mutations (P = 0.02, HR = 0.31, 95% CI: 0.12–0.85) associate with better survival in ovarian cancer, consistent with previous reports32,33; BRCA1 mutations showed positive correlation with better survival, but did not reach significance here.

We extended our survival analysis across cancer types, restricting our attention to the subset of 97 SMGs whose mutations appeared in ≥ 2% of patients having survival data in ≥ 2 tumour types. Taking type, age and gender as covariates, we found 7 significant genes: BAP1, DNMT3A, HGF, KDM5C, FBXW7, BRCA2 and TP53 (Extended Data Table 1). In particular, BAP1 was highly significant (P = 0.00013, HR = 2.20, 95% CI: 1.47–3.29, more than 53 mutated tumours out of 888 total), with mutations associating with detrimental outcome in four tumour types and notable associations in KIRC (P = 0.00079), consistent with a recent report28, and in UCEC (P = 0.066). Mutations in several other genes are detrimental, including DNMT3A (HR = 1.59), previously identified with poor prognosis in AML34, and KDM5C (HR = 1.63), FBXW7 (HR = 1.57) and TP53 (HR = 1.19). TP53 has significant associations with poor outcome in KIRC (P = 0.012), AML (P = 0.0007) and HNSC (P = 0.00007). Conversely, BRCA2 (P = 0.05, HR = 0.62, 95% CI: 0.38 to 0.99) correlates with survival benefit in six types, including OV and UCEC (Supplementary Table 10a, b). IDH1 mutations are associated with improved prognosis across the Pan-Cancer set (HR = 0.67, P = 0.16) and also in GBM (HR = 0.42, P = 0.09) (Supplementary Table 10a, b), consistent with previous work35.
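For readers less familiar with how the hazard ratios and confidence intervals quoted above relate to a fitted Cox model, the sketch below shows the usual conversion: HR = exp(β), with a Wald-type 95% interval exp(β ± 1.96·SE). This assumes the intervals are symmetric on the log scale, which is the common default but is not confirmed by the excerpt. The function names are chosen for this example.

```python
import math


def hazard_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Return (HR, lower, upper) for a Cox coefficient and its standard error."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


def se_from_ci(lower: float, upper: float, z: float = 1.96) -> float:
    """Recover the standard error of log(HR) from a Wald-type interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


if __name__ == "__main__":
    # BAP1 in KIRC as quoted above: HR = 2.17, 95% CI 1.38 to 3.41.
    lower, upper = 1.38, 3.41
    se = se_from_ci(lower, upper)
    # Under the log-symmetry assumption the HR is the geometric mean of the bounds.
    print("implied SE of log(HR):", round(se, 3))
    print("geometric mean of bounds:", round(math.sqrt(lower * upper), 2))
```

The geometric mean of the quoted bounds comes out close to the reported hazard ratio, which is what one would expect if the interval was built this way.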

Driver mutations and tumour clonal architecture

To understand the temporal order of somatic events, we analysed the variant allele fraction (VAF) distribution of mutations in SMGs across AML, BRCA and UCEC (Fig. 5a and Supplementary Table 11a) and other tumour types (Extended Data Fig. 7). To minimize the effect of copy number alterations, we focused on mutations in copy neutral segments. Mutations in TP53 have higher VAFs on average in all three cancer types, suggesting early appearance during tumorigenesis.

It is worth noting that copy neutral loss of heterozygosity is commonly found in classical tumour suppressors such as TP53, BRCA1, BRCA2 and PTEN, leading to increased VAFs in these genes. In AML, DNMT3A (permutation test P = 0), RUNX1 (P = 0.0003) and SMC3 (P = 0.05) have significantly higher VAFs than average among SMGs (Fig. 5a and Supplementary Table 11b). In breast cancer, AKT1, CBFB, MAP2K4, ARID1A, FOXA1 and PIK3CA have relatively high average VAFs. For endometrial cancer, multiple SMGs (for example, PIK3CA, PIK3R1, PTEN, FOXA2 and ARID1A) have similar median VAFs. Conversely, KRAS and/or NRAS mutations tend to have lower VAFs in all three tumour types (Fig. 5a), suggesting NRAS (for example, P = 0 in AML) and KRAS (for example, P = 0.02 in BRCA) have a progression role in a subset of AML, BRCA and UCEC tumours. For all three cancer types, we clearly observed a shift towards higher expression VAFs in SMGs versus non-SMGs, most apparent in BRCA and UCEC (Extended Data Fig. 8a and Methods).
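To make the reasoning about VAFs concrete, the sketch below uses a standard textbook relation between tumour purity, copy number and the expected variant allele fraction of a mutation present in every tumour cell. It is not taken from the paper, and the function names and example values are assumptions. It shows why copy neutral loss of heterozygosity raises the VAF: a heterozygous clonal mutation in a diploid segment is expected at purity/2, whereas the same mutation after copy neutral LOH is expected at a value equal to the purity.

```python
def observed_vaf(alt_reads: int, ref_reads: int) -> float:
    """Variant allele fraction from read counts at one site."""
    depth = alt_reads + ref_reads
    if depth <= 0:
        raise ValueError("no reads at this site")
    return alt_reads / depth


def expected_vaf(purity: float, mutant_copies: int,
                 tumour_copies: int = 2, normal_copies: int = 2) -> float:
    """Expected VAF of a mutation carried by all tumour cells and no normal cells."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    mutant = purity * mutant_copies
    total = purity * tumour_copies + (1 - purity) * normal_copies
    return mutant / total


if __name__ == "__main__":
    purity = 0.8  # hypothetical tumour purity
    print("heterozygous, copy neutral:", expected_vaf(purity, 1))
    print("copy neutral LOH:", expected_vaf(purity, 2))
    print("from reads:", observed_vaf(40, 60))
```

A subclonal mutation would fall below the heterozygous expectation, which is the logic behind using lower VAFs as a sign of later events.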

Previous analysis using whole-genome sequencing (WGS) detected subclones in approximately 50% of AML cases15,36,37; however, analysis is difficult using AML exome owing to its relatively few coding mutations. Using 50 AML WGS cases, sciClone (http://github.com/genome/sciclone) detected DNMT3A mutations in the founding clone for 100% (8 out of 8) of cases and NRAS mutations in the subclone for 75% (3 out of 4) of cases (Extended Data Fig. 8b). Among 304 and 160 of BRCA and UCEC tumours, respectively, with enough coding mutations for clustering, 35% BRCA and 44% UCEC tumours contained subclones. Our analysis provides the lower bound for tumour heterogeneity, because only coding mutations were used for clustering. In BRCA, 95% (62 out of 65) of cases contained PIK3CA mutations in the founding clone, whereas 33% (3 out of 9) of cases had MLL3 mutations in the subclone. Similar patterns were found in UCEC tumours, with 96% (65 out of 68) and 95% (62 out of 65) of tumours containing PIK3CA and PTEN mutations, respectively, in the founding clone, and 9% (2 out of 22) of KRAS and 14% (1 out of 7) of NRAS mutations in the subclone (Extended Data Fig. 8b and Supplementary Table 12).

Mutation context (-2 to +2 bp) was calculated for each somatic variant in each mutation category, and hierarchical clustering was then performed using the pairwise mutation context correlation across all cancer types. The mutational significance in cancer (MuSiC)3 package was used to identify significant genes for both individual tumour types and the Pan-Cancer collective. An R function ‘hclust’ was used for complete-linkage hierarchical clustering across mutations and samples, and Dendrix30 was used to identify sets of approximately mutual exclusive mutations. Cross-cancer survival analysis was based on the Cox proportional hazards model, as implemented in the R package ‘survival’ (http://cran.r-project.org/web/packages/survival/), and the sciClone algorithm (http://github.com/genome/sciclone) generated mutation clusters using point mutations from copy number neutral segments. A complete description of the materials and methods used to generate this data set and its results is provided in the Methods.

References (20 of 38)

  1. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
  2. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
  3. Dees, N. D. et al. MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 22, 1589–1598 (2012).
  4. Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28, 907–913 (2012).
  5. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnol. 31, 213–219 (2013).
  6. Jones, S. et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321, 1801–1806 (2008).
  7. Parsons, D. W. et al. An integrated genomic analysis of human glioblastoma multiforme. Science 321, 1807–1812 (2008).
  8. Sjöblom, T. et al. The consensus coding sequences of human breast and colorectal cancers. Science 314, 268–274 (2006).
  9. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
  10. Ding, L. et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455, 1069–1075 (2008).
  11. Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113 (2007).
  12. The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
  13. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
  14. Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013).
  15. The Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).
  16. The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
  17. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012).
  18. The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).
  19. Hanahan, D. & Weinberg, R. A. The hallmarks of cancer. Cell 100, 57–70 (2000).
  20. Downing, J. R. et al. The Pediatric Cancer Genome Project. Nature Genet. 44, 619–622 (2012).


Larry H Bernstein, MD, FCAP
Pharmaceutical Intelligence

Clinical Trials Revisited


Cancer Clinical Trials of Tomorrow

Advances in genomics and cancer biology will alter the design of human cancer studies

By Tomasz M. Beer | April 1, 2013 | The Scientist
We stand on the cusp of significant change in the fundamental structure of cancer clinical trials, as the emphasis begins to shift from large-scale studies of relatively unselected patients to smaller studies testing more narrowly targeted therapies in molecularly characterized populations.
The previous (and still current) generation of trials established the cancer treatment standards used today. Trials that demonstrated the value of combination chemotherapy in the adjuvant treatment of breast cancer are an excellent example. Meticulous development of treatment regimens through Phase 1 and Phase 2 trials, followed by large-scale comparisons of the new regimens to established treatment protocols, has defined the modern practice of oncology for the last four decades. Future cancer clinical trials will be very different from those of the past, adopting a more personalized, sometimes called “precision,” approach.
It is, of course, not entirely true that past clinical trials did not include efforts to target treatments to the right patients. Where possible, targeted therapies are already being implemented. Using the presence of endocrine receptors to guide endocrine therapy for breast cancer was one of the first forays into molecular selection of patients. Unfortunately, the ability to select subgroups of patients for study has been severely curtailed by a still-limited knowledge of human cancer biology.
This is rapidly changing, however, thanks to advances in genomics and comprehensive cancer biology research over the last decade. Large-scale efforts, such as The Cancer Genome Atlas, are comprehensively defining many of the crucial molecular characteristics of human malignancies by illuminating genetic alterations that are clinically and biologically important, and which, by virtue of their functional roles, are viable targets for cancer treatment. At the same time, the ability to design small-molecule inhibitors of specific cancer targets is rapidly accelerating. In 2011, two new agents exemplified the power of these trends: crizotinib was approved for the treatment of lung cancers that harbor a specific mutation in the ALK gene, and vemurafenib was approved for the treatment of melanomas with a specific BRAF mutation. In both cases, the drugs were approved along with companion diagnostic tests that identify patients with the target mutation, who are therefore likely to benefit from treatment.

Smaller, more precise trials ahead

Clinical trials are being transformed by these trends. It will not happen overnight, as the knowledge of cancer biology and the availability of targeted agents are uneven. Unselected populations of patients will still be studied, but it is inevitable that there will be a rise in the number of trials that incorporate molecular tumor testing prior to treatment, with treatment selection informed by the molecular features of each individual’s cancer. Such personalized trials have the potential to yield better outcomes by increasing the probability of response and to employ less toxic therapies by increasingly targeting cancer-specific functions, rather than normal proliferative functions.
To the extent that targeted therapies will prove more effective when given to selected patients, clinical trials should get dramatically smaller. Trial size is largely driven by how effective the treatment is expected to be, so fewer participants are needed when the therapeutic benefit is larger. But the promise of smaller trials will not be universal; for example, when two targeted agents are compared to one another in the same molecularly selected population, the differences in efficacy may be small and larger trials will be required.
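The relationship between expected benefit and trial size can be illustrated with a rough calculation. The sketch below is an illustration under stated assumptions, not a trial-design tool: it uses the standard normal-approximation formula for comparing two group means, n per arm = 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. The function name `patients_per_arm` and the default values (two-sided α of 0.05, 80% power) are assumptions chosen for the example; real oncology trials typically use time-to-event endpoints and more involved methods.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to detect a standardized effect.

    Normal-approximation formula for a two-arm comparison of means;
    effect_size is the difference in means divided by the common SD.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for d in (0.2, 0.5, 1.0):
    print(d, patients_per_arm(d))
```

Because the required number scales with 1/d², halving the expected effect roughly quadruples the trial, which is why a head-to-head comparison of two similar targeted agents can demand a large study.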
Furthermore, smaller trials may not necessarily move faster or be easier to complete, as they will require the “right patients,” who may be hard to find. Many of the mutations that represent promising targets are present in a minority of tumors. Today, molecular characterization of tumors is often done as part of the screening process for each trial. Many, and sometimes most, of the patients prove ineligible, making this approach frustrating and difficult to carry out. A better avenue of attack would be to make comprehensive molecular characterization of tumors a routine part of establishing a patient’s eligibility for a range of therapies. With the plummeting cost of genomic analysis, one can envision a day in the near future when a complete cancer genome (and perhaps other molecular evaluations) becomes a standard component of an initial diagnostic evaluation. Patients will be armed with molecular information about their own tumors, and thus able to make more-informed decisions about standard and investigational therapies that match the mutations driving their cancer.

New challenges

The road to personalized and targeted treatment strategies will offer new challenges. For rare targets that are present in a minority of cases across many different types of cancers, one will have to consider clinical trials that include a number of different cancers. There are many design pitfalls to such trials, chiefly the additional clinical and molecular heterogeneity introduced by the inclusion of more than one cancer type. Despite these challenges, it will inevitably make sense in some settings to select patients who share a particular tumor biology, regardless of the tissue of origin.
Another major challenge is how to combine targeted therapies to improve clinical outcomes. To date, targeted therapies have not been able to cure advanced solid tumors. Clinical benefits, while sometimes quite impressive when compared to marginally effective treatments, still fall far short. It stands to reason that redundant survival and growth pathways enable tumors to overcome therapies that inhibit a single target. The simultaneous inhibition of relevant redundant pathways may yield dramatically better results, but will also dramatically increase the complexity of molecularly personalized clinical trials.
As approaches to cancer treatment advance, there will need to be continual engagement with patients and with cancer survivors. Fewer than 5 percent of adult cancer patients participate in a clinical trial. To carry out meaningful clinical trials in the future, that number must increase. This will be most important for treatments that target relatively rare mutations; a large number of potential volunteers will have to be screened to identify a sufficient number who harbor the relevant target. To succeed, we must partner with a much larger fraction of cancer patients.
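The screening burden for rare targets can be estimated with simple arithmetic. The following sketch assumes, purely for illustration, that every screened patient who carries the target mutation is eligible and enrolls, so it gives an expected figure on the optimistic side. The function name `patients_to_screen` and the example prevalences are hypothetical and are not taken from the article.

```python
def patients_to_screen(needed, carriers, per):
    """Expected number of patients to screen to enroll `needed` carriers.

    The mutation is assumed to occur in `carriers` out of every `per`
    patients. Integer ceiling division avoids floating-point rounding.
    """
    if needed < 0 or carriers <= 0 or per < carriers:
        raise ValueError("need needed >= 0 and 0 < carriers <= per")
    return -(-needed * per // carriers)


# A target found in 1 of every 20 tumors, with 100 patients required:
print(patients_to_screen(100, 1, 20))  # 2000
```

In practice the number would be higher, since some carriers will be ineligible for other reasons or will decline to participate.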
Designing and executing future cancer clinical trials will not be easy, but physician-scientists are armed with a fast-growing body of omics-informed knowledge with which to surmount these hurdles.
Tomasz M. Beer is deputy director of the Knight Cancer Institute and a professor of medicine at Oregon Health & Science University in Portland. He is the coauthor of Cancer Clinical Trials: A Commonsense Guide to Experimental Cancer Therapies and Clinical Trials. Written for people living with cancer, the book is accompanied by a blog (www.cancer-clinical-trials.com) that seeks to disseminate knowledge about clinical trials.



Curator: Aviva Lev-Ari, PhD, RN

New Institute for Precision Medicine Created at Weill Cornell Medical College and NewYork-Presbyterian Hospital


NEW YORK (Jan. 31, 2013) — Recognizing that medicine is not “one size fits all,” Weill Cornell Medical College and NewYork-Presbyterian Hospital have created the pioneering Institute for Precision Medicine at Weill Cornell and NewYork-Presbyterian/Weill Cornell Medical Center. This new, cutting-edge translational medicine research hub will explore the new frontier of precision medicine, offering optimal targeted, individualized treatment based on each patient’s genetic profile. The institute’s new genomic research discoveries will help develop novel, personalized medical therapies to be tested in innovative clinical trials, while also building a comprehensive biobank to improve research and patient care.


The Institute for Precision Medicine will be led by Dr. Mark Rubin, a renowned pathologist and prostate cancer expert who uses whole-genome sequencing in his laboratory to investigate DNA mutations that lead to disease, particularly prostate cancer. Dr. Rubin currently serves as vice chair for experimental pathology, director of Translational Research Laboratory Services, the Homer T. Hirst III Professor of Oncology, professor of pathology and laboratory medicine and professor of pathology in urology at Weill Cornell and is a pathologist at NewYork-Presbyterian/Weill Cornell.

Dr. Rubin and his team seek to replace the traditional one-size-fits-all medicine paradigm with one that focuses on targeted, individualized patient care using a patient’s own genetic profile and medical history. Physician-scientists at the institute will seek to precisely identify the genetic influencers of a patient’s specific illness — such as cancer, cardiovascular disease, neurodegenerative disease and others — and use this genetic information to design a more-effective course of treatment that targets those specific contributing factors. Also, genomic analyses of tumor tissue will enable researchers to help patients with advanced disease and no current treatment options, as well as to isolate the causes of drug resistance in patients who stop responding to treatments, redirecting them to more successful therapies.

Preventive precision medicine will also be a key initiative at the institute, allowing physician-scientists to help identify a patient’s risk of diseases and take necessary steps to aid in its prevention through medical treatment and/or lifestyle modification. In addition, the Institute for Precision Medicine will leverage an arsenal of innovative genomic sequencing, biobanking and bioinformatics technology to transform the existing paradigm for diagnosing and treating patients.

“This institute will revolutionize the way we treat disease, linking cutting-edge research and next-generation sequencing in the laboratory to the patient’s bedside,” Dr. Rubin says. “We will use advanced technology and the collective wealth of knowledge from our clinicians, basic scientists, pathologists, molecular biologists and computational biologists to pinpoint the molecular underpinnings of disease — information that will spur the discovery of novel treatments and therapies. It’s an exciting time to be involved in precision medicine and I look forward to advancing this game-changing field of medicine.”

“Precision medicine is the future of medicine, and its application will help countless patients,” says Dr. Laurie H. Glimcher, the Stephen and Suzanne Weiss Dean of Weill Cornell Medical College. “The Institute for Precision Medicine, with Dr. Rubin’s expertise and strong leadership, will accelerate our understanding of the human genome, provide key insights into the causes of disease and enable our physician-scientists to translate this knowledge from the lab to the clinical setting to help deliver personalized treatments to the sickest of our patients.”

Three main resources will facilitate the institute’s groundbreaking precision medicine work:

  • genomics sequencing,
  • biobanking and
  • bioinformatics.

Weill Cornell and NewYork-Presbyterian will invest in state-of-the-art technology to conduct sequencing, a more expansive biobank for all patient specimens and tissue samples and dedicated bioinformaticians who will closely analyze patient data, searching for genetic mutations and other abnormalities to identify and target with treatment.

“The Institute for Precision Medicine will enable our doctors to tailor effective treatments for individual patients and also predict the diseases that are likely to affect a patient long before they develop,” says Dr. Steven J. Corwin, CEO of NewYork-Presbyterian Hospital. “By harnessing the full potential of our enhanced understanding of the human genome, and extending its reach into the clinical realm, the institute will transform patient care at NewYork-Presbyterian/Weill Cornell Medical Center and beyond.”

Dr. Rubin, the institute’s inaugural director, is a board-certified pathologist and physician-scientist with specific expertise in genitourinary pathology and an internationally recognized leader in prostate cancer genomics and biomarker research. His groundbreaking research investigating molecular biomarkers distinguishing indolent from aggressive disease has led to landmark discoveries that revolutionized the understanding of prostate cancer’s molecular underpinnings. This includes co-discovering two of the most common mutations in prostate cancer,

  • the TMPRSS2-ETS rearrangements and 
  • SPOP mutations.

Dr. Rubin is one of the “Dream Team” principal investigators of a multi-institutional $10 million grant from Stand Up 2 Cancer (SU2C) and the Prostate Cancer Foundation, addressing patients with advanced prostate cancer through a multi-phase approach employing next generation sequencing to help inform the direction of future clinical trials. Additionally, Dr. Rubin serves as a co-principal investigator on the National Cancer Institute’s (NCI) Early Detection Research Network (EDRN) Biomarker Discovery Laboratory and worked for many years as part of the NCI Prostate Cancer Specialized Programs of Research Excellence (SPORE).

Dr. Rubin has authored more than 275 peer-reviewed publications, predominantly in prostate cancer, and holds multiple NCI-funded grants in prostate cancer genomics and biomarker development. He is a member of the World Health Organization Prostate Cancer Tumor Classification and the Prostate TCGA (The Cancer Genome Atlas) Working Group. He serves as an ad hoc reviewer for multiple publications including Nature, Science, Cancer Cell, Cancer Discovery and the New England Journal of Medicine. Dr. Rubin also serves as the chair of the EDRN Prostate Cancer Working Group and is a member of the EDRN Steering Committee. He is active in the NCI/NHGRI-sponsored TCGA serving on the Prostate Cancer Working Group and he is an external advisor for the Canadian International Cancer Genome Consortium (ICGC). He served on the NCI Cancer Biomarker Study Section for five years and as an ad hoc reviewer for other NCI and international granting organizations.

Dr. Rubin is the recipient of the Arthur Purdy Stout Society of Surgical Pathologists Annual Prize (2003), the Young Investigator Award (2004) given by the United States and Canadian Academy of Pathology and the Huggins Medal (2012), the highest award bestowed by the Society of Urologic Oncology. Finally, Dr. Rubin was a co-team leader with his long-term collaborator, Arul M. Chinnaiyan (University of Michigan), for the first annual American Association for Cancer Research Team Science Award (2007) in recognition of their groundbreaking work on TMPRSS2-ETS fusion prostate cancer.


Clinical Laboratory Improvement Amendments (CLIA)

The Centers for Medicare & Medicaid Services (CMS) regulates all laboratory testing (except research) performed on humans in the U.S. through the Clinical Laboratory Improvement Amendments (CLIA). In total, CLIA covers approximately 225,000 laboratory entities. The Division of Laboratory Services, within the Survey and Certification Group, under the Office of Clinical Standards and Quality (OCSQ) has the responsibility for implementing the CLIA Program.

The objective of the CLIA program is to ensure quality laboratory testing. Although all clinical laboratories must be properly certified to receive Medicare or Medicaid payments, CLIA has no direct Medicare or Medicaid program responsibilities.

For the following information, refer to the downloads/links listed below:

  • For additional information about a particular laboratory, contact the appropriate State Agency or Regional Office CLIA contact (refer to State Agency or Regional Office CLIA link found on the left-hand navigation pane);
  • Information about direct access testing (DAT) and the CLIA regulations is included in the Direct Access Testing download;
  • OIG reports relating to CLIA;
  • Guidance for Coordination of CLIA Activities Among CMS Central Office, CMS Regional Offices, State Agencies (including State with Licensure Requirements), Accreditation Organizations and States with CMS Approved State Laboratory Programs is contained in the Partners in Laboratory Oversight download;
  • Quality control (QC) highlights from the regulations published in the Federal Register on January 24, 2003 are listed under the QC Highlights download;
  • Micro sample pipetting information for laboratories;
  • Information on alternative (non-traditional) laboratory is contained in the Special Alert download;
  • Identifying Best Practices in Laboratory Medicine – a Battelle Project for the Centers for Disease Control and Prevention (CDC); and
  • FDA Safety Tip for laboratories on how workload should be calculated when using currently FDA-approved semi-automated gynecologic cytology screening devices.

For specific information about the quality assurance guidelines for testing using the rapid HIV-1 antibody tests waived under CLIA, refer to the CDC Division of Laboratory Systems website listed under the related links outside CMS section below.

Complaint Reporting

To report a complaint about a laboratory, contact the appropriate State Agency that is found on the State Agency & Regional Office CLIA Contacts page located in the left navigation bar in this section.


New Weill Cornell Precision Medicine Institute Plans to Offer Genomically Guided Treatment after CLIA Approval

February 06, 2013

Through a newly created Institute for Precision Medicine, Weill Cornell Medical College and New York Presbyterian Hospital plan to begin offering targeted, individualized treatment informed by patients’ genomes.

The institute first plans to guide treatment decisions for cancer patients using their genomic data, and then broaden the effort to those with common illnesses, such as cardiovascular disease and neurodegenerative disorders.

The new institute is currently awaiting regulatory approval from CLIA and New York State, according to its leader, Mark Rubin, a professor of pathology at Weill Cornell.

With that approval in hand, the center will begin using genome sequencing and other tools to inform treatment strategies for patients – first focusing on cancer, and then eventually broadening to other disease areas, he said.

While Rubin did not detail how the institute will recruit patients, he said the center plans to see cancer patients who can benefit from single-gene tests or other molecular diagnostics to inform treatment decisions, those with advanced diseases without treatment options, and patients who stop responding to standard treatments and could be redirected to other therapies.

“For some patients, there are very clear indications of whether they need a specific targeted therapy. Those are pretty straightforward,” Rubin said.

“And then there is emerging data that sequencing, either exome or whole-genome, can provide insight on which treatments cancer patients might need that are not considered standard treatments,” he said.

Insights from advanced sequencing technologies are also changing how researchers study patients, sometimes facilitating N-of-1 trials. “There are a few examples where treatments have been implemented and shown to be effective in a clinical trial of one,” Rubin said, “where they are the only person on the trial because of their specific mutations.”

He said the institute plans to be agnostic in terms of what technologies it uses for sequencing, but currently it relies on Illumina technology.

“We will have a number of different approaches, but the key is to do as best as possible in the clinical setting so that the results can be used in the management of patients,” Rubin said.

According to Rubin, the institute aims to find the optimal ways to collect genomic data, analyze it, and store it. As the center gears up and sees larger numbers of patients, Rubin said it also plans to use data and samples it collects to support larger retrospective or prospective studies, for which the institute is considering partnering with the New York Genome Center.

But not all patients the institute sees may require large-scale genome sequencing, Rubin reiterated.

“It may turn out that the most efficient way to determine if someone has a certain mutation, like EGFR, is to run the single-gene test up front. That’s not going to change for some types of disease,” he noted. “So, what I see our role being is developing these more complex approaches, which could be whole-genome sequencing, or using multiplex panels of genes.”

The institute will focus its efforts first on cancer patients because the development of genomically targeted therapies is relatively accelerated compared to other disease areas, so the potential to match a mutation in a cancer patient’s genome to a potential treatment may be greater than for those with other illnesses. But according to Rubin, the institute does plan to expand to other populations, like cardiovascular conditions, neurodegenerative diseases, and possibly infectious diseases.

In addition, he said the Institute is also discussing how it might use prognostic genomic information to look at disease risk, with the potential to inform early interventional treatment decision making.

“Because we are a hospital that sees well patients being followed by their doctors, that’s something we’re contemplating as pilot,” he said.

“We don’t have a plan in mind yet, but those types of studies are probably very important in specific disease entities, for patients at risk for a particular constitutional disease … or you could imagine we might screen large numbers of our patient population to look for risk factors that may not have been identified yet,” he said.

Several commercial and academic groups have recently begun offering clinical cancer sequencing and other genomic analyses to potentially guide therapeutic decision making.

Foundation Medicine, for example, sells a test that sequences the exons of nearly 200 genes known to be mutated in solid tumors and provides a report informing doctors of actionable mutations.

Other firms, like Caris Life Sciences, provide reports to doctors and patients matching gene-expression or sequence data to potentially actionable therapies.

The University of Michigan and the International Genomics Consortium announced last fall that they were creating a non-profit company called Paradigm to provide a targeted-sequencing-based diagnostic service to guide personalized treatment for cancer and, eventually, other diseases (PGx Reporter 7/5/2012).

Though the new Weill Cornell institute plans to start fairly narrowly, Rubin’s overall goals for the center may put it in a position to offer a potentially more comprehensive service than many of these other groups.

“The most challenging part right now is for us to understand that sequencing is just a test, something that may or may not be useful in itself. It’s in working in the clinical setting that we are going to really define it,” Rubin said.

The institute, in the early days of its operations, is still in a learning phase, but “expectations are high” for the effort to succeed, according to Rubin. “Our job is to live up to the promise [of sequencing] to help identify novel targets for patients who may not have any choices with respect to treatment,” Rubin said. “And also to make discoveries that may be useful for a larger population.”

Molika Ashford is a GenomeWeb contributing editor and covers personalized medicine and molecular diagnostics.


Reporter: Aviva Lev-Ari, PhD, RN

arrayMap: A Reference Resource for Genomic Copy Number Imbalances in Human Malignancies

Haoyang Cai#, Nitin Kumar#, Michael Baudis*

Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland

Abstract

Background


The delineation of genomic copy number abnormalities (CNAs) from cancer samples has been instrumental for identification of tumor suppressor genes and oncogenes and proven useful for clinical marker detection. An increasing number of projects have mapped CNAs using high-resolution microarray based techniques. So far, no single resource does provide a global collection of readily accessible oncogenomic array data.

Methodology/Principal Findings

We here present arrayMap, a curated reference database and bioinformatics resource targeting copy number profiling data in human cancer. The arrayMap database provides a platform for meta-analysis and systems level data integration of high-resolution oncogenomic CNA data. To date, the resource incorporates more than 40,000 arrays in 224 cancer types extracted from several resources, including the NCBI’s Gene Expression Omnibus (GEO), EBI’s ArrayExpress (AE), The Cancer Genome Atlas (TCGA), publication supplements and direct submissions. For the majority of the included datasets, probe level and integrated visualization facilitate gene level and genome wide data review. Results from multi-case selections can be connected to downstream data analysis and visualization tools.

Conclusions/Significance


To our knowledge, currently no data source provides an extensive collection of high resolution oncogenomic CNA data which readily could be used for genomic feature mining, across a representative range of cancer entities. arrayMap represents our effort for providing a long term platform for oncogenomic CNA data independent of specific platform considerations or specific project dependence. The online database can be accessed at http://www.arraymap.org.

Citation: Cai H, Kumar N, Baudis M (2012) arrayMap: A Reference Resource for Genomic Copy Number Imbalances in Human Malignancies. PLoS ONE 7(5): e36944. doi:10.1371/journal.pone.0036944

Editor: Ying Xu, University of Georgia, United States of America

Received: January 10, 2012; Accepted: April 16, 2012; Published: May 18, 2012

Copyright: © 2012 Cai et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: HC is supported through a personal grant from the China Scholarship Council. NK and MB had received support through the Krebsliga Schweiz and the University of Zurich’s Research Priority Program Systems Biology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

* E-mail: michael.baudis@imls.uzh.ch

# These authors contributed equally to this work.

Introduction

Genomic copy number abnormalities (CNAs) are a relevant feature in the development of basically all forms of human malignancies [1]. Many genomic imbalances are recurrent and display tumor-specific patterns [2],[3]. It is believed that these genomic instabilities reveal mutations in tumor suppressor genes and oncogenes which eventually result in a clone of fully malignant cells. Investigation of CNA hot spots (chromosomal loci frequently involved in CNA) has proven to be an effective methodology to identify novel cancer-causing genes [4][5]. On a systems level, CNA data along with expression or somatic mutation data is used to detect pathways altered in cancers and to deduce functional relevance of pathway members[6][7]. Since many CNAs have been attributed to specific tumor types or clinical risk profiles, in some entities copy number profiling is employed to characterize different biological as well as clinical subtypes with implications for treatment and individual prognosis. Subtype-associated CNA regions are used to predict causative genes, furthering understanding of biological differences and leading to discovery of new therapeutic targets [8][9].

Throughout the last two decades, molecular-cytogenetic techniques have been applied to scan genomic copy number profiles in virtually all types of human neoplasias. For whole genome analysis, these techniques predominantly consist of chromosomal and array comparative genomic hybridization (CGH), including CNA detection by cDNA and single nucleotide polymorphism (SNP) arrays [10][12]. While chromosomal CGH has a limited spatial resolution of several megabases, the resolution of recent array based technologies (aCGH) is mainly limited due to cost/benefit evaluations instead of technical obstacles. In this article, we use the terms “array CGH” and “aCGH” for all technical variants of whole genome copy number arrays. This includes e.g. single color arrays for which regional copy number normalization is performed through bioinformatics procedures applied to external references and internal data distribution.
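As a rough illustration of how array intensities are turned into copy number calls, the sketch below computes log2 ratios of tumor to reference signal, median-centers them, and labels each probe as a gain, loss or neutral. It is a simplified sketch and not the arrayMap pipeline: the function names are hypothetical, the ±0.3 thresholds are arbitrary assumptions chosen for the example, and real workflows add probe-level normalization and segmentation steps that are omitted here.

```python
from math import log2
from statistics import median


def log2_ratios(tumor, reference):
    """Per-probe log2 ratio of tumor intensity to reference intensity."""
    return [log2(t / r) for t, r in zip(tumor, reference)]


def median_center(values):
    """Shift values so their median is zero (a simple normalization)."""
    m = median(values)
    return [v - m for v in values]


def call_states(values, gain=0.3, loss=-0.3):
    """Label each probe using illustrative thresholds (assumed, not standard)."""
    return [
        "gain" if v >= gain else "loss" if v <= loss else "neutral"
        for v in values
    ]


tumor = [100, 100, 200, 50, 100]
reference = [100, 100, 100, 100, 100]
ratios = median_center(log2_ratios(tumor, reference))
print(call_states(ratios))  # ['neutral', 'neutral', 'gain', 'loss', 'neutral']
```

For single color arrays, the reference values would come from external reference samples rather than a co-hybridized channel, as the text notes.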

The flood of new insights into structural genomic changes in health and disease has led to an increased interest in genomic data sets in genetic and cancer research. Several systematic studies of CNAs across many cancer types have been performed [13][14]. These efforts attempt a more complete understanding of functional effect of CNAs in the context of cancer.

The exponential increase of high resolution CNA datasets offers new challenges and opportunities for large-scale genomic data mining, data modeling and functional data integration. Several online resources have been developed, focusing on different aspects of data content as well as representation [6][15][19]. An overview of some of the prominent examples is given in Table 1. In principle, these databases facilitate access to and utilization of CNA data. However, they are limited to specific aCGH platforms and/or single institutions as well as limited disease categories, or, as in the cases of GEO [15] and EBI ArrayExpress [16], mainly serve as raw data repositories. To the best of our knowledge, no single data source yet provides an extensive collection of high resolution oncogenomic CNA data which could readily be used for genomic feature mining across a representative range of cancer entities.

Table 1. Prominent online resources of genomic data.


Here we present “arrayMap”, a web-based reference database for genomic copy number data sets in cancer. We have generated a pipeline to accumulate and process oncogenomic array data into a unified and structured format. The resource incorporates associated histopathological and clinical information where accessible.

So far, arrayMap contains more than 40,000 arrays on 224 cancer types from five main data sources: NCBI GEO, EBI ArrayExpress, The Cancer Genome Atlas, publication supplements and user submitted data. Samples of interest can be browsed, visualized and analyzed via an intuitive interface. Computational tools are provided for biostatistical data analysis such as CNA clustering for case specific or for subset data and basic clinical correlations. arrayMap is publicly available at www.arraymap.org.

Results

Data Content

Our combination of both “top-down” (publication driven) as well as “bottom-up” (array data driven) approaches allowed us to identify a comprehensive set of accessible aCGH based cancer CNA data sets and to estimate the ratio of accessible data of the overall published/deposited data.

As the main result of the array data driven approach, we extracted 495 series comprising 32002 arrays, generated on 237 platforms, from NCBI’s GEO. Among those, raw data files of approximately 29000 whole genome arrays were suitable for inclusion into our data processing pipeline. When reviewing the content of AE, we found that the majority of AE cancer genome data sets had also been submitted to GEO. At the time of writing, 11 datasets including 712 arrays not present in GEO had been processed based on AE specific series. Detailed information on the GEO/AE data sets is provided in Table S1.

The top-down procedure was based on our group’s continuous monitoring of cancer related articles utilizing genome copy number screening approaches, as established for our “Progenetix” project (www.progenetix.org [19]). The census date for the literature based data collection was August 15, 2011. At this point, we had identified 931 articles discussing a total of 53213 genomic cancer CNA profiles based on aCGH techniques. Of these, 8728 cases from 199 articles had so far been extracted from publication related sources (e.g. supplementary data tables), annotated and made accessible through Progenetix. This data included cases for which only supervised information but no probe data was available (e.g. author annotated Golden Path or cytogenetic CNA regions). Literature based data sets containing probe specific data, or for which the respective data was provided to us by the authors (640 samples), were included into our arrayMap data processing pipeline.

The data content of arrayMap is summarized in Table 2. Current numbers on the website will include changes based on ongoing annotation efforts (i.e. addition of data sets, removal of low quality arrays).

Table 2. aCGH data integrated in arrayMap.


As a by-product of our data collection and annotation efforts, we are able to provide estimates of content and trends for the platform usage and cancer entity coverage for the majority of published data. According to the assigned ICD-O 3 (International Classification of Diseases for Oncology, 3rd Edition) code and descriptive diagnostic text, breast carcinoma predominates as the single largest clinical entity with 6459 arrays. Table S2 presents sample sets in arrayMap classified by ICD-O code.

The most widely available array CGH platforms are either based on large insert clones (BAC/P1 arrays) or based on shorter single-stranded DNA molecules (oligonucleotide arrays), which may or may not include single-nucleotide polymorphism specific probe sequences (SNP arrays). Also, although designed for gene expression profiling, cDNA arrays were used by several laboratories for measuring genomic copy number changes. Although all these platforms are considered suitable for whole genome CNA analysis, their probe densities and other parameters can affect specific features of the analysis results [20][23]. Table S3 lists the general platform types and corresponding overall numbers of the data registered in arrayMap.

In reviewing the technical platform composition, two related trends become apparent (Figure 1). Originally developed in groups with expertise in molecular cytogenetics and cancer genome analysis, printed large insert clone arrays (BAC/P1) were the first whole genome CNA screening tools with a spatial resolution surpassing that of chromosomal CGH. Other groups re-employed cDNA arrays, developed for expression screening, for genomic hybridizations. However, over the last years one can observe the overwhelming use of various industrially produced oligonucleotide array platforms, which compensate for their low single probe fidelity through a probe density 1–3 orders of magnitude higher than is common for BAC/P1 arrays. Another reason for the success of oligonucleotide arrays is the integration of SNP specific probes, which in principle allows the use of the same experiments for genetic association studies and for the evaluation of copy number neutral loss of heterozygosity regions [12][24][25].

Figure 1. Distribution of resolutions and techniques of GEO platforms.

Each point represents a genomic array. The Y axis is labeled with probe number in log scale. The X axis denotes the time sequence of array data generation. From left to right are years from 2001 to 2011.



Data Access and Usage Scenarios

Based on our experience from the Progenetix project, a strong emphasis was put on a user friendly data interface. Here, we followed a “dual user type” scenario: Users without bioinformatics background should be able to intuitively visualize core data features as well as to perform standard analysis procedures, while for bioinformaticians the formatted database content should be accessible to use with their analysis tools of choice.

Query interface.

Data browsing in arrayMap is based on two types of query methods: search by experimental series metadata and search by sample features.

In the series query form, users can search by specifying (i) descriptive diagnosis text; (ii) disease classification (ICD-O 3 code(s)); (iii) disease locus (ICD topography code(s)); (iv) PubMed ID; (v) technique(s); (vi) series ID. For sample specific queries, additional features are available: sample ID; platform ID or description; and single or combined regional CNAs. Users can input gene name(s) in the “regional CNA” search field. When at least two characters are entered into the field, suggestions based on a HUGO gene list are displayed for selection. Gene selections will be converted to genomic locations.
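The suggestion and conversion steps can be illustrated with a minimal sketch. It is not the arrayMap implementation: the function names and the small lookup table are hypothetical, and the hg18 coordinates are simply those quoted in the compound CNA query example later in this article.

```python
# Hypothetical gene table; coordinates (hg18) are taken from the compound
# CNA query example in this article, not from a real HUGO gene list.
GENE_LOCATIONS = {
    "EGFR": ("chr7", 55054219, 55242524),
    "PTEN": ("chr10", 89613175, 89718511),
    "ASPM": ("chr1", 195319885, 195382287),
    "CDKN2A": ("chr9", 21957751, 21984490),
}


def suggest_genes(prefix, gene_table=GENE_LOCATIONS, min_chars=2):
    """Return matching gene symbols once at least `min_chars` are typed."""
    prefix = prefix.strip().upper()
    if len(prefix) < min_chars:
        return []
    return sorted(symbol for symbol in gene_table if symbol.startswith(prefix))


def gene_to_region(symbol, gene_table=GENE_LOCATIONS):
    """Convert a selected gene symbol into a 'chr:start-end' location string."""
    chrom, start, end = gene_table[symbol.upper()]
    return f"{chrom}:{start}-{end}"
```

With this table, a single typed character returns no suggestions, while a two-character prefix such as "cd" is enough to offer CDKN2A.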

In the results table, associated array information is displayed. A number of links to additional and/or outside data is provided, according to the information available: the corresponding PubMed entries; the original GEO/AE accession display page for more complete information; the case and publication entries on the Progenetix website for further analysis; and importantly the array specific data visualization page.

Data download options.

On pages resulting from sample queries or sample data processing, users are presented with options to download sample data based on the current query's return. Currently, three different file types are offered: JSON files, tab separated feature files and segment list files. These files enable bioinformaticians to perform further analyses with their tools of choice. In particular, the JSON format can be used for direct database import (e.g. MongoDB), can be parsed by common libraries (e.g. JSON.pm), or can be read into web applications.
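As a rough sketch of how a downloaded segment list might be consumed offline, the following reads a tab separated table and re-serializes it as JSON. The column names and the values are assumptions made for illustration and do not reproduce the actual arrayMap export layout; only the sample ID and the 9p21 region are taken from this article.

```python
import csv
import io
import json

# Hypothetical tab separated segment list. Column names and values are
# invented for illustration and are not an actual arrayMap export.
EXAMPLE_TSV = (
    "sample\tchrom\tstart\tend\tvalue\tprobes\n"
    "GSM535547\tchr9\t21600000\t22400000\t-1.2\t85\n"
    "GSM535547\tchr7\t1\t158000000\t0.4\t14000\n"
)


def read_segments(handle):
    """Parse a tab separated segment table into a list of typed dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "sample": row["sample"],
            "chrom": row["chrom"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "value": float(row["value"]),
            "probes": int(row["probes"]),
        }
        for row in reader
    ]


segments = read_segments(io.StringIO(EXAMPLE_TSV))
as_json = json.dumps(segments)  # e.g. for import into a document database
```

For a real download, `io.StringIO(EXAMPLE_TSV)` would be replaced by an opened file handle, and the field names adjusted to the header of the downloaded file.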

Array probe data visualization.

In the array plot interface, original plots of genomic array data sets can be searched and visualized (Figure S1). Default threshold parameters which were either provided with the data or assigned during the initial visualization will be loaded. In single array visualization, the general view of probe distribution and post-thresholding segmentation results are displayed for the whole genome as well as for each individual chromosome. If multiple arrays are retrieved, users can select sample data for downstream analysis procedures. Figure S2 shows the screenshot of single array visualization.

Users can segment the raw data values and re-plot the results after revising the following parameters:

  • Golden path edition, default HG18/NCBI Build 36. This is still the commonly used version of the human reference genome assembly. At the moment, coordinates of probes from all platforms were remapped to HG18. For the near future, we intend to allow for a selection of updated genome editions.
  • Chromosomes to plot, default 1 to 22. Single or all chromosomes can be selected for re-plotting. To avoid gender bias, most platforms were designed without probes on chromosomes X and Y.
  • Loss/gain thresholds. Cut-offs from which a segment is considered a genomic loss or gain. The optimum thresholds may vary between platforms.
  • Region size in kb. Sets a filter to remove CNA below (e.g. probable noise) or above (e.g. exclude non-focal CNA) a certain size range.
  • Minimal probe numbers for segments. This parameter can be used to limit the minimal number of probes required for a segment to be considered (e.g. to remove aberrant segmentation due to probe level noise). Empirical examples would be values of 2–3 for high quality BAC arrays and 6–10 for Affymetrix SNP 6 arrays.
  • Plot region. Single genomic region to be plotted, overriding the chromosome selection above. When selected, plots with this region will be generated for all current arrays. This is valuable to e.g. display the CNA status and copy number transition points for specific genes of interest (Figure S3).
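How these parameters interact can be sketched in a few lines. This is an illustration only and not the code used by the arrayMap plotting service: the `Segment` class and `call_segments` function are hypothetical, the default log2 thresholds of ±0.15 are borrowed from the Figure S1 example, and the default of 2 probes from the minimum stated in the Methods.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chrom: str
    start: int    # base position
    end: int      # base position
    log2: float   # segment mean log2 ratio
    probes: int   # number of probes supporting the segment


def call_segments(segments, gain=0.15, loss=-0.15,
                  min_kb=0, max_kb=None, min_probes=2):
    """Apply probe number, size and threshold filters; return (segment, call) pairs."""
    calls = []
    for seg in segments:
        size_kb = (seg.end - seg.start) / 1000
        if seg.probes < min_probes:
            continue  # too few probes, probably probe level noise
        if size_kb < min_kb or (max_kb is not None and size_kb > max_kb):
            continue  # outside the requested region size range
        if seg.log2 > gain:
            calls.append((seg, "gain"))
        elif seg.log2 < loss:
            calls.append((seg, "loss"))
    return calls


# Invented example segments, for illustration only.
example = [
    Segment("chr9", 21600000, 22400000, -1.1, 40),
    Segment("chr7", 0, 158000000, 0.4, 9000),
    Segment("chr1", 1000, 2000, 0.9, 1),
    Segment("chr2", 0, 5000000, 0.05, 300),
]
all_calls = call_segments(example)
focal_calls = call_segments(example, max_kb=1000)
```

Setting `max_kb` restricts the result to focal events, which corresponds to using the region size filter to exclude non-focal CNA.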

Zoom-in visualization of focal CNA.

Figure 2 shows the visualization of focal genomic imbalances, e.g. to identify genes of interest targeted by focal CNA. The whole genome view of GSM535547 (human high grade glioma sample analyzed by Agilent Human Genome CGH Microarray 244A) shows a small regional deletion in chromosome 9p21. When plotting the approximate locus of the deletion (specified as chr9:21600000-22400000), genes, probes and chromosome bands in this zoomed in region are shown. Two genes, MTAP and CDKN2A can be seen as being localized in a potential homozygously deleted region. The focal deletion of these known tumor suppressor genes [26][27] points to their specific involvement in the glioblastoma sample analyzed here.

Figure 2. Zoom-in visualization of focal CNA.

(A) GSM535547 (human high grade glioma, Agilent CGH 244A) shows a high quality probe hybridization signal. CNAs are easy to distinguish. (B) When zooming in on the whole of chromosome 9, a deletion of approximately 80 Mb is displayed, with the two breakpoints located in the p and q arm, respectively. In addition, a small regional deletion in 9p21 is clearly visible. Color bars in the lower region of the panel represent 848 genes located on chromosome 9. (C) Zoom in on the potential homozygously deleted region in 9p21 by specifying the exact region: chr9:21600000-22400000. The zoomed-in plot shows probes, chromosome band and two tumor suppressor genes, MTAP and CDKN2A. Gene names and locations are displayed on mouse hover and link to the UCSC genome browser for additional information.


Querying compound CNA.

The concept of focal CNA detection can be integrated with a global search for arrays containing gene specific regional imbalances. As an example, we demonstrate the search for arrays displaying imbalances in 4 gene loci associated with glioblastoma: EGFR, a transmembrane receptor and proto-oncogene [28]; PTEN, a tumor suppressor gene [29]; ASPM, frequently overexpressed in glioblastoma relative to normal brain tissue [30]; and CDKN2A (see above). In the “Search Samples” form, the “Match (Multiple) Regions & Types” field can be used to specify the genomic regions of those four genes including the expected CNA type: for EGFR (chr7:55054219-55242524:1), PTEN (chr10:89613175-89718511:-1), ASPM (chr1:195319885-195382287:1) and CDKN2A (chr9:21957751-21984490:-1), respectively. When the query was executed, these regions were matched against the whole database, returning the cases with imbalances overlapping all of these regions. When excluding controls and “worst quality” datasets, 303 out of 42421 arrays could be identified as matching all four CNA regions. In addition to glioblastoma, several other types of cancer cases were among the results, including e.g. neuroblastomas, breast carcinomas, melanomas and lung carcinomas, which is in accordance with some previous observations [31][34]. CNA and associated data of those cases can be processed by online tools for further analysis and visualization (Figure S4) or downloaded for offline processing.
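The matching logic of such a compound query can be sketched as follows. This is a simplified illustration under stated assumptions, not the database query used by arrayMap: the function names are hypothetical, the two toy arrays are invented, and each array is represented as a plain list of (chromosome, start, end, type) tuples with type 1 for gain and -1 for loss. Only the four region strings are taken from the text.

```python
QUERY = [
    "chr7:55054219-55242524:1",     # EGFR, gain
    "chr10:89613175-89718511:-1",   # PTEN, loss
    "chr1:195319885-195382287:1",   # ASPM, gain
    "chr9:21957751-21984490:-1",    # CDKN2A, loss
]


def parse_region(spec):
    """Parse 'chr7:55054219-55242524:1' into (chrom, start, end, cna_type)."""
    chrom, span, cna_type = spec.split(":")
    start, end = (int(value) for value in span.split("-"))
    return chrom, start, end, int(cna_type)


def overlaps(segment, region):
    """True if the segment has the requested type and overlaps the region."""
    seg_chrom, seg_start, seg_end, seg_type = segment
    chrom, start, end, cna_type = region
    return (seg_chrom == chrom and seg_type == cna_type
            and seg_start <= end and seg_end >= start)


def match_arrays(arrays, specs):
    """Return IDs of arrays with a matching imbalance for every queried region."""
    regions = [parse_region(spec) for spec in specs]
    return [
        array_id
        for array_id, segments in arrays.items()
        if all(any(overlaps(seg, region) for seg in segments) for region in regions)
    ]


# Invented toy data: array_A carries all four imbalances, array_B lacks PTEN loss.
toy_arrays = {
    "array_A": [("chr7", 1, 158000000, 1), ("chr10", 1, 135000000, -1),
                ("chr1", 150000000, 245000000, 1), ("chr9", 21600000, 22400000, -1)],
    "array_B": [("chr7", 1, 158000000, 1), ("chr1", 150000000, 245000000, 1),
                ("chr9", 21600000, 22400000, -1)],
}
matches = match_arrays(toy_arrays, QUERY)
```

Requiring every region to be matched (the `all` over regions) is what makes the query a compound one; replacing it by `any` would return arrays with at least one of the imbalances.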

Copy number profiling of selected cancer entities.

One aim of arrayMap is to allow researchers to conveniently perform aCGH meta-analysis across different platforms. By selecting a single or several cancer entities e.g. based on their ICD entity codes or diagnostic keywords, users are able to generate disease specific CNA frequency profiles or to compare profiles of different cancer types.

As an example, we used ICD-O code 9440/3 (glioblastoma, NOS) to query the database. 1478 arrays from 25 publications were returned and passed to our suite of online analysis tools. Chromosomal ideograms and histograms were generated representing the frequency of copy number aberrations identified over the whole dataset (Figure 3A). In the overall aberration profile, the most common genomic imbalances included whole chromosome 7 gain and chromosome 10 loss, as well as focal gains e.g. on bands 1q21 and 17q21. In our example dataset, a prominent focal deletion hot-spot was centered around 9p21.3 (921 of 1478 arrays, 62.31%) which had been discussed previously [35]. The distribution of CNAs over the individual arrays was visualized through a matrix plot (Figure 3B). As additional information to the frequency histograms, this form of visualization facilitates e.g. the detection of CNA patterns among individual arrays as well as the concordance of individual CNAs (e.g. here the arm-level changes in chromosome 7 and 10).
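The frequency histogram is, in essence, a per-region count of arrays carrying a gain or loss, divided by the number of arrays. A minimal sketch of this calculation is given below, assuming fixed size genomic bins and arrays represented as lists of (chromosome, start, end, type) tuples; the function name, the binning scheme and the toy data are assumptions for illustration and not the arrayMap implementation.

```python
def frequency_profile(arrays, chrom, chrom_length, bin_size):
    """Percentage of arrays with a gain or a loss in each fixed size bin."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    gains = [0] * n_bins
    losses = [0] * n_bins
    for segments in arrays.values():
        gain_bins, loss_bins = set(), set()
        for seg_chrom, start, end, cna_type in segments:
            if seg_chrom != chrom:
                continue
            first = start // bin_size
            last = min((end - 1) // bin_size, n_bins - 1)
            if cna_type > 0:
                gain_bins.update(range(first, last + 1))
            elif cna_type < 0:
                loss_bins.update(range(first, last + 1))
        # Each array is counted at most once per bin.
        for index in gain_bins:
            gains[index] += 1
        for index in loss_bins:
            losses[index] += 1
    total = len(arrays)
    return ([100.0 * count / total for count in gains],
            [100.0 * count / total for count in losses])


# Invented toy data on a shortened 40 Mb "chromosome" with 10 Mb bins.
toy = {
    "a1": [("chr9", 21000000, 23000000, -1)],
    "a2": [("chr9", 0, 40000000, -1)],
    "a3": [("chr9", 5000000, 15000000, 1)],
    "a4": [("chr7", 0, 40000000, 1)],
}
gain_pct, loss_pct = frequency_profile(toy, "chr9", 40000000, 10000000)

# The 9p21.3 frequency quoted in the text follows the same arithmetic.
deletion_pct = round(100 * 921 / 1478, 2)
```

In the toy data the third bin, which contains the focal deletion, reaches the highest loss frequency because both the focal and the whole chromosome loss contribute to it.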

Figure 3. Copy number profiling of glioblastoma.

(A) Chromosomal ideogram and histogram showing the frequency of copy number aberrations. Percentage values correspond to gains (yellow) and losses (blue) identified over the whole dataset. The most frequent imbalances include gain of chromosome 7 and losses of chromosome 10 and 9p21.3. (B) Matrix plot of 1478 glioblastoma cases. The Y axis represents individual samples. The distribution of genomic copy number imbalances reveals the individual aberration patterns of glioblastoma. (C) Heatmap of regional CNA frequencies for 1478 arrays. The intensity of green and red color components correlates to the relative gain and loss frequencies, respectively. If the dataset contains cancer subtypes, cancers with similar CNA frequency profiles will be clustered together, such that differences between subtypes will be revealed (e.g. see Figure S4H).



In the matrix plot, clicking on a segment opens the related view in the UCSC genome browser [36] for detailed information on this genomic region (SVG plot only). The plot order of arrays can be re-sorted according to ICD morphology, ICD topography, clinical group or PubMed ID, which can be helpful in associating CNA patterns with external classification categories. For the selected classification criterion (default: ICD morphology), regional CNA frequencies for cases matching the different values will be visualized through a heatmap (Figure 3C); this feature is especially useful when comparing a number of different primary classification criteria.

An Overall Genomic Copy Number Profile of Cancer

Our high quality core dataset in arrayMap was used to generate an overall cancer copy number aberration profile based on 29,137 arrays (Figure 4). This data represented 177 cancer types according to ICD-O 3 code, of which 59 types contained more than 50 arrays. Overall, one of the most common genomic alterations is copy-number gain of chromosome band 8q24, which is found in 30% of total samples. According to the COSMIC [37] database, the most significant cancer gene in this region is MYC. It is a well-documented oncogene that codes for a transcription factor believed to regulate the expression of 15% of all genes, including genes involved in cell division, growth, and apoptosis [38][39]. Other common imbalances observed in at least 25% of oncogenomic arrays included gains of regions on e.g. 17q21 (29%) and 1q21 (33%), and loss of regions on e.g. 8p23 (32%) and 9p21 (25%), including focal deletions of the CDKN2A/B locus (Figure 2).

Figure 4. The overall cancer copy number aberration profile based on 29137 arrays.

This plot represents 177 cancer types according to ICD-O 3 code. Percentage values on the Y axis correspond to the proportion of gains (green) and losses (red) over the whole dataset.



While the overall CNA frequency distribution points towards DNA features targeted in multiple entities, this information is insufficient for deriving molecular mechanisms associated with specific cancer types. The genomic heterogeneity of different neoplasias is reflected in the varying patterns of regional CNA frequencies. Based on our core dataset, we have generated a heatmap-style visualization of frequency profiles for all ICD-O entities containing more than 50 arrays (Figure S5). The striking patterning of the CNA profiles indicates the non-random occurrence of CNAs, and should be seen as an invitation to explore e.g. CNA similarities shared by separate histopathological entities, as a way to transpose knowledge about pathophysiological mechanisms.

Discussion

arrayMap was developed to facilitate the progress of oncogenomic research. Our aim is to provide high-quality genomic copy number profiles of human tumors, along with a set of tools for accessing and analyzing CNA data. The service has been implemented with a straightforward web interface, including search options for CNA features and clinical annotation data. All assembled datasets are processed into platform independent segmentation and, for the vast majority of arrays, probe level data files, and are presented in consistent formats. Importantly, the direct access to precomputed probe level data plots supports a rapid evaluation of experiments for features of interest. As a curated database using standardized annotation schemes (e.g. ICD classification), arrayMap facilitates the exploration of cancer type specific CNA data, as well as the statistical association of genomic features to clinical parameters.

arrayMap is a dynamic database that is being continuously expanded and improved. We will review existing and newly published articles to update the database periodically. Over the past decade, we have witnessed a rapidly increasing number of aCGH publications, which gives us sufficient evidence to anticipate that cases in our database will continue to be deposited at a high rate. Although arrayMap is not a user driven repository, we welcome and support users interested in using the site for as yet undisclosed data, if they agree to data sharing upon publication.

Although, in contrast to the continuous data from expression analysis, copy number analysis explores discrete value spaces (countable number of DNA copies, for segments defined by genomic base positions), interpretation of the data can vary due to different low level (e.g. signal/background correction) and higher level (e.g. segmentation algorithms, regional or size based filtering) procedures. In that respect, we have to emphasize that the results of our data processing and annotation procedures are open to scrutiny. We encourage a critical review of individual results, and are open for suggestions regarding improved processing procedures for specific platforms.

In this paper, we have provided example scenarios of using arrayMap on different levels, i.e. locus centric and for entity profiling. We believe that systematic analyses will help researchers to discover features which are indiscernible in individual studies, and thus bring new insights into disease pathology and the development of new therapeutic approaches [40][43]. We expect that researchers will integrate arrayMap data with their own analysis efforts, e.g. to increase sample size or for result verification purposes. We hope that this database will promote further evolution of microarray data meta-analysis. ArrayMap provides access to more than 200 tumor types, which makes it suitable for research across cancer entities. Furthermore, normal sample controls are of vital importance for genomic imbalance studies. ArrayMap includes more than 3000 normal samples from healthy individuals or from normal tissues of cancer patients. These data could be integrated as a reference dataset, e.g. to account for copy number variation data superimposed on the tumor profiling results.

In the near future, with the continuous accumulation of very high resolution CNA data from genomic arrays and next-generation sequencing experiments, it will become possible to integrate these data into systems biology methods to elucidate effects of genomic instability, and describe the results from more perspectives. Envisioned examples would be e.g. the identification of genes that are involved in metastasis and treatment response; identification of chromosomal breakpoints distribution in cancer; and modeling functional networks in cancer by systems biology approaches.

Methods

Dataset Collection

Raw experimental data from a variety of platforms and repositories were extracted. They were converted to a uniform format suited to our reanalysis and visualization system. After a series of parsing procedures, the called copy number data is stored in arrayMap. The flowchart of arrayMap data collection and analysis is shown in Figure 5. Five main data sources are integrated into arrayMap:


Figure 5. The flowchart of arrayMap data collection and analysis procedures.

Publicly available raw data or segmented data was collected from the respective data sources. Files were re-processed by distinct procedures, according to the different data types. Probe coordinates were remapped to the most commonly encountered human reference genome assembly (NCBI Build 36/hg18). All probe specific ratios were converted to log2 values. Thresholds for genomic gain and loss were obtained from the original publications or series annotations; if not available, empirical thresholds were assigned. A minimum of 2 probes was required for calling a CNA segment, with higher values used on high-density arrays and/or in cases of excessive probe level noise. Processed probe and segment information was converted to uniform formats and stored in per-sample text files, which are accessed through the arrayMap web applications.



For extracting appropriate data Series from GEO/AE, two basic criteria have to be fulfilled. First, the raw data has to be from human malignancies analyzed by BAC, cDNA, aCGH or oligonucleotide arrays. Second, the array platform must be genome wide, with the optional omission of the sex chromosomes. Chromosome or region specific arrays were excluded because they were not able to reveal the whole genomic profile of the respective cancer. Associated clinical data was extracted if available.


Segmentation data with available clinical information was extracted and incorporated into the database. Due to data sharing restrictions, TCGA data is an exception in that, so far, no probe level data is incorporated into arrayMap. This exception was accepted since users will be able to access individual TCGA datasets through the project's web portal at http://tcga-data.nci.nih.gov/tcga/.


Many aCGH datasets can be found in the text or supplementary files of publications. In order to collect data from publications, we relied on our Progenetix project's setup. Data in Progenetix is manually curated. The collection strategies are:

  • literature mining using complex search parameters through PubMed
  • identification of called aCGH data, in GP annotation or tabular format (article, supplementary tables)
  • evaluation of supplementary files for probe specific data tables
  • follow-up on article link-outs, to repository entries or referenced datasets

User submission.

User submitted data was provided in a number of formats which were converted to the standard format as described. Although we accept and support private datasets, we insist on integration of at least the genomic and core clinical data (e.g. disease classifiers) upon publication of the analysis results for these datasets.

Dataset Analysis

Probe remapping.

A pipeline has been generated for determining the genomic positions of the tens to hundreds of thousands of array probes with reference to a common genome Golden Path edition. For each array platform, the genome positions of probes were remapped to the current commonly used version of the human reference genome assembly (NCBI Build 36.1/hg18). Specific mapping procedures were employed for different types of probes. BAC clones were first remapped according to the clone set information of the Sanger/DECIPHER database [44]. If the probe position was not available there, the UCSC Genome annotation database [36] (release hg18) was used as a fallback. After these two steps, a mean of 98% of the BAC clones were remapped. For IMAGE clone sets, only the UCSC Genome annotation database was used. The average remapping rate of IMAGE clones was 91%. Affymetrix raw CEL data files were analyzed based on hg18 library files, so that the output segments have hg18 coordinates. A summary of the percentage of mapped probes is given in Table 3. The mapping details for each platform can be found in Table S4.
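The two step lookup for BAC clones, with a primary annotation source and a fallback, can be sketched as below. The function name, the clone identifiers and the positions are hypothetical and serve only to illustrate the fallback logic and the calculation of a remapping rate; they do not reproduce the actual pipeline or annotation files.

```python
def remap_probes(probe_ids, primary, fallback):
    """Look up each probe in the primary source first, then in the fallback.

    Returns the mapped positions and the percentage of probes remapped.
    """
    mapped = {}
    for probe_id in probe_ids:
        position = primary.get(probe_id)
        if position is None:
            position = fallback.get(probe_id)
        if position is not None:
            mapped[probe_id] = position
    rate = 100.0 * len(mapped) / len(probe_ids) if probe_ids else 0.0
    return mapped, rate


# Invented clone names and hg18 positions, for illustration only.
primary_source = {
    "clone_1": ("chr1", 1000000, 1150000),
    "clone_2": ("chr2", 5000000, 5160000),
}
fallback_source = {
    "clone_2": ("chr2", 5000100, 5160100),  # ignored, primary source wins
    "clone_3": ("chr3", 7000000, 7140000),
}
positions, remap_rate = remap_probes(
    ["clone_1", "clone_2", "clone_3", "clone_4"], primary_source, fallback_source
)
```

Probes found in neither source, such as `clone_4` here, are simply left out, which is what lowers the remapping rate below 100%.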

Table 3. Percentage of remapped probes according to platform types.


Probe signal normalization.

The array data available was given in a variety of formats, most frequently as log2 ratio of probe hybridization intensity. In order to make data from different platforms directly comparable, all other types of normalized values were converted to log2. For dye swap experiments, reference/tumor intensity ratio data was “reversed” to represent a tumor/reference value. For some two-color arrays for which only raw signal intensities were provided, the normalized log2 ratio for each probe was calculated as

$$\log_2 \mathrm{ratio} = \log_2\left(\frac{T - T_{b}}{R - R_{b}}\right)$$

where $T$ and $T_{b}$ represent tumor sample intensity and tumor channel background intensity respectively, and $R$ and $R_{b}$ represent reference sample intensity and reference channel background intensity respectively. If multiple instances of the same clone exist, the average signal intensity of that clone was used.
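As a worked illustration of this normalization, the following sketch computes the log2 ratio of the background-subtracted tumor intensity over the background-subtracted reference intensity, and reverses the sign for dye swap experiments. The function names are hypothetical, and returning `None` for probes whose signal does not exceed background is an assumption of this sketch; the text does not state how such probes were handled.

```python
import math


def log2_ratio(tumor, tumor_bg, reference, reference_bg):
    """log2 of background-subtracted tumor over background-subtracted reference."""
    numerator = tumor - tumor_bg
    denominator = reference - reference_bg
    if numerator <= 0 or denominator <= 0:
        return None  # assumption: probes at or below background are skipped
    return math.log2(numerator / denominator)


def reverse_dye_swap(value):
    """Turn a reference/tumor log2 ratio into a tumor/reference log2 ratio."""
    return -value


def average_intensity(replicates):
    """Average signal intensity over multiple instances of the same clone."""
    return sum(replicates) / len(replicates)


# Invented intensities: (900 - 100) / (500 - 100) = 2, i.e. a log2 ratio of 1.
example_ratio = log2_ratio(900, 100, 500, 100)
```

Because log2(a/b) equals -log2(b/a), reversing a dye swap experiment only requires a sign change of the log2 value.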

Calling gains and losses according to the normalized log2 ratio is an important step in identifying copy number imbalances. For each re-analyzable dataset, related publications were explored to obtain the original threshold descriptions. If this information was not available, empirical thresholds were assigned and the resulting CNA calls were visually compared with probe value plots. Processing method and threshold information for each array are provided in Table S5.

Affymetrix genotyping arrays.

For the widely used Affymetrix Genome-Wide SNP arrays, raw CEL files were downloaded and re-analyzed using the R package aroma.affymetrix [45] with the CRMA v2 method [46]. During the processing step, approximately 50 normal sample arrays were employed as a reference set for each array type to reduce the noise level. Normal tissue arrays from different labs were extracted and used to build the reference dataset. In order to obtain high quality reference arrays, we excluded arrays which contained segments greater than 3 megabases, since copy number variations are generally smaller than 3 megabases. The list of normal tissue reference arrays is given in Table S6.
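The exclusion rule for normal reference arrays can be expressed as a simple filter. The sketch below assumes that each candidate array is represented by the (start, end) coordinates of its called segments; the function name, the array identifiers and the data are hypothetical, and only the 3 megabase limit is taken from the text.

```python
MAX_SEGMENT_BP = 3_000_000  # 3 megabases, as stated in the text


def select_reference_arrays(normal_arrays, max_segment_bp=MAX_SEGMENT_BP):
    """Keep normal arrays in which no called segment exceeds the size limit."""
    return [
        array_id
        for array_id, segments in normal_arrays.items()
        if all(end - start <= max_segment_bp for start, end in segments)
    ]


# Invented candidate arrays, for illustration only.
candidates = {
    "normal_1": [(1000000, 1400000), (52000000, 52900000)],
    "normal_2": [(20000000, 31000000)],   # 11 Mb segment, excluded
    "normal_3": [],                        # no called segments, kept
}
reference_set = select_reference_arrays(candidates)
```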

Quality control.

In our review of array data deposited in GEO or collected from publication supplements we encountered a large number of individual data sets with insufficient or limited probe quality. Also, for samples of unprocessed raw data (e.g. Affymetrix CEL files), we found that QC measures reported previously (e.g. call rate [47], NUSE [48], RLE [48]) only had a limited accuracy for detection of arrays with inadequate probe level data. Currently, the most viable strategy for quality assessment of processed, heterogeneous copy number arrays is the visual inspection of probe plotting and segmentation results through an experienced researcher. For the first arrayMap edition we generated a quality classification system, which contains a total of 4 categories based on inspections of genome-wide array plots:

  • Excellent. Probe signal distribution is significantly different between normal regions and imbalanced regions. The signal baseline is distinct and unique, so that segmentation thresholds can be set realistically. Chromosomal changes are clearly visible.
  • Good. In general good quality. Probe signal may contain some noise, but tolerable. Chromosomal changes are distinguishable.
  • Hypersegmented. Serrated distribution of probe signal intensities, causing dozens of separate peaks and discontinuous segments. Called chromosomal changes number up to several hundred and are smaller than 5 megabases.
  • Noisy. Probe signal intensities are highly scattered, but well-distributed, with high standard deviation, resulting in the inability to differentiate copy number changes.

Depending on the intended research purpose this basic classification system can be used for a pre-analysis triage of copy number data. Applying stringent review criteria we identified a core dataset with “excellent” quality arrays accounting for approximately 60 percent of total arrays. We are currently working on a platform independent quality assessment system for genomic arrays, which will be implemented in future versions of the arrayMap resource.

Associated data.

For arrayMap, data is stored with separate datasets for each array. This is in contrast to the Progenetix database, for which technical replicates where available are combined into case specific CNA profiles. In arrayMap, technical replicates are assigned an identical case identifier to facilitate downstream statistical procedures including e.g. clinical data correlations. The assignment of the correct diagnostic entity to each sample is an essential step in generating a binding between genomic and associated data points. At the same time, to ensure annotation consistency and make the retrieval process more efficient, for all CNA profiles the following data points were manually collected from GEO/ArrayExpress and published papers if available.

  • Descriptive diagnostic text, as available through the original source
  • Diagnostic classification according to the International Classification of Diseases in Oncology (ICDO 3, morphology with code)
  • Tumor locus according to ICD (ICD topography with code)
  • Source of material (e.g. primary tumor, cell line, metastasis)
  • Clinical parameters where available, including age, gender, grade, clinical stage (TNM coded), recurrence/progression, time to recurrence/progression, death and followup
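
The per-array annotation described above can be pictured as one record per array, with technical replicates sharing a case identifier. The following is a minimal sketch under that reading; the class name, field names and the sample identifiers are hypothetical and do not reflect the actual arrayMap schema.

```python
from dataclasses import dataclass, field


@dataclass
class SampleAnnotation:
    """One annotation record per array (field names are illustrative)."""
    array_id: str          # identifier of the single array
    case_id: str           # shared by technical replicates of one case
    diagnosis_text: str    # descriptive text from the original source
    icdo_morphology: str   # ICD-O 3 morphology code
    icd_topography: str    # ICD topography code
    material_source: str   # e.g. primary tumor, cell line, metastasis
    clinical: dict = field(default_factory=dict)  # age, stage, follow-up, ...


def group_by_case(annotations):
    """Collect array identifiers under their shared case identifier."""
    cases = {}
    for record in annotations:
        cases.setdefault(record.case_id, []).append(record.array_id)
    return cases


# Made-up example records: two replicates of one case plus one other case.
records = [
    SampleAnnotation("array-1", "case-A", "glioblastoma", "9440/3", "C71.9", "primary tumor"),
    SampleAnnotation("array-2", "case-A", "glioblastoma", "9440/3", "C71.9", "primary tumor"),
    SampleAnnotation("array-3", "case-B", "glioblastoma", "9440/3", "C71.9", "cell line"),
]
```

Grouping by `case_id` keeps each array as its own dataset while still allowing case-level statistics downstream.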

Web Server.

An online interface to the arrayMap database was created using Perl Common Gateway Interface (CGI) and R scripts running on Mac OS X Server. Sample and series data is stored using a MongoDB database engine (http://www.mongodb.org). Precomputed array plots are stored as flat files, mostly in both SVG and PNG versions. The online release of the service has been optimized to be compatible with major browsers supporting current web standards (CSS2, HTML5, XML with inline SVG; e.g. Safari >= 3.0, Firefox >= 3.0, Internet Explorer >= 9, Google Chrome) with limited fallback support. Dynamic graphics provided in the array plot module were implemented as server-side services using technologies including XML/XHTML, JavaScript, SVG and HTML5 Canvas.

In the future, we intend to revise the database content quarterly to ensure inclusion of newly published articles and GEO/AE entries. Archived versions of the sample annotations will be made available upon special request. Additional feature and small data updates will be performed as deemed necessary. The “News” page of Progenetix/arrayMap will be used for feature and content announcements.

Supporting Information

Figure S1.

Array data sets visualization. Original plots and optimized parameters for GSE21530, which contains 8 intimal sarcoma samples hybridized on the Agilent CGH Microarray 244A platform. The normalized probe signal log2 ratios and post-thresholding segmentation results for each array are displayed. Genomic alterations are represented by horizontal green (gain) and red (loss) lines. Alterations are defined here as regions with log2 ratio >0.15 or <−0.15. Simplified schemas of CNAs link to the UCSC genome browser for further review.
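
The thresholding rule stated in this caption can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the ±0.15 defaults are the values quoted for this series, so other platforms or series may need different thresholds.

```python
def call_segment(mean_log2, gain_threshold=0.15, loss_threshold=-0.15):
    """Label one segment from its mean log2 ratio.

    Defaults follow the caption: gain above 0.15, loss below -0.15.
    """
    if mean_log2 > gain_threshold:
        return "gain"
    if mean_log2 < loss_threshold:
        return "loss"
    return "neutral"


def call_segments(segments):
    """Apply the rule to (chrom, start, end, mean_log2) tuples,
    returning (chrom, start, end, call) tuples."""
    return [(chrom, start, end, call_segment(value))
            for chrom, start, end, value in segments]
```

Values exactly at a threshold are treated as neutral here because the caption uses strict inequalities.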


Figure S2.

Screenshot of single array visualization. ArrayMap plots for GSM630977 (acute myelogenous leukemia). Besides the whole genome view, subviews of each chromosome are displayed as well. These plots clearly reveal different kinds of genetic variation events, e.g. massive genomic rearrangement in chromosome 6, arm-level gain of chromosome 8q, and a 3 Mb focal change around 1p31.3. Through the “Plot Array Data” interface, users can segment the raw data values and re-plot the results with customized parameters.


Figure S3.

Plot single genomic region. In the “Plot Array Data” interface, input the precise location (chr5:1100000-1400000) in “Plot Region” field. Plots with this region were generated for all 8 arrays in the current series (GSE21530). In this region, there are 5 genes which are shown schematically as colored boxes. CNA status and copy number transition points for these genes are displayed.
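
A region string of the form used in this caption (chr5:1100000-1400000) can be parsed and matched against segments as in the following sketch. The function names and the tuple layout for segments are assumptions for illustration; this is not the code behind the “Plot Region” field.

```python
import re

# Accepts strings such as "chr5:1100000-1400000" (autosomes, X and Y).
REGION_RE = re.compile(r"^chr([0-9]{1,2}|X|Y):(\d+)-(\d+)$")


def parse_region(text):
    """Split a region string into (chromosome, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return chrom, start, end


def overlapping(segments, region):
    """Return the (chrom, start, end, value) segments that overlap the region."""
    chrom, start, end = region
    return [seg for seg in segments
            if seg[0] == chrom and seg[1] <= end and seg[2] >= start]
```

The overlap test treats both coordinates as inclusive, which is a simplifying choice made for this example.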


Figure S4.

Compound CNA query. (A) Four gene loci associated with glioblastoma (EGFR, PTEN, ASPM and CDKN2A) were inserted into the “Match (Multiple) Regions & Types” field. 303 out of 42421 arrays were returned. (B) Classification information for these 303 arrays was displayed and can be selected for the following analysis. (C) Statistical and plot parameters can be customized. Associated data was processed by online tools, and returned results included: (D) chromosomal ideogram and (E) histogram, showing the frequency of copy number aberrations; (F) matrix plot revealing the aberration pattern of selected arrays; (G) array classification tree generated by hierarchical Ward clustering, in which arrays with similar CNA frequencies fall on the same branch; (H) heatmap of CNA frequencies clustered by clinical group.
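
The logic of a compound query, in which an array is returned only if it matches every region and aberration type requested, can be sketched as below. The data layout, the function names and all coordinates are placeholders invented for this example; they are not real gene positions and not the arrayMap query engine.

```python
def matches(calls, criterion):
    """True if any called CNA overlaps the criterion region with the same type.

    calls     -- list of (chrom, start, end, cna_type) for one array
    criterion -- (chrom, start, end, cna_type)
    """
    chrom, start, end, cna_type = criterion
    return any(c == chrom and s <= end and e >= start and t == cna_type
               for c, s, e, t in calls)


def compound_query(arrays, criteria):
    """Return the sorted ids of arrays that satisfy ALL criteria."""
    return sorted(array_id for array_id, calls in arrays.items()
                  if all(matches(calls, criterion) for criterion in criteria))


# Toy data with placeholder coordinates.
arrays = {
    "array-1": [("7", 100, 900, "gain"), ("10", 200, 800, "loss")],
    "array-2": [("7", 100, 900, "gain")],
    "array-3": [("7", 100, 900, "loss"), ("10", 200, 800, "loss")],
}
criteria = [("7", 400, 500, "gain"), ("10", 300, 400, "loss")]
```

With the toy data, only `array-1` carries both the gain and the loss, so it is the only array returned.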


Figure S5.

Heatmap of frequency profiles for 59 cancer types. Heatmap visualization of frequency profiles for all ICD-O entities containing more than 50 arrays in our core dataset. Region-specific gain/loss frequencies were mapped to 1 Mb intervals. The intensity of colors (green: gains; red: losses) corresponds to the relative frequency of CNAs for each interval.
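
Mapping gain or loss frequencies to 1 Mb intervals amounts to counting, for each interval, the fraction of arrays with at least one matching CNA that touches it. The sketch below shows this for a single chromosome; the function names and the data layout are assumptions for illustration, with inclusive coordinates chosen for simplicity.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 Mb intervals, as in the caption


def bins_covered(start, end, bin_size=BIN_SIZE):
    """Indices of the intervals touched by an inclusive [start, end] range."""
    return range(start // bin_size, end // bin_size + 1)


def cna_frequencies(arrays, cna_type, chrom, bin_size=BIN_SIZE):
    """Fraction of arrays showing `cna_type` in each interval of `chrom`.

    arrays -- dict of array id -> list of (chrom, start, end, cna_type)
    Only intervals hit by at least one array appear in the result.
    """
    if not arrays:
        return {}
    counts = Counter()
    for calls in arrays.values():
        hit = set()  # count each array at most once per interval
        for c, start, end, t in calls:
            if c == chrom and t == cna_type:
                hit.update(bins_covered(start, end, bin_size))
        counts.update(hit)
    total = len(arrays)
    return {index: counts[index] / total for index in sorted(counts)}
```

A heatmap row for one cancer type would then be the gain and loss frequency vectors across all intervals of the genome.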


Table S1.

Entities extracted from NCBI GEO and EBI ArrayExpress.


Table S2.

Cancer entities grouped by ICD-O code.


Table S3.

Platform type distribution in arrayMap.


Table S4.

Probe remapping rate for platforms.


Table S5.

Processing method and threshold for calling genomic gains and losses.


Table S6.

Normal tissue reference arrays for Affymetrix platforms.


Acknowledgments

We want to thank Christian von Mering, Homayoun Bagheri, Henrik Bengtsson and Nuria Lopez-Bigas for helpful discussions.

Author Contributions

Conceived and designed the experiments: HC NK MB. Performed the experiments: HC MB. Analyzed the data: HC NK MB. Contributed reagents/materials/analysis tools: HC NK MB. Wrote the paper: HC MB.

References

  1. Stallings RL (2007) Are chromosomal imbalances important in cancer? Trends in genetics : TIG 23: 278–283. doi: 10.1016/j.tig.2007.03.009
  2. Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, et al. (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25: 7324–7332.
  3. Weir BA, Woo MS, Getz G, Perner S, Ding L, et al. (2007) Characterizing the cancer genome in lung adenocarcinoma. Nature 450: 893–898.
  4. Wiedemeyer R, Brennan C, Heffernan TP, Xiao Y, Mahoney J, et al. (2008) Feedback circuit among INK4 tumor suppressors constrains human glioblastoma development. Cancer cell 13: 355–364.
  5. Mullighan CG, Goorha S, Radtke I, Miller CB, Coustan-Smith E, et al. (2007) Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 446: 758–764.
  6. Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061–1068.
  7. Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, et al. (2010) Diverse somatic mutation patterns and pathway alterations in human cancers. Nature 466: 869–873.
  8. Bergamaschi A, Kim YH, Wang P, Sørlie T, Hernandez-Boussard T, et al. (2006) Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene expression subtypes of breast cancer. Genes, chromosomes & cancer 45: 1033–1040.
  9. Hu X, Stern HM, Ge L, O’Brien C, Haydu L, et al. (2009) Genetic alterations and oncogenic pathways associated with breast cancer subtypes. Molecular cancer research : MCR 7: 511–522.
  10. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, et al. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science (New York, NY) 258: 818–821.
  11. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, et al. (1999) Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature genetics 23: 41–46.
  12. Bignell GR, Huang J, Greshock J, Watt S, Butler A, et al. (2004) High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome research 14: 287–295.
  13. Baudis M (2007) Genomic imbalances in 5918 malignant epithelial tumors: an explorative meta-analysis of chromosomal CGH data. BMC cancer 7: 226.
  14. Alloza E, Al-Shahrour F, Cigudosa JC, Dopazo J (2011) A large scale survey reveals that chromosomal copy-number alterations significantly affect gene modules involved in cancer initiation and progression. BMC medical genomics 4: 37.
  15. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, et al. (2011) NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic acids research 39: D1005–10.
  16. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, et al. (2010) ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic acids research 39: D1002–D1004.
  17. Scheinin I, Myllykangas S, Borze I, Böhling T, Knuutila S, et al. (2008) CanGEM: mining gene copy number changes in cancer. Nucleic acids research 36: D830–5.
  18. Cao Q, Zhou M, Wang X, Meyer CA, Zhang Y, et al. (2011) CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data. Nucleic acids research 39: D968–74.
  19. Baudis M, Cleary ML (2001) Progenetix.net: an online repository for molecular cytogenetic aberration data. Bioinformatics (Oxford, England) 17: 1228–1229.
  20. Baumbusch LO, Aarøe J, Johansen FE, Hicks J, Sun H, et al. (2008) Comparison of the Agilent, ROMA/NimbleGen and Illumina platforms for classification of copy number alterations in human breast tumors. BMC genomics 9: 379.
  21. Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, et al. (2009) The pitfalls of platform comparison: DNA copy number array technologies assessed. BMC genomics 10: 588.
  22. Greshock J, Feng B, Nogueira C, Ivanova E, Perna I, et al. (2007) A comparison of DNA copy number profiling platforms. Cancer research 67: 10173–10180.
  23. Bengtsson H, Ray A, Spellman P, Speed TP (2009) A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods. Bioinformatics (Oxford, England) 25: 861–867.
  24. Heinrichs S, Look T (2007) Identification of structural aberrations in cancer by SNP array analysis. Genome biology. pp. 1–5.
  25. Carter NP (2007) Methods and strategies for analyzing copy number variation using DNA microarrays. Nature genetics 39: S16–S21.
  26. Lubin M, Lubin A (2009) Selective killing of tumors deficient in methylthioadenosine phosphorylase: a novel strategy. PloS one 4: e5735.
  27. Krasinskas AM, Bartlett DL, Cieply K, Dacic S (2010) CDKN2A and MTAP deletions in peritoneal mesotheliomas are correlated with loss of p16 protein expression and poor survival. Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc 23: 531–538.
  28. Smith JS, Tachibana I, Passe SM, Huntley BK, Borell TJ, et al. (2001) PTEN mutation, EGFR amplification, and outcome in patients with anaplastic astrocytoma and glioblastoma multiforme. Journal of the National Cancer Institute 93: 1246–1256.
  29. Li J (1997) PTEN, a Putative Protein Tyrosine Phosphatase Gene Mutated in Human Brain, Breast, and Prostate Cancer. Science (New York, NY) 275: 1943–1947.
  30. Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, et al. (2006) Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proceedings of the National Academy of Sciences of the United States of America 103: 17402–17407.
  31. Zhang W, Zhu J, Bai J, Jiang H, Liu F, et al. (2010) Comparison of the inhibitory effects of three transcriptional variants of CDKN2A in human lung cancer cell line A549. Journal of experimental & clinical cancer research : CR 29: 74.
  32. van der Rhee JI, Krijnen P, Gruis NA, de Snoo FA, Vasen HFA, et al. (2011) Clinical and histologic characteristics of malignant melanoma in families with a germline mutation in CDKN2A. Journal of the American Academy of Dermatology.
  33. Bourdeaut F, Isidor B, Ferrand S, Thomas C, Moreau A, et al. (2011) Homozygous PTEN deletion in neuroblastoma arising in a child with Cowden syndrome. American journal of medical genetics Part A 155: 1763–1766.
  34. Jin K, Kong X, Shah T, Penet MF, Wildes F, et al. (2011) Breast Cancer Special Feature: The HOXB7 protein renders breast cancer cells resistant to tamoxifen through activation of the EGFR pathway. Proceedings of the National Academy of Sciences of the United States of America.
  35. Wiltshire RN, Rasheed BK, Friedman HS, Friedman AH, Bigner SH (2000) Comparative genetic patterns of glioblastoma multiforme: potential diagnostic tool for tumor classification. Neuro-oncology 2: 164–173.
  36. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, et al. (2011) The UCSC Genome Browser database: update 2011. Nucleic acids research 39: D876–82.
  37. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, et al. (2011) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research 39: D945–50.
  38. Gearhart J, Pashos EE, Prasad MK (2007) Pluripotency redux–advances in stem-cell research. The New England journal of medicine 357: 1469–1472.
  39. Dalla-Favera R, Bregni M, Erikson J, Patterson D, Gallo RC, et al. (1982) Human c-myc onc gene is located on the region of chromosome 8 that is translocated in Burkitt lymphoma cells. Proceedings of the National Academy of Sciences of the United States of America 79: 7824–7827.
  40. Climent J, Dimitrow P, Fridlyand J, Palacios J, Siebert R, et al. (2007) Deletion of chromosome 11q predicts response to anthracycline-based chemotherapy in early breast cancer. Cancer research 67: 818–826.
  41. Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, et al. (2006) Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer cell 10: 529–541.
  42. Stevens KN, Fredericksen Z, Vachon CM, Wang X, Margolin S, et al. (2012) 19p13.1 is a triple negative-specific breast cancer susceptibility locus. Cancer research.
  43. Park NI, Rogan PK, Tarnowski HE, Knoll JHM (2012) Structural and genic characterization of stable genomic regions in breast cancer: Relevance to chemotherapy. Molecular oncology.
  44. Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, et al. (2009) DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. The American Journal of Human Genetics 84: 524–533.
  45. Bengtsson H, Simpson K, Bullard J, Hansen K (2008) aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Tech Report #745 Department of Statistics, University of California, Berkeley.
  46. Bengtsson H, Wirapati P, Speed TP (2009) A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics (Oxford, England) 25: 2149–2156.
  47. Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, et al. (2010) Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic Epidemiology 34: 591–602.
  48. F C, AL A, SA K, TP S, VL SM (2005) NUSE and RLE: Quality assessment of oligonucleotide microarray data to quantify systemic variation. 2005 Meeting of the Federation of Clinical Immunology Societies Boston, MA.




How Mobile Elements in “Junk” DNA Promote Cancer – Part 1: Transposon-mediated Tumorigenesis

Author, Writer and Curator: Stephen J. Williams, Ph.D.



Landscape of Somatic Retrotransposition in Human Cancers. Science (2012); Vol. 337:967-971. (1)

Sequencing of the human genome via massive programs such as The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE) consortium, in conjunction with considerable bioinformatics efforts led by the National Center for Biotechnology Information (NCBI), has unlocked a myriad of as yet unclassified genes (for a good review see (2)). The ENCODE project encompasses 32 institutions worldwide which, so far, have generated 1640 data sets, initially depending on microarray platforms but now moving to the more cost-effective new sequencing technology. Initially the ENCODE project focused on three types of cells: an immature white blood cell line GM12878, leukemic line K562, and an approved human embryonic stem cell line H1-hESC. The analysis was rapidly expanded to another 140 cell types. DNA sequencing had revealed 20,687 known coding regions with hints of 50 more coding regions. Another 11,224 DNA stretches were classified as pseudogenes. The ENCODE project reveals that many genes encode an RNA rather than a protein product: the so-called regulatory RNAs.

However, some of the most recent and interesting results focus on the noncoding regions of the human genome, previously discarded as uninteresting or “junk” DNA. Only 2% of the human genome contains coding regions, while the remaining 98%, the noncoding part of the genome, is actually found to be highly active “with about 4 million constantly communicating switches” (3). Some of these “switches” in the noncoding portion contain small, repetitive elements which are mobile throughout the genome and can control gene expression and/or predispose to diseases such as cancer. These mobile elements, found in almost all organisms, are classified as transposable elements (TE), which insert themselves into far-reaching regions of the genome. Retro-transposons are capable of generating new insertions through RNA intermediates. These transposable elements are normally kept immobile by epigenetic mechanisms (4-6); however, some TEs can escape epigenetic repression and insert in new areas of the genome, a process described as insertional mutagenesis because it can lead to gene alterations seen in disease (7). In addition, this insertional mutagenesis can lead to the transformation of cells and, as described in Post 2, act as a model system to determine drivers of oncogenesis. Insertional mutagenesis is a mechanism distinct from other genetic alterations and rearrangements seen in cancer, such as the recombination and fusion of gene fragments seen with the Philadelphia chromosome and the BCR/ABL fusion protein (8). The mechanism of transposition and putative effects leading to mutagenesis are described in the following figure:


Figure. Insertional mutagenesis based on a transposon-mediated mechanism. A) The basic structure of a transposon is a gene/sequence flanked by two inverted repeats (IR) and/or direct repeats (DR). An enzyme, the transposase (red hexagon), binds and cuts at the IR/DR, and the transposon is pasted at another site in DNA containing an insertion site. B) Multiple transpositions may result in oncogenic events by inserting in promoters, leading to altered expression of genes driving oncogenesis, or by inserting within coding regions and inactivating tumor suppressors or activating oncogenes. Deep sequencing of the resultant tumor genomes (based on nested PCR from IR/DRs) may reveal common insertion sites (CIS), from which oncogenic mutations could be identified.
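
The idea of a common insertion site (CIS) in panel B can be illustrated by grouping nearby insertion positions and keeping groups supported by more than one tumor. The sketch below uses simple distance-based grouping; the function names, the 10 kb window and the two-tumor minimum are assumptions for illustration, and real CIS analyses typically rely on statistical models rather than a fixed window.

```python
from collections import defaultdict


def _emit(chrom, cluster, min_tumors, sites):
    """Record a cluster if enough distinct tumors contribute to it."""
    tumors = {tumor for _, tumor in cluster}
    if len(tumors) >= min_tumors:
        sites.append((chrom, cluster[0][0], cluster[-1][0], len(tumors)))


def common_insertion_sites(insertions, window=10_000, min_tumors=2):
    """Group insertions lying within `window` bp of their neighbor.

    insertions -- iterable of (tumor_id, chrom, position)
    Returns sorted (chrom, first_pos, last_pos, n_tumors) tuples.
    """
    by_chrom = defaultdict(list)
    for tumor, chrom, pos in insertions:
        by_chrom[chrom].append((pos, tumor))
    sites = []
    for chrom, hits in by_chrom.items():
        hits.sort()
        cluster = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - cluster[-1][0] <= window:
                cluster.append(hit)   # still within the same candidate site
            else:
                _emit(chrom, cluster, min_tumors, sites)
                cluster = [hit]       # start a new candidate site
        _emit(chrom, cluster, min_tumors, sites)
    return sorted(sites)
```

Requiring distinct tumors prevents several insertions from a single tumor from being counted as a recurrent site.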

In a bioinformatics study, Eunjung Lee et al. (1), in collaboration with the Cancer Genome Atlas Research Network, analyzed 43 high-coverage whole-genome sequencing datasets from five cancer types to determine transposable element insertion sites. Using a novel computational method, the authors identified 194 high-confidence somatic TE insertion sites present in cancers of epithelial origin such as colorectal, prostate and ovarian, but not in brain or blood cancers. Sixty-four of the 194 detected somatic TE insertions were located within 62 annotated genes. Genes with TE insertions in colon cancers commonly have high mutation rates, and enriched genes were associated with cell adhesion functions (CDH12, ROBO2, NRXN3, FPR2, COL1A1, NEGR1, NTM and CTNNA2) or tumor suppressor functions (NELL1, ROBO2, DBC1, and PARK2). None of the somatic events were located within coding regions, with the TE sequences being detected in untranslated regions (UTR) or intronic regions. Previous studies had shown that insertion in these regions (UTR or intronic) can disrupt gene expression (9). Interestingly, most of the genes with insertion sites were down-regulated. Recent work has shown that local changes in the methylation status of transposable elements can drive retro-transposition (10,11). Indeed, the authors found that somatic insertions are biased toward the hypomethylated regions in cancer cell DNA. The authors also confirmed that the insertion sites were unique to cancer and were somatic insertions, not germline in origin (germline: inherited and present in all cells of the body), by analyzing 44 normal genomes (41 normal blood samples from cancer patients and three healthy individuals).
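
The somatic versus germline distinction drawn here rests on comparing tumor calls against normal genomes: an insertion also seen in normal tissue is not counted as somatic. The sketch below shows that comparison in its simplest form. The function name, the data layout and the 100 bp matching tolerance are assumptions for illustration; this is not the computational method used by Lee et al.

```python
def somatic_insertions(tumor_calls, normal_calls, tolerance=100):
    """Keep tumor insertions with no normal-sample insertion nearby.

    tumor_calls, normal_calls -- lists of (chrom, position)
    tolerance -- hypothetical distance in bp within which two calls
                 are treated as the same insertion
    """
    normal_by_chrom = {}
    for chrom, pos in normal_calls:
        normal_by_chrom.setdefault(chrom, []).append(pos)
    somatic = []
    for chrom, pos in tumor_calls:
        seen_in_normal = any(abs(pos - normal_pos) <= tolerance
                             for normal_pos in normal_by_chrom.get(chrom, []))
        if not seen_in_normal:
            somatic.append((chrom, pos))
    return somatic
```

As a worked figure from the text, 64 of the 194 somatic insertions (about 33%) fell within annotated genes.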

The authors conclude:

“that some TE insertions provide a selective advantage during tumorigenesis, rather than being merely passenger events that precede clonal expansion(1).”

The authors also suggest that more bioinformatics studies, which utilize the expansive genomic and epigenetic databases, could determine the functional consequences of such transposable elements in cancer. The following post will describe how the use of transposon-mediated insertional mutagenesis is leading to discoveries of the drivers (main genetic events) leading to oncogenesis.

1. Lee, E., Iskow, R., Yang, L., Gokcumen, O., Haseley, P., Luquette, L. J., 3rd, Lohr, J. G., Harris, C. C., Ding, L., Wilson, R. K., Wheeler, D. A., Gibbs, R. A., Kucherlapati, R., Lee, C., Kharchenko, P. V., and Park, P. J. (2012) Science 337, 967-971

2. Pennisi, E. (2012) Science 337, 1159, 1161

3. Park, A. (2012) Don’t Trash These Genes. “Junk” DNA may lead to valuable cures. In Time, Time, Inc., New York, N.Y.

4. Maksakova, I. A., Mager, D. L., and Reiss, D. (2008) Cellular and molecular life sciences : CMLS 65, 3329-3347

5. Slotkin, R. K., and Martienssen, R. (2007) Nature reviews. Genetics 8, 272-285

6. Yang, N., and Kazazian, H. H., Jr. (2006) Nature structural & molecular biology 13, 763-771

7. Hancks, D. C., and Kazazian, H. H., Jr. (2012) Current opinion in genetics & development 22, 191-203

8. Sattler, M., and Griffin, J. D. (2001) International journal of hematology 73, 278-291

9. Han, J. S., Szak, S. T., and Boeke, J. D. (2004) Nature 429, 268-274

10. Reichmann, J., Crichton, J. H., Madej, M. J., Taggart, M., Gautier, P., Garcia-Perez, J. L., Meehan, R. R., and Adams, I. R. (2012) PLoS computational biology 8, e1002486

11. Byun, H. M., Heo, K., Mitchell, K. J., and Yang, A. S. (2012) Journal of biomedical science 19, 13

Other research papers on ENCODE and cancer published on this scientific website are as follows:

Expanding the Genetic Alphabet and linking the genome to the metabolome

Junk DNA codes for valuable miRNAs: non-coding DNA controls Diabetes

ENCODE Findings as Consortium

Reveals from ENCODE project will invite high synergistic collaborations to discover specific targets

ENCODE: the key to unlocking the secrets of complex genetic diseases

Impact of evolutionary selection on functional regions: The imprint of evolutionary selection on ENCODE regulatory elements is manifested between species and within human populations

Metabolite Identification Combining Genetic and Metabolic Information: Genetic association links unknown metabolites to functionally related genes

Advances in Separations Technology for the “OMICs” and Clarification of Therapeutic Targets

Commentary on Dr. Baker’s post “Junk DNA codes for valuable miRNAs: non-coding DNA controls Diabetes”

Cancer Genomics – Leading the Way by Cancer Genomics Program at UC Santa Cruz
