Funding, Deals & Partnerships: BIOLOGICS & MEDICAL DEVICES; BioMed e-Series; Medicine and Life Sciences Scientific Journal – http://PharmaceuticalIntelligence.com
Mozilla Science Lab Promotes Data Reproduction Through Open Access: Report from 9/10/2015 Online Meeting
Reporter: Stephen J. Williams, Ph.D.
Mozilla Inc. is developing a platform for scientists to discuss the issues related to developing a framework to share scientific data as well as tackle the problems of scientific reproducibility in an Open Access manner. According to their blog:
We’re excited to announce the launch of the Mozilla Science Lab, a new initiative that will help researchers around the world use the open web to shape science’s future.
Scientists created the web — but the open web still hasn’t transformed scientific practice to the same extent we’ve seen in other areas like media, education and business. For all of the incredible discoveries of the last century, science is still largely rooted in the “analog” age. Credit systems in science are still largely based around “papers,” for example, and as a result researchers are often discouraged from sharing, learning, reusing, and adopting the type of open and collaborative learning that the web makes possible.
The Science Lab will foster dialog between the open web community and researchers to tackle this challenge. Together they’ll share ideas, tools, and best practices for using next-generation web solutions to solve real problems in science, and explore ways to make research more agile and collaborative.
On their blog they highlight various projects related to promoting Open Access for scientific data.
On September 10, 2015, Mozilla Science Lab held its scheduled meeting on scientific data reproducibility. The meeting was free and was covered online and on social media. The Twitter hashtag for updates and meeting discussion is #mozscience (https://twitter.com/search?q=%23mozscience)
Questions regarding coding projects – Abby will coordinate efforts on coding into their codebase
The journal will publish, and authors and reviewers get a badge; their efforts and comments will appear on GigaScience. GigaScience will give credit for your reviews and supports an Open Science discussion.
Miss the submission deadline? You can still apply to join our Open Research Accelerator and join us for the event (PLUS get a DOI for your submission and 1:1 help)
ReScience is dedicated to publishing replications of previously published computational studies, along with all the code required to replicate the results.
ReScience lives entirely on GitHub. Submissions take the form of a Git repository, and review takes place in the open through GitHub issues. This also means that ReScience is free for everyone (authors, readers, reviewers, editors… well, I said everyone, right?), as long as GitHub is willing to host it.
ReScience was launched just a few days ago and is evolving quickly. To stay up to date, follow @ReScienceEds on Twitter. If you want to volunteer as a reviewer, please contact the editorial board.
The ReScience Journal Reproducible Science is Good. Replicated Science is better.
ReScience is a peer-reviewed journal that targets computational research and encourages the explicit reproduction of already published research promoting new and open-source implementations in order to ensure the original research is reproducible. To achieve such a goal, the whole editing chain is radically different from any other traditional scientific journal. ReScience lives on github where each new implementation is made available together with the comments, explanations and tests. Each submission takes the form of a pull request that is publicly reviewed and tested in order to guarantee any researcher can re-use it. If you ever reproduced computational result from the literature, ReScience is the perfect place to publish this new implementation. The Editorial Board
Notes from his talk:
– must be able to replicate paper’s results as written according to experimental methods
– All authors on ReScience need to be on GitHub
– not accepting MATLAB replications; replication can involve computational replication
Research Ideas and Outcomes Journal – Daniel Mietchen @EvoMRI
Postdoc at Natural Museum of London doing data mining; it is a huge waste that 90% of research proposals don’t get used, so this journal allows for publishing proposals
Learned how to write proposals by finding a proposal online open access
Reviewing system based on online reviews like GoogleDocs where people view, comment
Growing editorial and advisory board; venturing into new subject areas like humanities, economics, biological research so they are trying to link diverse areas under SOCIAL IMPACT labeling
BIG question how to get scientists to publish their proposals especially to improve efficiency of collaboration and reduce too many duplicated efforts as well as reagent sharing
Crowdfunding platform used as post publication funding mechanism; still in works
They need a lot of help on the editorial board, so if you have a PhD PLEASE JOIN
This portion of the transcription series deals with transcription factors and the effects of their binding on metabolism. This also has implications for pharmaceutical target identification.
The Functional Consequences of Variation in Transcription Factor Binding
DA. Cusanovich, B Pavlovic, JK. Pritchard*, Y Gilad*
1 Department of Human Genetics, 2 Howard Hughes Medical Institute, University of Chicago, Chicago, IL 3 Departments of Genetics and Biology and Howard Hughes Medical Institute, Stanford University, Stanford, CA.
PLoS Genet 2014;10(3):e1004226. http://dx.doi.org/10.1371/journal.pgen.1004226
One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly
play an important role in determining gene expression outputs, yet
the regulatory logic underlying functional transcription factor binding is poorly understood.
An important question in genomics is to understand how a class of proteins called “transcription factors” controls the expression level of other genes in the genome in a cell type-specific manner – a process that is essential to human development. One major approach to this problem is to study where these transcription factors bind in the genome, but this does not tell us about the effect of that binding on gene expression levels and
it is generally accepted that much of the binding does not strongly influence gene expression.
To address this issue, we artificially reduced the concentration of 59 different transcription factors in the cell and then
examined which genes were impacted by the reduced transcription factor level.
Our results implicate some attributes
that might influence what binding is functional, but they also suggest that
a simple model of functional vs. non-functional binding may not suffice.
Many studies have focused on characterizing the genomic locations of TF binding, but
it is unclear whether TF binding at any specific locus has
functional consequences with respect to gene expression output.
We knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line
to evaluate the context of functional TF binding.
We then identified genes whose expression was affected by the knockdowns
by intersecting the gene expression data with transcription factor binding data
(based on ChIP-seq and DNase-seq)
within 10 kb of the transcription start sites of expressed genes.
This combination of data allowed us to infer functional TF binding.
Only a small subset of genes bound by a factor were
differentially expressed following the knockdown of that factor,
suggesting that most interactions between TF and chromatin
do not result in measurable changes in gene expression levels
of putative target genes.
We found that functional TF binding is enriched
in regulatory elements that harbor a large number of TF binding sites,
at sites with predicted higher binding affinity, and
at sites that are enriched in genomic regions annotated as “active enhancers.”
We aim to be able to predict the expression pattern of a gene based on its regulatory
sequence alone. However, the regulatory code of the human genome is much more complicated than
the triplet code of protein coding sequences, and is highly context-specific,
depending on cell-type and other factors.
Moreover, regulatory regions are not necessarily organized into
discrete, easily identifiable regions of the genome and
may exert their influence on genes over large genomic distances.
Genomic studies addressing questions of the regulatory logic of the human genome have largely taken one of two approaches:
(1) collecting transcription factor binding maps using techniques such as ChIP-seq
and DNase-seq, or
(2) mapping various quantitative trait loci (QTL), such as gene expression levels
(eQTLs) [7], DNA methylation (meQTLs) [8] and chromatin accessibility (dsQTLs).
Cumulatively, binding map studies and QTL map studies have
led to many insights into the principles and mechanisms of gene regulation.
However, there are questions that neither mapping approach on its own is well equipped to address. One outstanding issue is
the fraction of factor binding in the genome that is “functional”,
which we define here to mean that
disturbing the protein-DNA interaction leads to a measurable
downstream effect on gene regulation.
Transcription factor knockdown could be used to address this problem, whereby
the RNA interference pathway is employed to greatly reduce
the expression level of a specific target gene by using small interfering RNAs (siRNAs).
The response to the knockdown can then be measured by collecting RNA after the knockdown and
measuring global changes in gene expression patterns
after specifically attenuating the expression level of a given factor.
Combining a TF knockdown approach with TF binding data can help us to
distinguish functional binding from non-functional binding.
This approach has previously been applied to the study of human TFs, although for the most part studies have only focused on
the regulatory relationship of a single factor with its downstream targets.
The FANTOM consortium knocked down 52 different transcription factors in
the THP-1 cell line, an acute monocytic leukemia-derived cell line, and
used a subset of these to validate certain regulatory predictions based on binding motif enrichments.
We and others previously studied the regulatory architecture of gene expression in
the model system of HapMap lymphoblastoid cell lines (LCLs) using both
binding map strategies and QTL mapping strategies.
We now sought to use knockdown experiments targeting transcription factors in a HapMap LCL
to refine our understanding of the gene regulatory circuitry of the human genome.
Therefore, we integrated the results of the knockdown experiments with previous data on TF binding to
better characterize the regulatory targets of 59 different factors and
to learn when a disruption in transcription factor binding
is most likely to be associated with variation in the expression level of a nearby gene.
Gene expression levels following the knockdown were compared to
expression data collected from six samples that were transfected with negative control siRNA.
The expression data from all samples were normalized together using
quantile normalization followed by batch correction using the RUV-2 method.
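The quantile normalization step mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `quantile_normalize` and the toy values are hypothetical, ties are broken by position rather than averaged, and the RUV-2 batch correction is not shown.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns.

    Each inner list holds the expression values of one sample (array),
    with genes in the same order across samples.
    """
    n_genes = len(columns[0])
    # Reference distribution: mean across samples of the k-th smallest value.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_genes)
    ]
    normalized = []
    for col in columns:
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data (hypothetical): three samples, four genes.
toy_samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
toy_normalized = quantile_normalize(toy_samples)
```

After this step every sample has the same set of values, and only the ordering of genes within each sample differs, which is what makes the arrays directly comparable.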
We then performed several quality control analyses to confirm
that the quality of the data was high,
that there were no outlier samples, and
that the normalization methods reduced the influence of confounders.
In order to identify genes that were expressed at a significantly different level
in the knockdown samples compared to the negative controls,
we used likelihood-ratio tests within the framework of a fixed effect linear model.
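As a rough illustration of a likelihood-ratio test in a fixed effect linear model, the sketch below compares a null model (one shared mean) with a model that has separate means for knockdown and control samples, for a single gene. The function name `lrt_two_group` and the toy values are hypothetical, the model assumes Gaussian noise, and the chi-square approximation with one degree of freedom is only approximate for nine samples; the exact model specification used in the paper is not given in this summary.

```python
import math


def lrt_two_group(knockdown, control):
    """Likelihood-ratio test for a group effect in y = mu + beta*group + noise.

    Returns (test statistic, approximate p-value).
    """
    values = list(knockdown) + list(control)
    n = len(values)
    grand_mean = sum(values) / n
    # Null model: a single mean shared by all samples.
    rss_null = sum((v - grand_mean) ** 2 for v in values)
    # Alternative model: one mean per group.
    rss_alt = 0.0
    for group in (knockdown, control):
        group_mean = sum(group) / len(group)
        rss_alt += sum((v - group_mean) ** 2 for v in group)
    if rss_alt == 0.0:
        raise ValueError("no within-group variance; the statistic is undefined")
    # For Gaussian models the LR statistic reduces to n * log(RSS0 / RSS1).
    stat = n * math.log(rss_null / rss_alt)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy log2 expression values (hypothetical): 3 knockdown replicates, 6 controls.
toy_stat, toy_p = lrt_two_group(
    [7.0, 7.1, 6.9], [8.0, 8.1, 7.9, 8.0, 8.2, 7.8]
)
```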
Following normalization and quality control of the arrays,
we identified genes that were differentially expressed between
the three knockdown replicates of each factor and the six controls.
Depending on the factor targeted, the knockdowns resulted in
between 39 and 3,892 differentially expressed genes at an FDR of 5%
(Figure 1B; see Table S3 for a summary of the results).
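The summary above reports results "at an FDR of 5%" without naming the procedure. One common way to control the FDR is the Benjamini-Hochberg step-up procedure, sketched here as an assumption rather than as the method the authors used; `benjamini_hochberg` and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where a test is called significant."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is called significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Toy per-gene p-values (hypothetical).
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
toy_calls = benjamini_hochberg(toy_p_values, fdr=0.05)
n_differentially_expressed = sum(toy_calls)
```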
The knockdown efficiency for the 59 factors ranged
from 50% to 90% (based on qPCR; Table S1).
The qPCR measurements of the knockdown level were significantly
correlated with estimates of the TF expression levels
based on the microarray data (P = 0.001; Figure 1C).
Reassuringly, we did not observe a significant correlation between
the knockdown efficiency of a given factor and
the number of genes classified as differentially expressed following the knockdown.
Because we knocked down 59 different factors in this experiment
we were able to assess general patterns associated with the perturbation of transcription factors
beyond merely the number of affected target genes.
Globally, despite the range in the number of genes we identified as
differentially expressed in each knockdown,
the effect sizes of the differences in expression were relatively modest and
consistent in magnitude across all knockdowns.
The median effect size following the knockdown experiment for genes classified as
differentially expressed at an FDR of 5% in any knockdown was
a 9.2% difference in expression level between the controls and the knockdown (Figure 2),
while the median effect size for any individual knockdown experiment ranged between 8.1% and 11.0%.
The lack of correlation between knockdown efficiency and the number of differentially expressed genes held whether we estimated the knockdown effect based on qPCR (P = 0.10; Figure 1D) or microarray (P = 0.99; not shown) data.
Nor did we observe a correlation between
variance in qPCR-estimated knockdown efficiency (between replicates) and
the number of genes differentially expressed (P = 0.94; Figure 1E).
We noticed that the large variation in the number of differentially expressed genes
extended even to knockdowns of factors from the same gene family.
Figure 1. Differential expression analysis.
(a) Examples of differential expression analysis results for the genes HCST and IRF4. The top two panels are ‘MA plots’ of the mean Log2(expression level) between the knockdown arrays and the controls for each gene (x-axis) to the Log2(Fold-Change) between the knockdowns and controls (y-axis). Differentially expressed genes at an FDR of 5% are plotted in yellow (points 50% larger). The gene targeted by the siRNA is highlighted in red. The bottom two panels are ‘volcano plots’ of the Log2(Fold-Change) between the knockdowns and controls (x-axis) to the P-value for differential expression (y-axis). The dashed line marks the 5% FDR threshold. Differentially expressed genes at an FDR of 5% are plotted in yellow (points 50% larger). The red dot marks the gene targeted by the siRNA.
(b) Barplot of number of differentially expressed genes in each knockdown experiment.
(c) Comparison of the knockdown level measured by qPCR (RNA sample collected 48 hours posttransfection) and the knockdown level measured by microarray.
(d) Comparison of the level of knockdown of the transcription factor at 48 hrs (evaluated by qPCR; x-axis) and the number of genes differentially expressed in the knockdown experiment (y-axis). (e) Comparison of the variance in knockdown efficiency between replicates for each transcription factor (evaluated by qPCR; x-axis) and the number of differentially expressed genes in the knockdown experiment (y-axis).
Figure 2. Effect sizes for differentially expressed genes.
Boxplots of absolute Log2(fold-change) between knockdown arrays and control arrays for all genes identified as differentially expressed in each experiment. Outliers are not plotted. The gray bar indicates the interquartile range across all genes differentially expressed in all knockdowns. Boxplots are ordered by the number of genes differentially expressed in each experiment.
Knocking down SREBF2 (1,286 genes differentially expressed), a key regulator of cholesterol homeostasis,
results in changes in the expression of genes that are
significantly enriched for cholesterol and sterol biosynthesis annotations.
While not all factors exhibited striking enrichments for relevant functional categories and pathways,
the overall picture is that perturbations of many of the factors
primarily affected pathways consistent with their known biology.
In order to assess functional TF binding, we next incorporated
binding maps together with the knockdown expression data.
We combined binding data based on DNase-seq footprints in 70 HapMap LCLs, reported by Degner et al. (Table S5)
and from ChIP-seq experiments in LCL GM12878, published by ENCODE.
We were thus able to obtain genome wide binding maps for a total of 131 factors that were either
directly targeted by an siRNA in our experiment (29 factors) or were
differentially expressed in one of the knockdown experiments.
We classified a gene as a bound target of a particular factor when
binding of that factor was inferred within 10kb of the transcription start site (TSS) of the target gene.
Using this approach, we found that the 131 TFs were bound
in proximity to a median of 1,922 genes per factor (range 11 to 7,053 target genes).
We considered binding of a factor to be functional if the target gene
was differentially expressed after perturbing the expression level of the bound transcription factor.
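The two definitions above (a gene is "bound" when a binding event falls within 10 kb of its TSS, and binding is "functional" when the bound gene is also differentially expressed) can be sketched as follows. The gene names, coordinates, and function names are hypothetical, and the window is treated as inclusive, which is an assumption.

```python
WINDOW_BP = 10_000  # 10 kb around the transcription start site, as in the text


def bound_genes(tss_by_gene, binding_sites, window=WINDOW_BP):
    """Genes with at least one binding site within `window` bp of their TSS.

    tss_by_gene: {gene: (chromosome, tss_position)}
    binding_sites: list of (chromosome, position) for one factor
    """
    bound = set()
    for gene, (chrom, tss) in tss_by_gene.items():
        for site_chrom, site_pos in binding_sites:
            if site_chrom == chrom and abs(site_pos - tss) <= window:
                bound.add(gene)
                break  # one qualifying site is enough
    return bound


def functional_targets(bound, differentially_expressed):
    """Bound genes that also changed expression after the knockdown."""
    return set(bound) & set(differentially_expressed)


# Toy example (hypothetical genes and coordinates).
toy_tss = {
    "GENE_A": ("chr1", 100_000),
    "GENE_B": ("chr1", 500_000),
    "GENE_C": ("chr2", 100_000),
}
toy_sites = [("chr1", 105_000), ("chr1", 520_000), ("chr2", 90_000)]
toy_bound = bound_genes(toy_tss, toy_sites)
toy_functional = functional_targets(toy_bound, {"GENE_A", "GENE_B"})
```

In this toy case GENE_B is differentially expressed but has no binding site inside the window, so it is not counted as a functional target.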
We then asked about the concordance between
the transcription factor binding data and the knockdown expression data, namely
the extent to which differences in gene expression levels following the knockdowns
might be predicted by binding of the transcription factors
within the putative regulatory regions of the responsive genes, and also
what proportion of putative target (bound) genes of a given TF were
differentially expressed following the knockdown of the factor.
Focusing only on the binding sites classified using the DNase-seq data
(which were assigned to a specific instance of the binding motif, unlike the ChIP data),
we examined sequence features that might distinguish functional binding.
In particular, we asked whether binding at conserved sites was more likely to be functional and
whether binding sites that better matched the known PWM for the factor were more likely to be functional.
We did not observe a significant shift in the conservation of functional binding sites (Wilcoxon rank sum P = 0.34),
but we did observe that binding around differentially expressed genes occurred at sites
that were significantly better matches to the canonical binding motif.
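To make "a better match to the known PWM" concrete, here is a minimal sketch that scores a candidate site against a position weight matrix as a sum of log-odds relative to a uniform background. The three-position matrix, the uniform background of 0.25, and the function name `pwm_log_odds` are all hypothetical; real PWMs are longer, and the scoring scheme used in the paper is not described in this summary.

```python
import math


def pwm_log_odds(site, pwm, background=0.25):
    """Sum of per-position log2 odds of `site` under the PWM vs. the background."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    score = 0.0
    for position, base in enumerate(site):
        score += math.log2(pwm[position][base] / background)
    return score


# Toy PWM (hypothetical): each position strongly prefers one base.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
consensus_score = pwm_log_odds("AGC", toy_pwm)   # matches the preferred base everywhere
mismatch_score = pwm_log_odds("AGT", toy_pwm)    # one mismatch at the last position
```

A higher score means the site looks more like the canonical motif; the comparison reported above is between the scores of sites near differentially expressed genes and those near other bound genes.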
Figure 3. Intersecting binding data and expression data for each knockdown.
(a) Example Venn diagrams showing the overlap of binding and differential expression for the knockdowns of HCST and IRF4 (the same genes as in Figure 1).
(b) Boxplot summarizing the distribution of the fraction of all expressed genes that are bound by the targeted gene or downstream factors.
(c) Boxplot summarizing the distribution of the fraction of bound genes that are classified as differentially expressed, using an FDR of either 5% or 20%.
Considering bound targets determined from either the ChIP-seq or DNase-seq data, we observed that
differentially expressed genes were associated with both
a higher number of binding events for the relevant factors within 10 kb of the TSS (P < 10^-16; Figure 4A)
as well as with a larger number of different binding factors
(considering the siRNA-targeted factor and any TFs that were DE in the knockdown; P < 10^-16; Figure 4B).
Figure 4. Degree of binding correlated with function. Boxplots comparing
(a) the number of sites bound, and
(b) the number of differentially expressed transcription factors binding events near functionally or non-functionally bound genes. We considered binding for siRNA-targeted factor and any factor differentially expressed in the knockdown.
(c) Focusing only on genes differentially expressed in common between each pairwise set of knockdowns we tested for enrichments of functional binding (y-axis). Pairwise comparisons between knockdown experiments were binned by the fraction of differentially expressed transcription factors in common between the two experiments. For these boxplots, outliers were not plotted.
We examined the distribution of binding about the TSS. Most factor binding was concentrated
near the TSS whether or not the genes were classified as differentially expressed (Figure 5A).
However, the distance from the TSS to the binding sites was significantly longer for differentially expressed genes (P < 10^-16; Fig. 5B).
Figure 5. Distribution of functional binding about the TSS.
(a) A density plot of the distribution of bound sites within 10 kb of the TSS for both functional and non-functional genes. Inset is a zoom-in of the region +/-1 kb from the TSS. (b) Boxplots comparing the distances from the TSS to the binding sites for functionally bound genes and non-functionally bound genes. For the boxplots, 0.001 was added before log10 transforming the distances and outliers were not plotted.
We investigated the distribution of factor binding across various chromatin states, as defined by Ernst et al. This dataset lists
regions of the genome that have been assigned to different activity states
based on ChIP-seq data for various histone modifications and CTCF binding.
For each knockdown, we separated binding events
by the genomic state in which they occurred and then
tested whether binding in that state was enriched around differentially expressed genes.
After correcting for multiple testing, and considering binding around genes that were differentially expressed,
19 knockdowns showed significant enrichment for binding in “strong enhancers”,
four knockdowns had significant enrichments for “weak enhancers”,
eight knockdowns showed significant depletion of binding in “active promoters”, and
six knockdowns had significant depletions for “transcription elongation”.
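The direction of such an effect (enrichment versus depletion of binding in a chromatin state around differentially expressed genes) can be summarized with an odds ratio from a 2x2 table. This is only a sketch: the statistical test the authors applied is not named in this summary, the counts are invented, and the 0.5 added to each cell (the Haldane-Anscombe correction) is an assumption made here to avoid division by zero.

```python
def state_odds_ratio(in_state_de, in_state_not_de, other_de, other_not_de):
    """Odds ratio for binding in one chromatin state near DE vs. non-DE genes.

    in_state_de:      binding events in the state, near DE genes
    in_state_not_de:  binding events in the state, near non-DE genes
    other_de:         binding events in other states, near DE genes
    other_not_de:     binding events in other states, near non-DE genes
    """
    # Haldane-Anscombe correction: add 0.5 to every cell.
    a, b, c, d = (
        count + 0.5
        for count in (in_state_de, in_state_not_de, other_de, other_not_de)
    )
    return (a * d) / (b * c)


# Toy counts (hypothetical) for one knockdown and one chromatin state.
enriched = state_odds_ratio(30, 70, 10, 90)   # state over-represented near DE genes
depleted = state_odds_ratio(10, 90, 30, 70)   # state under-represented near DE genes
```

An odds ratio above 1 points to enrichment and below 1 to depletion; a significance test and multiple-testing correction would still be needed on top of this.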
Did the factors tend to have a consistent effect (either up- or down-regulation)
on the expression levels of genes they purportedly regulated?
All factors we tested are associated with both up- and down-regulation of downstream targets (Figure 6).
A slight majority of downstream target genes were expressed at higher levels
following the knockdown for 15 of the 29 factors for which we had binding information (Figure 6B).
The factor that is associated with the largest fraction (68.8%) of up-regulated target genes following the knockdown is EZH2,
the enzymatic component of the Polycomb group complex.
On the other end of the spectrum was JUND, a member of the AP-1 complex, for which
66.7% of differentially expressed targets were down-regulated following the knockdown.
Figure 6. Magnitude and direction of differential expression after knockdown.
(a) Density plot of all Log2(fold-changes) between the knockdown arrays and controls for genes that are differentially expressed at 5% FDR in one of the knockdown experiments as well as bound by the targeted transcription factor.
(b) Plot of the fraction of differentially expressed putative direct targets that were up-regulated in each of the knockdown experiments.
We found no correlation between the number of paralogs and the fraction of bound targets that were differentially expressed. We also did not observe a significant correlation when we considered whether
the percent identity of the closest paralog might be predictive of
the fraction of bound genes that were differentially expressed following the knockdown (Figure S8).
While there is compelling evidence for our inferences, the current chromatin functional annotations
do not fully explain the regulatory effects of the knockdown experiments.
For example, the enrichments for binding in “strong enhancer” regions of the genome range from 7.2% to 50.1% (median = 19.2%),
much beyond what is expected by chance alone, but far from accounting for all functional binding.
In addition to considering
the distinguishing characteristics of functional binding, we also examined
the direction of effect that perturbing a transcription factor had on the expression level of its direct targets.
We specifically addressed whether
knocking down a particular factor tended to drive expression of its putatively direct (namely, bound) targets up or down,
which can be used to infer that the factor represses or activates the target, respectively.
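The inference described here can be sketched in a few lines: compute the fraction of a factor's bound, differentially expressed targets that go up after the knockdown, and read a majority going up as repressor-like behavior and a majority going down as activator-like behavior. The function names, labels, and fold-change values are hypothetical.

```python
def fraction_upregulated(log2_fold_changes):
    """Fraction of targets with higher expression in knockdown than in control."""
    up = sum(1 for lfc in log2_fold_changes if lfc > 0)
    return up / len(log2_fold_changes)


def inferred_role(fraction_up):
    """Naive reading of the direction of effect for one factor."""
    if fraction_up > 0.5:
        # Targets rise when the factor is removed: the factor mostly represses them.
        return "repressor-like"
    if fraction_up < 0.5:
        # Targets fall when the factor is removed: the factor mostly activates them.
        return "activator-like"
    return "mixed"


# Toy log2(knockdown / control) values (hypothetical) for bound, DE targets.
toy_fraction = fraction_upregulated([0.4, 0.2, -0.1, 0.3])
toy_role = inferred_role(toy_fraction)
```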
Transcription factors have traditionally been thought of primarily as activators, and previous work from our group is consistent with that notion. Surprisingly, the most straightforward inference from the present study is that
many of the factors function as repressors at least as often as they function as activators.
EZH2 had a negative regulatory relationship with the largest fraction of direct targets (68.8%),
consistent with the known role of EZH2 as the active member of the Polycomb group complex PRC2,
while JUND seemed to have a positive regulatory relationship with the largest fraction of direct targets (66.7%),
consistent with the biochemical characterization of the AP-1 complex (of which JUND is a component) as a transactivator.
More generally, however, our results, combined with the previous work from our group and others, make for a complicated view
of the role of transcription factors in gene regulation as
it seems difficult to reconcile the inference from previous work that
many transcription factors should primarily act as activators with the results presented here.
One somewhat complicated hypothesis, which nevertheless can resolve the apparent discrepancy, is that
the “repressive” effects we observe for known activators may be
at sites in which the activator is acting as a weak enhancer of transcription and
that reducing the cellular concentration of the factor
releases the regulatory region to binding by an alternative, stronger activator.
To more explicitly address the effect that our proximity-based definition of target genes might have on our analyses, we reanalyzed
the overlap between factor binding and differential expression following the knockdowns
using an independent, empirically determined set of target genes.
Thurman et al. used correlations in DNase hypersensitivity between
intergenic hypersensitive sites and promoter hypersensitive sites across diverse tissues
to assign intergenic regulatory regions to specific genes,
independently of proximity to a particular promoter.
We performed this alternative analysis in which we
assigned binding events to genes based on the classification of Thurman et al.
We then considered the overlap between binding and differential expression in this new data set. The results were largely
consistent with our proximity-based observations.
A median of 9.5% of genes that were bound by a factor were
also differentially expressed following the knockdown of that factor
(compared to 11.1% when the assignment of binding sites to genes is based on proximity).
From the opposite perspective, a median of 28.0% of differentially expressed genes were bound by that factor
(compared to 32.3% for the proximity based definition). The results of this analysis are summarized in Table S7.
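The two perspectives reported above (the share of bound genes that are differentially expressed, and the share of differentially expressed genes that are bound) are computed per knockdown and then summarized by the median. A minimal sketch with invented gene sets follows; the factor labels and the function name `overlap_fractions` are hypothetical, and empty sets are not handled.

```python
from statistics import median


def overlap_fractions(bound, differentially_expressed):
    """Return (fraction of bound genes that are DE, fraction of DE genes that are bound)."""
    bound = set(bound)
    de = set(differentially_expressed)
    shared = bound & de
    return len(shared) / len(bound), len(shared) / len(de)


# Toy knockdowns (hypothetical): factor -> (bound genes, DE genes), genes as integer IDs.
toy_knockdowns = {
    "TF_1": (set(range(10)), {0, 10, 11}),
    "TF_2": (set(range(20)), {0, 1, 2, 30}),
    "TF_3": (set(range(5)), {0, 9}),
}
per_factor = {
    factor: overlap_fractions(bound, de)
    for factor, (bound, de) in toy_knockdowns.items()
}
median_bound_that_are_de = median(f[0] for f in per_factor.values())
median_de_that_are_bound = median(f[1] for f in per_factor.values())
```

The same calculation applies whether binding sites are assigned to genes by proximity or by an external assignment such as that of Thurman et al.; only the `bound` sets change.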
Our results should not be considered a comprehensive census of regulatory events in the human genome. Instead, we adopted a gene-centric approach,
focusing only on binding events near the genes for which we could measure expression
to learn some of the principles of functional transcription factor binding.
In light of our observations a reassessment of our estimates of binding may be warranted. In particular, because functional binding is skewed away from promoters (our system is apparently not well-suited to observe functional promoter binding, perhaps because of protection by large protein complexes),
a more conservative estimate of the fraction of binding that is indeed functional would not consider data within the promoter.
Importantly, excluding the putative promoter region from our analysis (i.e. only considering a window >1 kb from the TSS and <10 kb from the TSS)
does not change our conclusions.
Considering this smaller window,
a median of 67.0% of expressed genes are still classified as bound by
either the knocked down transcription factor or
a downstream factor that is differentially expressed in each experiment,
yet a median of only 8.1% of the bound genes are
also differentially expressed after the knockdowns.
Much of what distinguishes functional binding (as we define it) has yet to be explained. We are unable to explain much of the differential expression observed in our experiments by the presence of at least one relevant binding event. This may not be altogether surprising, as
we are only considering binding in a limited window around the transcription start site.
To address these issues, more factors should be perturbed to further evaluate the robustness of our results and to add insight. Together, such studies will help us develop a more sophisticated understanding of functional transcription factor binding in particular, and of gene regulatory logic more generally.
Assessing quality and completeness of human transcriptional regulatory pathways on a genome-wide scale
Recently the biological pathways have become a common and probably the most popular form of representing biochemical information for hypothesis generation and validation. These maps store wide knowledge of complex molecular interactions and regulations occurring in the living organism in a simple and obvious way, often using intuitive graphical notation. Two major types of biological pathways could be distinguished.
Metabolic pathways incorporate complex networks of protein-based interactions and modifications, while
signal transduction and transcriptional regulatory pathways are usually considered to provide information on mechanisms of transcription.
While there are a lot of data collected on human metabolic processes,
the content of signal transduction and transcriptional regulatory pathways varies greatly in quality and completeness.
An indicative comparison of MYC transcriptional targets reported in ten different pathway databases reveals that these databases differ greatly from each other (Figure 1). Given that MYC is involved
in the transcriptional regulation of approximately 15% of all genes,
one cannot argue that the majority of pathway databases that contain
less than thirty putative transcriptional targets of MYC are even close to complete.
More importantly, to date there have been no prior genome-wide evaluation studies (that are based on genome-wide binding and gene expression assays) assessing pathway databases.
Background: While pathway databases are becoming increasingly important in most types of biological and translational research, little is known about the quality and completeness of pathways stored in these databases. The present study conducts a comprehensive assessment of transcriptional regulatory pathways in humans for seven well-studied transcription factors:
MYC,
NOTCH1,
BCL6,
TP53,
AR,
STAT1,
RELA.
The employed benchmarking methodology first involves integrating
genome-wide binding with functional gene expression data
to derive direct targets of transcription factors.
Then the lists of experimentally obtained direct targets
are compared with relevant lists of transcriptional targets from 10 commonly used pathway databases.
Results: The results of this study show that for the majority of pathway databases,
the overlap between experimentally obtained target genes and
targets reported in transcriptional regulatory pathway databases is
surprisingly small and often is not statistically significant.
The only exception is the MetaCore pathway database, which
yields a statistically significant intersection with experimental results in 84% of cases.
The lists of experimentally derived direct targets obtained in this study can be used
to reveal new biological insight in transcriptional regulation, and we
suggest novel putative therapeutic targets in cancer.
Conclusions: Our study opens a debate on the validity of using many popular pathway databases to obtain transcriptional regulatory targets. We conclude that the choice of pathway databases should be informed by
solid scientific evidence and rigorous empirical evaluation.
In the current study we perform
(1) an evaluation of ten commonly used pathway databases,
assessing the transcriptional regulatory pathways, considered in the current study as
the interactions of the type ‘transcription factor-transcriptional targets’.
This involves integration of human genome wide functional microarray or RNA-seq gene expression data with
protein-DNA binding data from ChIP-chip, ChIP-seq, or ChIP-PET platforms
to find direct transcriptional targets of the seven well known transcription factors:
MYC, NOTCH1, BCL6, TP53, AR, STAT1, and RELA.
The choice of transcription factors is based on their important role in oncogenesis and availability of binding and expression data in the public domain.
(2) the lists of experimentally derived direct targets are used to assess the quality and completeness of 84 transcriptional regulatory pathways from four publicly available (BioCarta, KEGG, WikiPathways and Cell Signaling Technology) and six commercial (MetaCore, Ingenuity Pathway Analysis, BKL TRANSPATH, BKL TRANSFAC, Pathway Studio and GeneSpring Pathways) pathway databases.
(3) We measure the overlap between pathways and experimentally obtained target genes and assess statistical significance of this overlap, and we demonstrate that experimentally derived lists of direct transcriptional targets
can be used to reveal new biological insight on transcriptional regulation.
We show this by analyzing common direct transcriptional targets of
MYC, NOTCH1 and RELA
that act in interconnected molecular pathways.
Detection of such genes is important as it could reveal novel targets of cancer therapy.
Figure 1 Number of genes in common between MYC transcriptional targets derived from ten different pathway databases. Cells are colored according to their values from white (low values) to red (high values). (not shown)
Statistical methodology for comparison
Figure 2 Illustration of statistical methodology for comparison between a gold-standard and a pathway database
Since we are seeking to compare gene sets from different studies/databases, it is essential to transform genes to standard identifiers. That is why we transformed all gene sets to the HUGO Gene Nomenclature Committee approved gene symbols and names. In order to assess the statistical significance of the overlap between the resulting gene sets, we used the hypergeometric test at the 5% α-level with false discovery rate correction for multiple comparisons by the method of Benjamini and Yekutieli. The alternative hypothesis of this test is that two sets of genes (set A from a pathway database and set B from experiments) have a greater number of genes in common than two randomly selected gene sets with the same number of genes as in sets A and B. For example, consider that for some transcription factor there are 300 direct targets in pathway database #1 and 700 in the experimentally derived list (gold-standard), and their intersection is 16 genes (Figure 2a). If we select at random from a total of 20,000 genes two sets with 300 and 700 genes each, their overlap would be greater than or equal to 16 genes 6.34% of the time. Thus, this overlap will not be statistically significant at the 5% α-level (p = 0.0634). On the other hand, consider that for pathway database #2 there are 30 direct targets of that transcription factor, and their intersection with the 700-gene gold-standard is only 6 genes. Even though the size of this intersection is rather small, it is unlikely that 30 randomly selected genes (out of 20,000) would have an overlap greater than or equal to 6 genes with a 700-gene gold-standard (p = 0.0005, see Figure 2a). This overlap is statistically significant at the 5% α-level.
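The two worked examples above can be reproduced with a short script. This is a minimal sketch, not the authors' code: the function names `overlap_p_value` and `benjamini_yekutieli` are hypothetical, the upper-tail hypergeometric probability is computed directly from binomial coefficients, and the correction step assumes the standard Benjamini-Yekutieli adjusted p-value formula. The article reports p = 0.0634 and p = 0.0005 for the two cases.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, observed):
    """P(overlap >= observed) for two random gene sets of size_a and size_b."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    # Count the draws of size_b that share exactly k genes with set A,
    # for every k from the observed overlap upwards, then divide once.
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


def benjamini_yekutieli(pvalues):
    """Adjusted p-values under the Benjamini-Yekutieli FDR procedure."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic-sum penalty
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Database #1: 300 targets, 700-gene gold-standard, 16 genes in common.
p_db1 = overlap_p_value(20_000, 300, 700, 16)
# Database #2: 30 targets, 700-gene gold-standard, 6 genes in common.
p_db2 = overlap_p_value(20_000, 30, 700, 6)
adjusted = benjamini_yekutieli([p_db1, p_db2])
print(p_db1, p_db2, adjusted)
```

In the study the correction would be applied across all database and transcription factor comparisons at once; the two-element list here is only for illustration.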
We also calculate an enrichment fold change ratio (EFC) for every intersection between a gold-standard and a pathway database. For a given pair of a gold-standard and a pathway database, EFC is equal to the observed number of genes in their intersection, divided by the expected size of the intersection under the null hypothesis (plus machine epsilon, to avoid division by zero). Notice, however, that larger values of EFC may correspond to databases that are highly incomplete and contain only a few relations. For example, consider that for some transcription factor there are 300 direct targets in pathway database #1 and 50 in the experimentally derived list (gold-standard), and their intersection is 30 genes (Figure 2b). If we select at random from a total of 20,000 genes two sets with 300 and 50 genes each, their expected overlap under the null hypothesis will be equal to 0.75. Thus, the EFC ratio will be equal to 40 (= 30/0.75). On the other hand, consider that for pathway database #2 there are 2 direct targets of that transcription factor, and their intersection with the 50-gene gold-standard is only 1 gene. Even though the expected overlap under the null hypothesis will be equal to 0.005 and the EFC equal to 200 (5 times bigger than for database #1), the size of this intersection with the gold-standard is 30 times smaller than for database #1 (Figure 2b).
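The EFC arithmetic in this example can be checked in a few lines. This is a sketch under the definition given above; the function name `enrichment_fold_change` is hypothetical, and the expected overlap is taken as the mean of the hypergeometric distribution (database size times gold-standard size, divided by the number of genes in the universe).

```python
import sys


def enrichment_fold_change(observed, size_db, size_gold, universe):
    """Observed overlap divided by the overlap expected by chance."""
    expected = size_db * size_gold / universe
    # Machine epsilon guards against division by zero for empty sets.
    return observed / (expected + sys.float_info.epsilon)


# Database #1: 300 targets, 50-gene gold-standard, 30 genes in common.
efc_db1 = enrichment_fold_change(30, 300, 50, 20_000)
# Database #2: 2 targets, 50-gene gold-standard, 1 gene in common.
efc_db2 = enrichment_fold_change(1, 2, 50, 20_000)
print(round(efc_db1, 6), round(efc_db2, 6))  # 40.0 200.0
```

The second database scores five times higher while recovering a single gene, which is why EFC is best read alongside the raw overlap counts rather than on its own.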
Figure 3 Comparison between different pathway databases and experimentally derived gold-standards for all considered transcription factors. The value in a given cell is the number of overlapping genes between a gold-standard and a pathway-derived gene set. Cells are colored according to their values from white (low values) to red (high values). Underlined values in red represent statistically significant intersections. (not shown)
Figure 4 Summary of the pathway databases assessment. Green cells represent statistically significant intersections between experimentally derived gold-standards and transcriptional regulatory pathways. White cells denote results that are not statistically significant. Numbers are the enrichment fold change ratios (EFC) calculated for each intersection. (not shown)
At the core of this study was creation of gold-standards of transcriptional regulation in humans that can be compared with target genes reported in transcriptional regulatory pathways. We focused on seven well known transcription factors and obtained gold-standards
by integrating genome-wide transcription factor-DNA binding data (from ChIP-chip, ChIP-seq, or ChIP-PET platforms)
with functional gene expression microarray and RNA-seq data.
The latter data make it possible to survey changes in the transcriptome on a genome-wide scale
after the inhibition or over-expression of the transcription factor in question.
However, a change in the expression of a particular gene could be caused either by the direct effect of the removal or introduction of a given transcription factor or by an indirect effect, through a change in the expression level of some other gene(s). It is therefore essential
to integrate data from these two sources to
obtain an accurate list of gene targets that are directly regulated by a transcription factor.
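The integration step described here amounts to a set intersection: a gene counts as a direct target only if it is both bound by the transcription factor and changes expression when the factor is perturbed. The sketch below illustrates that logic only. The function name `classify_targets` is hypothetical, the gene labels are placeholders rather than real results, and the thresholds and peak-to-gene assignment used in the actual study are not modelled.

```python
def classify_targets(bound_genes, responsive_genes):
    """Split responsive genes into direct and indirect targets.

    bound_genes: genes with a binding event (ChIP-chip, ChIP-seq or ChIP-PET).
    responsive_genes: genes whose expression changes after the factor is
    inhibited or over-expressed (microarray or RNA-seq).
    """
    bound = set(bound_genes)
    responsive = set(responsive_genes)
    return {
        "direct": bound & responsive,      # bound and responsive
        "indirect": responsive - bound,    # responsive without binding
        "bound_only": bound - responsive,  # binding without expression change
    }


# Placeholder identifiers, not genes from the study.
bound = {"GENE_A", "GENE_B", "GENE_C"}
responsive = {"GENE_B", "GENE_C", "GENE_D"}
result = classify_targets(bound, responsive)
print(sorted(result["direct"]))  # ['GENE_B', 'GENE_C']
```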
It is worth noting that the tested pathway databases typically do not distinguish between cell lines, experimental conditions, and other details relevant to the experimental systems in which data were obtained. These databases in a sense propose a ‘universal’ list of transcriptional targets. However, it is known that
transcriptional regulation in a cell is dynamic and works differently for different systems and stimuli.
This accentuates the major limitation of pathway databases and emphasizes
the importance of deriving a specific list of transcriptional targets for the current experimental system.
In this study we followed the latter approach by developing gold-standards for specific well-characterized biological systems and experimental conditions.
The approach used here for building gold-standards of direct mechanistic knowledge has several limitations (see article). Nevertheless, our results suggest that multiple transcription factors can co-operate and control both physiological differentiation and malignant transformation, as demonstrated utilizing combinatorial gene-profiling for
NOTCH1, MYC and RELA targets.
These studies might lead us to multi-pathway gene expression “signatures”
essential for the prediction of genes that could be targeted in cancer treatments.
In agreement with this hypothesis, several of the genes identified in our analysis have been suggested to be putative therapeutic targets in leukemia, with either preclinical or clinical trials underway (CDK4, CDK6, GSK3b, MYC, LCK, NFkB2, BCL2L1, NOTCH1).
Single-molecule tracking in live cells reveals distinct target-search strategies of transcription factors in the nucleus
I Izeddin†, V Récamier†‡, L Bosanac, II Cissé, L Boudarene, et al.
1Functional Imaging of Transcription, Institut de Biologie de l’Ecole Normale Supérieure (IBENS), Inserm, and CNRS UMR; 2Laboratoire Kastler Brossel, CNRS UMR, Departement de Physique et Institut de Biologie
de l’Ecole Normale Supérieure (IBENS), Paris, Fr; 3Transcription Imaging Consortium, Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, US; + more.
Biophysics and structural biology | Cell biology eLife 2014;3:e02230. http://dx.doi.org/10.7554/eLife.02230
Transcription factors are
proteins that control the expression of genes in the nucleus, and
they do this by binding to other proteins or DNA.
First, however, these regulatory proteins need to overcome the challenge of
finding their targets in the nucleus, which is crowded with other proteins and DNA.
Much research to date has focused on measuring how fast proteins can diffuse and spread out throughout the nucleus. However, these measurements only make sense if these proteins have access to the same space within the nucleus.
Now, Izeddin, Récamier et al. have developed a new technique to track
single protein molecules in the nucleus of mammalian cells.
A transcription factor called c-Myc and another protein called P-TEFb
were tracked and while they diffused at similar rates,
they ‘explored’ the space inside the nucleus in very different ways.
Izeddin, Récamier et al. found that c-Myc explores the nucleus in a so-called ‘non-compact’ manner: this means that it
can move almost everywhere inside the nucleus, and has an equal chance
of reaching any target regardless of its position in this space.
P-TEFb, on the other hand, searches
the nucleus in a ‘compact’ way.
This means that it is constrained to follow a specific path
through the nucleus and is therefore guided to its potential targets.
Izeddin, Récamier et al. explain that
the different ‘search strategies’ used by these two proteins
influence how long it takes them to find their targets and
how far they can travel in a given time.
These findings, together with information about
where and when different proteins interact in the nucleus,
will be essential to understand how the organization of the genome within the nucleus
can control the expression of genes.
The next challenge will now be to
uncover what determines a
protein’s search strategy in the nucleus, as well as
the potential ways that this strategy might be regulated.
Mueller et al., 2010; Normanno et al., 2012). These transient interactions are essential to ensure a fine regulation of binding site occupancy—by competition or by altering the TF concentration—but must also be persistent enough to enable the assembly of multicomponent complexes (Dundr, 2002; Darzacq and Singer, 2008; Gorski et al., 2008; Cisse et al., 2013).
In parallel to the experimental evidence of the fast diffusive motion of nuclear factors, our understanding of the intranuclear space has evolved from a homogeneous environment to an organelle where the spatial arrangement among genes and regulatory sequences plays an important role in transcriptional control (Heard and Bickmore, 2007). The nucleus of eukaryotes displays a hierarchy of organized structures (Gibcus and Dekker, 2013) and is often referred to as a crowded environment.
How crowding influences transport properties of macromolecules and organelles in the cell is a fundamental question in quantitative molecular biology. While a restriction of the available space for diffusion can slow down transport processes, it can also channel molecules towards their targets increasing their chance to meet interacting partners. A widespread observation in quantitative cell biology is that the diffusion of molecules is anomalous, often attributed to crowding in the nucleoplasm, cytoplasm, or in the membranes of the cell (Höfling and Franosch, 2013). An open debate remains on how to determine whether diffusion is anomalous or normal (Malchus and Weiss, 2009; Saxton, 2012), and the mechanisms behind anomalous diffusion (Saxton, 2007). The answer to these questions bears important consequences for the understanding of the biochemical reactions of the cell.
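One conventional way to ask whether diffusion is anomalous is to fit the mean squared displacement (MSD) of a trajectory to a power law in the time lag and read off the exponent: a value near 1 indicates normal diffusion, while a value below 1 indicates subdiffusion. The sketch below applies that criterion to a simulated one-dimensional Brownian track. It is an illustration of the general idea and not the analysis used in the paper; the function names are hypothetical, and as the text notes, deciding between anomalous and normal diffusion from real data is itself debated.

```python
import math
import random


def time_averaged_msd(track, lag):
    """Mean squared displacement of a 1D track at a given lag (in frames)."""
    diffs = [(track[i + lag] - track[i]) ** 2 for i in range(len(track) - lag)]
    return sum(diffs) / len(diffs)


def anomalous_exponent(track, max_lag=10):
    """Least-squares slope of log(MSD) against log(lag)."""
    lags = range(1, max_lag + 1)
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(time_averaged_msd(track, lag)) for lag in lags]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Simulated free diffusion: independent Gaussian steps, fixed seed.
rng = random.Random(0)
position = 0.0
track = [position]
for _ in range(20_000):
    position += rng.gauss(0.0, 1.0)
    track.append(position)

alpha = anomalous_exponent(track)
print(alpha)  # expected to be close to 1 for free diffusion
```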
The problem of diffusing molecules in non-homogeneous media has been investigated in different fields. Following the seminal work of de Gennes (1982a, 1982b) in polymer physics, the study of diffusivity of particles and their reactivity has been generalized to random or disordered media (Kopelman, 1986; Lindenberg et al., 1991). These works have set a framework to interpret the mobility of macromolecular complexes in the cell, and recently in terms of kinetics of biochemical reactions (Condamin et al., 2007). Experimental evidence has also been found, showing the influence of the glass-like properties of the bacterial cytoplasm on the molecular dynamics of intracellular processes (Parry et al., 2014). These studies demonstrate that the geometry of the medium in which diffusion takes place has important repercussions for the search kinetics of molecules. The notion of compact and non-compact exploration was introduced by de Gennes (1982a) in the context of dense polymers and describes two fundamental types of diffusive behavior. While a non-compact explorer leaves a significant number of available sites unvisited, a compact explorer performs a redundant exploration of the space. In chemistry, the influence of compactness is well established to describe dimensional effects on reaction rates (Kopelman, 1986).
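A toy lattice model makes the distinction concrete. A simple random walk on a one-dimensional lattice is recurrent and keeps revisiting the same sites (compact exploration), whereas the same walk on a three-dimensional lattice mostly lands on new sites and leaves much of its neighbourhood unvisited (non-compact exploration). The sketch below counts the fraction of steps that reach a previously unvisited site. It is only an illustration of the concept: the function name is hypothetical, and the lattices stand in for the reduced-dimensionality and three-dimensional spaces discussed in the text, not for the nuclear geometry measured in the study.

```python
import random


def new_site_fraction(dimensions, steps, rng):
    """Fraction of steps that land on a lattice site not visited before."""
    position = (0,) * dimensions
    visited = {position}
    for _ in range(steps):
        # Move one lattice unit along a randomly chosen axis.
        axis = rng.randrange(dimensions)
        delta = rng.choice((-1, 1))
        position = tuple(
            c + delta if i == axis else c for i, c in enumerate(position)
        )
        visited.add(position)
    return (len(visited) - 1) / steps


rng = random.Random(1)
compact = new_site_fraction(1, 5_000, rng)      # 1D: redundant exploration
non_compact = new_site_fraction(3, 5_000, rng)  # 3D: mostly new sites
print(compact, non_compact)
```

A low fraction means the walker spends most of its time re-sampling sites it has already seen, which is the hallmark of compact exploration.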
In this study, we aim to elucidate the existence of different types of mobility of TFs in the eukaryotic nucleus, as well as the principles governing nuclear exploration of factors relevant to transcriptional control. To this end, we used single-molecule (SM) imaging to address the relationship between the nuclear geometry and the search dynamics of two nuclear factors having distinct functional roles: the proto-oncogene c-Myc and the positive transcription elongation factor (P-TEFb). c-Myc is a basic helix-loop-helix DNA-binding transcription factor that binds to E-Boxes; 18,000 E-boxes are found in the genome, and c-Myc affects the transcription of numerous genes (Gallant and Steiger, 2009).
Recently, c-Myc has been demonstrated to be a general transcriptional activator upregulating transcription of nearly all genes (Lin et al., 2012; Nie et al., 2012). P-TEFb is an essential actor in the transcription regulation driven by RNA Polymerase II. P-TEFb is a cyclin-dependent kinase, comprising a CDK9 and a Cyclin T subunit. It phosphorylates the elongation control factors SPT5 and NELF to allow productive elongation of class II gene transcription (Wada et al., 1998). The carboxy-terminal domain (CTD) of the catalytic subunit RPB1 of polymerase II is also a major target of P-TEFb (Zhou et al., 2012). c-Myc and P-TEFb are therefore two good examples of transcriptional regulators binding to numerous sites in the nucleus; the latter binds to the transcription machinery itself and the former directly to DNA.
Single particle tracking (SPT) constitutes a powerful method to probe the mobility of molecules in living cells (Lord et al., 2010). In the nucleus, SPT was first employed to investigate the dynamics of mRNAs (Fusco et al., 2003; Shav-Tal et al., 2004) or for rheological measurements of the nucleoplasm using inert probes (Bancaud et al., 2009). Recently, the tracking of single nuclear factors has been facilitated by the advent of efficient in situ tagging methods such as Halo tags (Mazza et al., 2012). An alternative approach takes advantage of photoconvertible tags (Lippincott-Schwartz and Patterson, 2009) and photoactivated localization microscopy (PALM) (Betzig et al., 2006; Hess et al., 2006). Single particle tracking PALM (sptPALM) was first used to achieve high-density diffusion maps of membrane proteins (Manley et al., 2008). However, sptPALM experiments have typically been limited to proteins with slow mobility (Manley et al., 2008) or those that undergo restricted motions (Frost et al., 2010; English et al., 2011).
Recently, by inclusion of light-sheet illumination, it has been used to determine the binding characteristics of TFs to DNA (Gebhardt et al., 2013). In this study, we developed a new sptPALM procedure adapted for the recording of individual proteins rapidly diffusing in the nucleus of mammalian cells. We used the photoconvertible fluorophore Dendra2 (Gurskaya et al., 2006) and took advantage of tilted illumination (Tokunaga et al., 2008). A careful control of the photoconversion rate minimized the background signal due to out-of-focus activated molecules, and we could thus follow the motion of individual proteins freely diffusing within the nuclear volume. With this sptPALM technique, we recorded large data sets (on the order of 10^4 single translocations in a single imaging session), which were essential for a proper statistical analysis of the search dynamics.
We applied our technique to several nuclear proteins and found that diffusing factors do not sense a unique nucleoplasmic architecture: c-Myc and P-TEFb adopt different nuclear space-exploration strategies, which drastically change the way they reach their specific targets. The differences observed between the two factors were not due to their diffusive kinetic parameters but to the geometry of their exploration path. c-Myc and our control protein, ‘free’ Dendra2, showed free diffusion in a three-dimensional nuclear space. In contrast, P-TEFb explored the nuclear volume by sampling a space of reduced dimensionality, displaying characteristics of exploration constrained in fractal structures.
The role of the space-sampling mode in the search strategy has long been discussed from a theoretical point of view (de Gennes, 1982a; Kopelman, 1986; Lindenberg et al., 1991). Our experimental results support the notion that it could indeed be a key parameter for diffusion-limited chemical reactions in the closed environment of the nucleus (Bénichou et al., 2010). We discuss the implications of our observations in terms of gene expression control, and its relation to the spatial organization of genes within the nucleus.