Posts Tagged ‘scientific validation’

Mozilla Science Lab Promotes Data Reproduction Through Open Access: Report from 9/10/2015 Online Meeting

Reporter: Stephen J. Williams, Ph.D.

Mozilla Inc. is developing a platform for scientists to discuss how to build a framework for sharing scientific data, and to tackle the problems of scientific reproducibility in an Open Access manner. According to their blog:

We’re excited to announce the launch of the Mozilla Science Lab, a new initiative that will help researchers around the world use the open web to shape science’s future.

Scientists created the web — but the open web still hasn’t transformed scientific practice to the same extent we’ve seen in other areas like media, education and business. For all of the incredible discoveries of the last century, science is still largely rooted in the “analog” age. Credit systems in science are still largely based around “papers,” for example, and as a result researchers are often discouraged from sharing, learning, reusing, and adopting the type of open and collaborative learning that the web makes possible.

The Science Lab will foster dialog between the open web community and researchers to tackle this challenge. Together they’ll share ideas, tools, and best practices for using next-generation web solutions to solve real problems in science, and explore ways to make research more agile and collaborative.

On their blog they highlight various projects related to promoting Open Access for scientific data.

On September 10, 2015, Mozilla Science Lab held its scheduled meeting on scientific data reproducibility. The meeting was free and was covered online and on social media. The Twitter hashtag for updates and meeting discussion is #mozscience.

Open Access Meeting Announcement on Twitter



Mozilla Science Lab @MozillaScience

Join @khinsen @abbycabs + @EvoMRI tmrw (11AM ET) to hear about replication, publishing + #openscience. Details: …


  • Mozilla Science Lab Updates
  • Staff welcomes and thank yous:
  • Welcoming Zannah Marsh, our first Instructional Designer
  • Workshopping the “Working Open” guide:
    • Discussion of Future foundation and GitHub projects
    • Discussion of submission for open science project funding
  • Contributorship Badges Pilot – an update! – Abby Cabunoc Mayes – @abbycabs
  • Will be live on GigaScience September 17th!
  • Where you can jump in:
  • Questions regarding coding projects – Abby will coordinate efforts on coding into their codebase
  • Authors and reviewers get a badge, and their efforts and comments will appear on GigaScience. GigaScience will give credit for your reviews, which supports an Open Science discussion

Roadmap for

  • Fellows review is in full swing!
  • MozFest update:
  • Miss the submission deadline? You can still apply to join our Open Research Accelerator and join us for the event (PLUS get a DOI for your submission and 1:1 help)

Konrad Hinsen (@khinsen) presented a discussion of ReScience, a journal focused on scientific replication:

  • ReScience – a new journal for replications – Konrad Hinsen @khinsen
  • ReScience is dedicated to publishing replications of previously published computational studies, along with all the code required to replicate the results.
  • ReScience lives entirely on GitHub. Submissions take the form of a Git repository, and review takes place in the open through GitHub issues. This also means that ReScience is free for everyone (authors, readers, reviewers, editors… well, I said everyone, right?), as long as GitHub is willing to host it.
  • ReScience was launched just a few days ago and is evolving quickly. To stay up to date, follow @ReScienceEds on Twitter. If you want to volunteer as a reviewer, please contact the editorial board.

The ReScience Journal: Reproducible Science is Good. Replicated Science is better.

ReScience is a peer-reviewed journal that targets computational research and encourages the explicit reproduction of already published research, promoting new and open-source implementations in order to ensure the original research is reproducible. To achieve such a goal, the whole editing chain is radically different from any other traditional scientific journal. ReScience lives on GitHub, where each new implementation is made available together with the comments, explanations and tests. Each submission takes the form of a pull request that is publicly reviewed and tested in order to guarantee any researcher can re-use it. If you ever reproduced a computational result from the literature, ReScience is the perfect place to publish this new implementation.

– The Editorial Board

Notes from his talk:

– must be able to replicate paper’s results as written according to experimental methods

– All authors on ReScience need to be on GitHub

– not accepting MATLAB replications; replication can involve computational replication

  • Research Ideas and Outcomes Journal – Daniel Mietchen @EvoMRI
    • Postdoc at the Natural History Museum, London, doing data mining; it is a huge waste that 90% of research proposals don’t get used, so this journal allows for publishing proposals
    • Learned how to write proposals by finding a proposal online open access
    • Reviewing system based on online reviews like GoogleDocs where people view, comment
    • Growing editorial and advisory board; venturing into new subject areas like humanities, economics, biological research so they are trying to link diverse areas under SOCIAL IMPACT labeling
    • BIG question how to get scientists to publish their proposals especially to improve efficiency of collaboration and reduce too many duplicated efforts as well as reagent sharing
    • Crowdfunding platform used as a post-publication funding mechanism; still in the works
    • They need a lot of help on the editorial board, so if you have a PhD PLEASE JOIN
  • Website:
  • Background:
  • Science article:
  • Some key features:
  • for publishing all steps of the research cycle, from proposals (funded and not yet funded) onwards
  • maps submissions to societal challenges
  • focus on post-publication peer review; pre-submission endorsement; all reviews public
  • lets authors choose which publishing services they want, e.g. whether they’d like journal-mediated peer review
  • collaborative WYSIWYG authoring and publishing platform based on JATS XML

A brief discussion of upcoming events on @MozillaScience

Meetings are held 2nd Thursdays of each month

Additional plugins, coding, and new publishing formats are available at

Other related articles on Open Access publishing that were published in this Open Access Online Scientific Journal include the following:

Archives of Medicine (AOM) to Publish from “Leaders in Pharmaceutical Business Intelligence (LPBI)” Open Access On-Line Scientific Journal

Annual Growth in NIH Clicks: 32% Open Access Online Scientific Journal

Collaborations and Open Access Innovations – CHI, BioIT World, 4/29 – 5/1/2014, Seaport World Trade Center, Boston

Elsevier’s Mendeley and – How We Distribute Scientific Research: A Case in Advocacy for Open Access Journals

Reconstructed Science Communication for Open Access Online Scientific Curation

The Fatal Self Distraction of the Academic Publishing Industry: The Solution of the Open Access Online Scientific Journals




Transcription Modulation

Author and Curator: Larry H. Bernstein, MD, FCAP 


This portion of the transcription series deals with transcription factors and the effects of their binding on metabolism. This also has implications for pharmaceutical target identification.

The Functional Consequences of Variation in Transcription Factor Binding
DA. Cusanovich, B Pavlovic, JK. Pritchard*, Y Gilad*
1 Department of Human Genetics, 2 Howard Hughes Medical Institute, University of Chicago, Chicago, IL 3 Departments of Genetics and Biology and Howard Hughes Medical Institute, Stanford University, Stanford, CA.
PLoS Genet 2014;10(3):e1004226.

One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly

  • play an important role in determining gene expression outputs, yet
  • the regulatory logic underlying functional transcription factor binding is poorly understood.

An important question in genomics is to understand how a class of proteins called “transcription factors” controls the expression level of other genes in the genome in a cell type-specific manner – a process that is essential to human development. One major approach to this problem is to study where these transcription factors bind in the genome, but this does not tell us about the effect of that binding on gene expression levels and

  • it is generally accepted that much of the binding does not strongly influence gene expression.

To address this issue, we artificially reduced the concentration of 59 different transcription factors in the cell and then

  • examined which genes were impacted by the reduced transcription factor level.

Our results implicate some attributes

  • that might influence what binding is functional, but they also suggest that
  • a simple model of functional vs. non-functional binding may not suffice.

Many studies have focused on characterizing the genomic locations of TF binding, but

  • it is unclear whether TF binding at any specific locus has
  • functional consequences with respect to gene expression output.

We knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line

  • to evaluate the context of functional TF binding.

We then identified genes whose expression was affected by the knockdowns

  • by intersecting the gene expression data with transcription factor binding data
    (based on ChIP-seq and DNase-seq)
  • within 10 kb of the transcription start sites of expressed genes.

This combination of data allowed us to infer functional TF binding.
Only a small subset of genes bound by a factor were

  • differentially expressed following the knockdown of that factor,
  • suggesting that most interactions between TF and chromatin
  • do not result in measurable changes in gene expression levels
  • of putative target genes.

We found that functional TF binding is enriched

  • in regulatory elements that harbor a large number of TF binding sites,
  • at sites with predicted higher binding affinity, and
  • at sites that are enriched in genomic regions annotated as “active enhancers.”

We aim to be able to predict the expression pattern of a gene based on its regulatory
sequence alone. However, the regulatory code of the human genome is much more complicated than

  • the triplet code of protein coding sequences, and is highly context-specific,
  • depending on cell-type and other factors.

Moreover, regulatory regions are not necessarily organized into

  • discrete, easily identifiable regions of the genome and
  • may exert their influence on genes over large genomic distances

Genomic studies addressing questions of the regulatory logic of the human genome have largely taken one of two approaches.

  1. collecting transcription factor binding maps using techniques such as ChIP-seq
    and DNase-seq
  2. mapping various quantitative trait loci (QTL), such as gene expression levels
    (eQTLs) [7], DNA methylation (meQTLs) [8] and chromatin accessibility (dsQTLs)

Cumulatively, binding map studies and QTL map studies have

  • led to many insights into the principles and mechanisms of gene regulation.

However, there are questions that neither mapping approach on its own is well equipped to address. One outstanding issue is

  • the fraction of factor binding in the genome that is “functional”,
    which we define here to mean that
  • disturbing the protein-DNA interaction leads to a measurable
  • downstream effect on gene regulation.

Transcription factor knockdown could be used to address this problem, whereby

  • the RNA interference pathway is employed to greatly reduce
  • the expression level of a specific target gene by using small interfering RNAs (siRNAs).

The response to the knockdown can then be measured by collecting RNA after the knockdown and

  • measuring global changes in gene expression patterns
  • after specifically attenuating the expression level of a given factor.

Combining a TF knockdown approach with TF binding data can help us to

  • distinguish functional binding from non-functional binding

This approach has previously been applied to the study of human TFs, although for the most part studies have only focused on

  • the regulatory relationship of a single factor with its downstream targets.

The FANTOM consortium knocked down 52 different transcription factors in

  • the THP-1 cell line, an acute monocytic leukemia-derived cell line, and
  • used a subset of these to validate certain regulatory predictions based on binding motif enrichments.

We and others previously studied the regulatory architecture of gene expression in

  • the model system of HapMap lymphoblastoid cell lines (LCLs) using both
  • binding map strategies and QTL mapping strategies.

We now sought to use knockdown experiments targeting transcription factors in a HapMap LCL

  • to refine our understanding of the gene regulatory circuitry of the human genome.

Therefore, we integrated the results of the knockdown experiments with previous data on TF binding to

  • better characterize the regulatory targets of 59 different factors and
  • to learn when a disruption in transcription factor binding
  • is most likely to be associated with variation in the expression level of a nearby gene.

Gene expression levels following the knockdown were compared to

  • expression data collected from six samples that were transfected with negative control siRNA.

The expression data from all samples were normalized together using

  • quantile normalization followed by batch correction using the RUV-2 method.
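As a rough illustration of the first of these two steps, quantile normalization can be sketched in a few lines. This is a minimal toy version, not the authors' pipeline: the function name `quantile_normalize` and the example values are assumptions, ties are broken by position, and the RUV-2 batch correction is not shown.

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalize equal-length samples (one list of values per array)."""
    n_genes = len(columns[0])
    # The mean of the k-th smallest value across samples defines the
    # shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_genes)]
    normalized = []
    for col in columns:
        # Rank each gene within its own sample (ties broken by position,
        # which is a simplification).
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: two arrays, three genes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step every array has exactly the same set of values; only the assignment of values to genes differs between arrays.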

We then performed several quality control analyses to confirm

  1. that the quality of the data was high,
  2. that there were no outlier samples, and
  3. that the normalization methods reduced the influence of confounders

In order to identify genes that were expressed at a significantly different level

  • in the knockdown samples compared to the negative controls,
  • we used likelihood-ratio tests within the framework of a fixed effect linear model.
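The idea of a likelihood-ratio test for a single gene can be sketched as follows, assuming a simple two-group Gaussian model (one shared mean versus separate knockdown and control means) and the asymptotic chi-square approximation with one degree of freedom. The function name and the expression values are hypothetical, and the authors' actual model may include further terms.

```python
import math

def lrt_two_group(knockdown, control):
    """Likelihood-ratio test: one shared mean (null) vs. separate group means."""
    values = list(knockdown) + list(control)
    n = len(values)
    grand = sum(values) / n
    rss_null = sum((v - grand) ** 2 for v in values)
    rss_alt = 0.0
    for group in (knockdown, control):
        m = sum(group) / len(group)
        rss_alt += sum((v - m) ** 2 for v in group)
    if rss_alt == 0.0:
        raise ValueError("no within-group variance; the statistic is undefined")
    # For Gaussian errors the LR statistic is n * ln(RSS_null / RSS_alt);
    # max() guards against tiny negative values from rounding.
    stat = max(n * math.log(rss_null / rss_alt), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log2 expression values: three knockdown replicates, six controls.
stat, p_value = lrt_two_group([7.1, 7.0, 7.2], [8.0, 8.1, 7.9, 8.0, 8.2, 7.8])
```

With only nine arrays per comparison the chi-square approximation is crude, so the sketch shows the logic of the test rather than a calibrated p-value.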

Following normalization and quality control of the arrays,

  • we identified genes that were differentially expressed between
  • the three knockdown replicates of each factor and the six controls.

Depending on the factor targeted, the knockdowns resulted in

  • between 39 and 3,892 differentially expressed genes at an FDR of 5%
    (Figure 1B; see Table S3 for a summary of the results).
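To make the "FDR of 5%" threshold concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. Whether the authors used this procedure or another FDR estimator is not stated in this summary, so treat the choice as an assumption; the p-values are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test is called significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is called significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Hypothetical per-gene p-values from one knockdown experiment.
calls = benjamini_hochberg([0.001, 0.2, 0.012, 0.04, 0.8])
```

Note that the gene with p = 0.04 is not called here even though it is below 0.05, because the threshold scales with the rank of the p-value.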

The knockdown efficiency for the 59 factors ranged

  • from 50% to 90% (based on qPCR; Table S1).

The qPCR measurements of the knockdown level were significantly

  • correlated with estimates of the TF expression levels
  • based on the microarray data (P =0.001; Figure 1C).

Reassuringly, we did not observe a significant correlation between

  • the knockdown efficiency of a given factor and
  • the number of genes classified as differentially expressed.

Because we knocked down 59 different factors in this experiment

  • we were able to assess general patterns associated with the perturbation of transcription factors
  • beyond merely the number of affected target genes.

Globally, despite the range in the number of genes we identified as

  • differentially expressed in each knockdown,
  • the effect sizes of the differences in expression were relatively modest and
  • consistent in magnitude across all knockdowns.

The median effect size following the knockdown experiment for genes classified as

  • differentially expressed at an FDR of 5% in any knockdown was
  • a 9.2% difference in expression level between the controls and the knockdown (Figure 2),
  • while the median effect size for any individual knockdown experiment ranged between 8.1% and 11.0%.
    (The lack of correlation between knockdown efficiency and the number of differentially expressed genes held whether we estimated the knockdown effect based on qPCR (P = 0.10; Figure 1D) or microarray (P = 0.99; not shown) data.)
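For readers comparing these percentages with the Log2(fold-change) axes of Figure 2, the conversion can be sketched as below. This assumes that a "9.2% difference" means a fold-change of 1.092, which is an interpretation rather than something stated in the text; both function names are hypothetical.

```python
import math

def percent_difference(log2_fold_change):
    """Percent change in expression implied by a log2 fold-change (sign ignored)."""
    return (2 ** abs(log2_fold_change) - 1) * 100

def log2_fold_change_for(percent):
    """Absolute log2 fold-change corresponding to a percent difference."""
    return math.log2(1 + percent / 100)

# Under the stated assumption, a 9.2% difference is roughly 0.127 on the log2 scale.
median_lfc = log2_fold_change_for(9.2)
```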

Nor did we observe a correlation between

  • variance in qPCR-estimated knockdown efficiency (between replicates) and
  • the number of genes differentially expressed (P = 0.94; Figure 1E).

We noticed that the large variation in the number of differentially expressed genes

  • extended even to knockdowns of factors from the same gene family.

Figure 1. Differential expression analysis.
(a) Examples of differential expression analysis results for the genes HCST and IRF4. The top two panels are ‘MA plots’ of the mean Log2(expression level) between the knockdown arrays and the controls for each gene (x-axis) to the Log2(Fold-Change) between the knockdowns and controls (y-axis). Differentially expressed genes at an FDR of 5% are plotted in yellow (points 50% larger). The gene targeted by the siRNA is highlighted in red. The bottom two panels are ‘volcano plots’ of the Log2(Fold-Change) between the knockdowns and controls (x-axis) to the P-value for differential expression (y-axis). The dashed line marks the 5% FDR threshold. Differentially expressed genes at an FDR of 5% are plotted in yellow (points 50% larger). The red dot marks the gene targeted by the siRNA.
(b) Barplot of number of differentially expressed genes in each knockdown experiment.
(c) Comparison of the knockdown level measured by qPCR (RNA sample collected 48 hours post-transfection) and the knockdown level measured by microarray.
(d) Comparison of the level of knockdown of the transcription factor at 48 hrs (evaluated by qPCR; x-axis) and the number of genes differentially expressed in the knockdown experiment (y-axis).
(e) Comparison of the variance in knockdown efficiency between replicates for each transcription factor (evaluated by qPCR; x-axis) and the number of differentially expressed genes in the knockdown experiment (y-axis).


Figure 2. Effect sizes for differentially expressed genes.
Boxplots of absolute Log2(fold-change) between knockdown arrays and control arrays for all genes identified as differentially expressed in each experiment. The gray bar indicates the interquartile range across all genes differentially expressed in all knockdowns. Boxplots are ordered by the number of genes differentially expressed in each experiment. Outliers were not plotted.


Knocking down SREBF2 (1,286 genes differentially expressed), a key regulator of cholesterol homeostasis,

  • results in changes in the expression of genes that are
  • significantly enriched for cholesterol and sterol biosynthesis annotations.

While not all factors exhibited striking enrichments for relevant functional categories and pathways,

  • the overall picture is that perturbations of many of the factors
  • primarily affected pathways consistent with their known biology.

In order to assess functional TF binding, we next incorporated

  • binding maps together with the knockdown expression data.

We combined binding data based on DNase-seq footprints in 70 HapMap LCLs, reported by Degner et al. (Table S5)

  • and from ChIP-seq experiments in LCL GM12878, published by ENCODE.

We were thus able to obtain genome wide binding maps for a total of 131 factors that were either

  • directly targeted by an siRNA in our experiment (29 factors) or were
  • differentially expressed in one of the knockdown experiments.

We classified a gene as a bound target of a particular factor when

  • binding of that factor was inferred within 10kb of the transcription start site (TSS) of the target gene.

Using this approach, we found that the 131 TFs were bound

  • in proximity to a median of 1,922 genes per factor (range 11 to 7,053 target genes).

We considered binding of a factor to be functional if the target gene

  • was differentially expressed after perturbing the expression level of the bound transcription factor.
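The two definitions above (bound, then functional) can be sketched as a small set computation. The gene names, coordinates and function name are hypothetical, and treating a distance of exactly 10 kb as "within" the window is an assumption.

```python
def bound_genes(tss_by_gene, binding_sites, window=10_000):
    """Genes with at least one binding site within `window` bp of their TSS.

    tss_by_gene: {gene: (chromosome, tss_position)}
    binding_sites: iterable of (chromosome, position)
    """
    sites_by_chrom = {}
    for chrom, pos in binding_sites:
        sites_by_chrom.setdefault(chrom, []).append(pos)
    bound = set()
    for gene, (chrom, tss) in tss_by_gene.items():
        if any(abs(pos - tss) <= window for pos in sites_by_chrom.get(chrom, [])):
            bound.add(gene)
    return bound

# Hypothetical toy inputs for a single factor.
tss = {"GENE_A": ("chr1", 50_000), "GENE_B": ("chr1", 200_000), "GENE_C": ("chr2", 10_000)}
sites = [("chr1", 58_500), ("chr2", 45_000)]
de_genes = {"GENE_A", "GENE_B"}  # differentially expressed after the knockdown

bound = bound_genes(tss, sites)
# Binding is called functional when the bound gene is also differentially expressed.
functional = bound & de_genes
```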

We then asked about the concordance between

  • the transcription factor binding data and the knockdown expression data.

In particular, we asked

  • to what extent differences in gene expression levels following the knockdowns
  • might be predicted by binding of the transcription factors
  • within the putative regulatory regions of the responsive genes, and also
  • what proportion of putative target (bound) genes of a given TF were
  • differentially expressed following the knockdown of the factor.

Focusing only on the binding sites classified using the DNase-seq data
(which were assigned to a specific instance of the binding motif, unlike the ChIP data),

  • we examined sequence features that might distinguish functional binding.

In particular, we asked whether binding at conserved sites was more likely to be functional and

  • whether binding sites that better matched the known PWM for the factor were more likely to be functional.

We did not observe a significant shift in the conservation of functional binding sites (Wilcoxon rank sum P = 0.34),

  • but we did observe that binding around differentially expressed genes occurred at sites
  • that were significantly better matches to the canonical binding motif.
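What "a better match to the PWM" means can be illustrated with a log-odds score. The sketch below assumes a uniform background and uses an invented three-position matrix; it does not reproduce the scoring used in the study.

```python
import math

def pwm_log_odds(site, pwm, background=0.25):
    """Sum of per-position log2 odds of `site` under the PWM vs. a uniform background."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))

# Hypothetical position weight matrix: one probability table per motif position.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

strong = pwm_log_odds("AGC", toy_pwm)  # matches the consensus at every position
weak = pwm_log_odds("AGT", toy_pwm)    # mismatch at the last position
```

A higher score means the site looks more like the canonical motif, which serves as a proxy for predicted binding affinity.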

Figure 3. Intersecting binding data and expression data for each knockdown.
(a) Example Venn diagrams showing the overlap of binding and differential expression for the knockdowns of HCST and IRF4 (the same genes as in Figure 1).
(b) Boxplot summarizing the distribution of the fraction of all expressed genes that are bound by the targeted gene or downstream factors.
(c) Boxplot summarizing the distribution of the fraction of bound genes that are classified as differentially expressed, using an FDR of either 5% or 20%.


Considering bound targets determined from either the ChIP-seq or DNase-seq data, we observed that

  • differentially expressed genes were associated with both
  • a higher number of binding events for the relevant factors within 10 kb of the TSS (P < 10^-16; Figure 4A)
  • as well as with a larger number of different binding factors
    (considering the siRNA-targeted factor and any TFs that were DE in the knockdown; P < 10^-16; Figure 4B).

Figure 4. Degree of binding correlated with function. Boxplots comparing
(a) the number of sites bound, and
(b) the number of differentially expressed transcription factors binding events near functionally or non-functionally bound genes. We considered binding for siRNA-targeted factor and any factor differentially expressed in the knockdown.
(c) Focusing only on genes differentially expressed in common between each pairwise set of knockdowns we tested for enrichments of functional binding (y-axis). Pairwise comparisons between knockdown experiments were binned by the fraction of differentially expressed transcription factors in common between the two experiments. For these boxplots, outliers were not plotted.


We examined the distribution of binding about the TSS. Most factor binding was concentrated

  • near the TSS whether or not the genes were classified as differentially expressed (Figure 5A).
  • However, the distance from the TSS to the binding sites was significantly longer for differentially expressed genes (P < 10^-16; Fig. 5B).

Figure 5. Distribution of functional binding about the TSS.
(a) A density plot of the distribution of bound sites within 10 kb of the TSS for both functional and non-functional genes. Inset is a zoom-in of the region ±1 kb from the TSS. (b) Boxplots comparing the distances from the TSS to the binding sites for functionally bound genes and non-functionally bound genes. For the boxplots, 0.001 was added before log10 transforming the distances and outliers were not plotted.



We investigated the distribution of factor binding across various chromatin states, as defined by Ernst et al. This dataset lists

  • regions of the genome that have been assigned to different activity states
  • based on ChIP-seq data for various histone modifications and CTCF binding.

For each knockdown, we separated binding events

  • by the genomic state in which they occurred and then
  • tested whether binding in that state was enriched around differentially expressed genes.

After correcting for multiple testing, we found that around genes that were differentially expressed

  • 19 knockdowns showed significant enrichment for binding in “strong enhancers”,
  • four knockdowns had significant enrichments for “weak enhancers”,
  • eight knockdowns showed significant depletion of binding in “active promoters”, and
  • six knockdowns had significant depletions for “transcription elongation”.
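One way to test for such an enrichment is a one-sided hypergeometric test, sketched below. The statistical test actually used in the study is not given in this summary, so the choice of test, the function name and the counts are all assumptions.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_in_state, n_de, n_de_in_state):
    """P(X >= n_de_in_state) when n_de genes are drawn at random from n_total
    bound genes, of which n_in_state have binding in the chromatin state."""
    upper = min(n_in_state, n_de)
    numerator = sum(
        comb(n_in_state, k) * comb(n_total - n_in_state, n_de - k)
        for k in range(n_de_in_state, upper + 1)
    )
    return numerator / comb(n_total, n_de)

# Hypothetical counts: 20 bound genes, 5 with binding in the state of interest,
# 4 differentially expressed, 3 of which have binding in that state.
p_enriched = hypergeom_enrichment_p(20, 5, 4, 3)
```

In an analysis with one such test per knockdown and chromatin state, the resulting p-values would still need a multiple-testing correction.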

Did the factors tend to have a consistent effect (either up- or down-regulation)

  • on the expression levels of genes they purportedly regulated?

All factors we tested are associated with both up- and down-regulation of downstream targets (Figure 6).

A slight majority of downstream target genes were expressed at higher levels

  • following the knockdown for 15 of the 29 factors for which we had binding information (Figure 6B).

The factor that is associated with the largest fraction (68.8%) of up-regulated target genes following the knockdown is EZH2,

  • the enzymatic component of the Polycomb group complex.

On the other end of the spectrum was JUND, a member of the AP-1 complex, for which

  • 66.7% of differentially expressed targets were down-regulated following the knockdown.

Figure 6. Magnitude and direction of differential expression after knockdown.
(a) Density plot of all Log2(fold-changes) between the knockdown arrays and controls for genes that are differentially expressed at 5% FDR in one of the knockdown experiments as well as bound by the targeted transcription factor.
(b) Plot of the fraction of differentially expressed putative direct targets that were up-regulated in each of the knockdown experiments.


We found no correlation between the number of paralogs and the fraction of bound targets that were differentially expressed. We also did not observe a significant correlation when we considered whether

  • the percent identity of the closest paralog might be predictive of
  • the fraction of bound genes that were differentially expressed following the knockdown (Figure S8).

While there is compelling evidence for our inferences, the current chromatin functional annotations

  • do not fully explain the regulatory effects of the knockdown experiments.

For example, the enrichments for binding in “strong enhancer” regions of the genome range from 7.2% to 50.1% (median = 19.2%),

  • much beyond what is expected by chance alone, but far from accounting for all functional binding.

In addition to considering

  • the distinguishing characteristics of functional binding, we also examined
  • the direction of effect that perturbing a transcription factor had on the expression level of its direct targets.

We specifically addressed whether

  • knocking down a particular factor tended to drive expression of its putatively direct (namely, bound) targets up or down,
  • which can be used to infer that the factor represses or activates the target, respectively.
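This inference can be sketched as a simple tally over a factor's bound, differentially expressed targets. The fold-change values below are invented, and the labels returned by `inferred_role` are a simplification for illustration only.

```python
def fraction_upregulated(log2_fold_changes):
    """Fraction of targets whose expression rose after the knockdown."""
    up = sum(1 for lfc in log2_fold_changes if lfc > 0)
    return up / len(log2_fold_changes)

def inferred_role(frac_up):
    # Targets going up after the factor is removed suggest the factor represses them;
    # targets going down suggest it activates them.
    if frac_up > 0.5:
        return "mostly repressor"
    if frac_up < 0.5:
        return "mostly activator"
    return "mixed"

# Hypothetical knockdown: 11 of 16 bound, differentially expressed targets go up.
toy_lfcs = [0.2] * 11 + [-0.15] * 5
frac_up = fraction_upregulated(toy_lfcs)
role = inferred_role(frac_up)
```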

Transcription factors have traditionally been thought of primarily as activators, and previous work from our group is consistent with that notion. Surprisingly, the most straightforward inference from the present study is that

  • many of the factors function as repressors at least as often as they function as activators.
  1. EZH2 had a negative regulatory relationship with the largest fraction of direct targets (68.8%),
    consistent with the known role of EZH2 as the active member of the Polycomb group complex PRC2,
  2. while JUND seemed to have a positive regulatory relationship with the largest fraction of direct targets (66.7%),
    consistent with the biochemical characterization of the AP-1 complex (of which JUND is a component) as a transactivator.

More generally, however, our results, combined with the previous work from our group and others, make for a complicated view

  • of the role of transcription factors in gene regulation as
  • it seems difficult to reconcile the inference from previous work that
  • many transcription factors should primarily act as activators with the results presented here.

One somewhat complicated hypothesis, which nevertheless can resolve the apparent discrepancy, is that

  • the “repressive” effects we observe for known activators may be
  • at sites in which the activator is acting as a weak enhancer of transcription and
  • that reducing the cellular concentration of the factor
  • releases the regulatory region to binding by an alternative, stronger activator.

To more explicitly address the effect that our proximity-based definition of target genes might have on our analyses, we reanalyzed

  • the overlap between factor binding and differential expression following the knockdowns
  • using an independent, empirically determined set of target genes.

Thurman et al. used correlations in DNase hypersensitivity between

  • intergenic hypersensitive sites and promoter hypersensitive sites across diverse tissues
  • to assign intergenic regulatory regions to specific genes,
  • independently of proximity to a particular promoter.

We performed this alternative analysis in which we

  • assigned binding events to genes based on the classification of Thurman et al.

We then considered the overlap between binding and differential expression in this new data set. The results were largely

  • consistent with our proximity-based observations.

A median of 9.5% of genes that were bound by a factor were

  • also differentially expressed following the knockdown of that factor
    (compared to 11.1% when the assignment of binding sites to genes is based on proximity).

From the opposite perspective, a median of 28.0% of differentially expressed genes were bound by that factor
(compared to 32.3% for the proximity based definition). The results of this analysis are summarized in Table S7.
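The two perspectives quoted here are two different conditional fractions computed from the same overlap. A minimal sketch with invented gene sets (not the study data) makes the distinction explicit:

```python
def overlap_fractions(bound, differentially_expressed):
    """Return (fraction of bound genes that are DE, fraction of DE genes that are bound)."""
    both = bound & differentially_expressed
    frac_bound_that_are_de = len(both) / len(bound)
    frac_de_that_are_bound = len(both) / len(differentially_expressed)
    return frac_bound_that_are_de, frac_de_that_are_bound

# Hypothetical sets for one factor: nine bound genes, three DE genes, one shared.
toy_bound = {f"g{i}" for i in range(1, 10)}
toy_de = {"g1", "g10", "g11"}
frac_bound_de, frac_de_bound = overlap_fractions(toy_bound, toy_de)
```

Because there are usually many more bound genes than differentially expressed genes, the first fraction tends to be much smaller than the second.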

Our results should not be considered a comprehensive census of regulatory events in the human genome. Instead, we adopted a gene-centric approach,

  • focusing only on binding events near the genes for which we could measure expression
  • to learn some of the principles of functional transcription factor binding.

In light of our observations a reassessment of our estimates of binding may be warranted. In particular, because functional binding is skewed away from promoters (our system is apparently not well-suited to observe functional promoter binding, perhaps because of protection by large protein complexes),

  • a more conservative estimate of the fraction of binding that is indeed functional would not consider data within the promoter.

Importantly, excluding the putative promoter region from our analysis (i.e. only considering a window >1 kb from the TSS and <10 kb from the TSS)

  • does not change our conclusions.

Considering this smaller window,

  • a median of 67.0% of expressed genes are still classified as bound by
  1. either the knocked down transcription factor or
  2. a downstream factor that is differentially expressed in each experiment,

yet a median of only 8.1% of the bound genes are

  • also differentially expressed after the knockdowns.

Much of what distinguishes functional binding (as we define it) has yet to be explained. We are unable to explain much of the differential expression observed in our experiments by the presence of at least one relevant binding event. This may not be altogether surprising, as

  • we are only considering binding in a limited window around the transcription start site.

To address these issues, more factors should be perturbed to further evaluate the robustness of our results and to add insight. Together, such studies will help us develop a more sophisticated understanding of functional transcription factor binding in particular, and of gene regulatory logic more generally.

Assessing quality and completeness of human transcriptional regulatory pathways on a genome-wide scale

E Shmelkov, Z Tang, I Aifantis, A Statnikov*
Biology Direct 2011; 6(15).

Recently, biological pathways have become a common, and probably the most popular, form of representing biochemical information for hypothesis generation and validation. These maps store broad knowledge of the complex molecular interactions and regulation occurring in the living organism in a simple and accessible way, often using intuitive graphical notation. Two major types of biological pathways can be distinguished.

  1. Metabolic pathways incorporate complex networks of protein-based interactions and modifications, while
  2. signal transduction and transcriptional regulatory pathways are usually considered to provide information on mechanisms of transcription.

While there are a lot of data collected on human metabolic processes,

  • the content of signal transduction and transcriptional regulatory pathways varies greatly in quality and completeness.

An indicative comparison of MYC transcriptional targets reported in ten different pathway databases reveals that these databases differ greatly from each other (Figure 1). Given that MYC is involved

  • in the transcriptional regulation of approximately 15% of all genes,

one cannot argue that the majority of pathway databases that contain

  • fewer than thirty putative transcriptional targets of MYC are even close to complete.

More importantly, to date there have been no prior genome-wide evaluation studies (that are based on genome-wide binding and gene expression assays) assessing pathway databases.

Background: While pathway databases are becoming increasingly important in most types of biological and translational research, little is known about the quality and completeness of pathways stored in these databases. The present study conducts a comprehensive assessment of transcriptional regulatory pathways in humans for seven well-studied transcription factors:

  1. MYC,
  2. NOTCH1,
  3. BCL6,
  4. TP53,
  5. AR,
  6. STAT1,
  7. RELA.

The employed benchmarking methodology first involves integrating

  • genome-wide binding with functional gene expression data
  • to derive direct targets of transcription factors.

Then the lists of experimentally obtained direct targets

  • are compared with relevant lists of transcriptional targets from 10 commonly used pathway databases.

Results: The results of this study show that for the majority of pathway databases,

  • the overlap between experimentally obtained target genes and
  • targets reported in transcriptional regulatory pathway databases is
  • surprisingly small and often is not statistically significant.

The only exception is the MetaCore pathway database, which

  • yields a statistically significant intersection with experimental results in 84% of cases.

The lists of experimentally derived direct targets obtained in this study can be used

  • to reveal new biological insight into transcriptional regulation, and
  • to suggest novel putative therapeutic targets in cancer.

Conclusions: Our study opens a debate on validity of using many popular pathway databases to obtain transcriptional regulatory targets. We conclude that the choice of pathway databases should be informed by

  • solid scientific evidence and rigorous empirical evaluation.

In the current study we perform

(1) an evaluation of ten commonly used pathway databases,

  • assessing the transcriptional regulatory pathways, considered in the current study as
  • the interactions of the type ‘transcription factor-transcriptional targets’.

This involves integration of human genome wide functional microarray or RNA-seq gene expression data with

  • protein-DNA binding data from ChIP-chip, ChIP-seq, or ChIP-PET platforms
  • to find direct transcriptional targets of the seven well known transcription factors:
  • MYC, NOTCH1, BCL6, TP53, AR, STAT1, and RELA.

The choice of transcription factors is based on their important role in oncogenesis and availability of binding and expression data in the public domain.

(2) the lists of experimentally derived direct targets are used to assess the quality and completeness of 84 transcriptional regulatory pathways from four publicly available (BioCarta, KEGG, WikiPathways and Cell Signaling Technology) and six commercial (MetaCore, Ingenuity Pathway Analysis, BKL TRANSPATH, BKL TRANSFAC, Pathway Studio and GeneSpring Pathways) pathway databases.

(3) We measure the overlap between pathways and experimentally obtained target genes and assess statistical significance of this overlap, and we demonstrate that experimentally derived lists of direct transcriptional targets

  • can be used to reveal new biological insight on transcriptional regulation.

We show this by analyzing common direct transcriptional targets of

  • MYC, NOTCH1 and RELA
  • that act in interconnected molecular pathways.

Detection of such genes is important as it could reveal novel targets of cancer therapy.

Figure 1 Number of genes in common between MYC transcriptional targets derived from ten different pathway databases. Cells are colored according to their values from white (low values) to red (high values). (not shown)

Figure 2 Illustration of statistical methodology for comparison between a gold-standard and a pathway database

Since we are seeking to compare gene sets from different studies/databases, it is essential to transform genes to standard identifiers. That is why we transformed all gene sets to the HUGO Gene Nomenclature Committee approved gene symbols and names. In order to assess statistical significance of the overlap between the resulting gene sets, we used the hypergeometric test at the 5% α-level with false discovery rate correction for multiple comparisons by the method of Benjamini and Yekutieli. The alternative hypothesis of this test is that two sets of genes (set A from the pathway database and set B from experiments) have a greater number of genes in common than two randomly selected gene sets with the same number of genes as in sets A and B.

For example, consider that for some transcription factor there are 300 direct targets in pathway database #1 and 700 in the experimentally derived list (gold-standard), and their intersection is 16 genes (Figure 2a). If we select at random from a total of 20,000 genes two sets with 300 and 700 genes each, their overlap would be greater than or equal to 16 genes 6.34% of the time. Thus, this overlap will not be statistically significant at the 5% α-level (p = 0.0634). On the other hand, consider that for pathway database #2 there are 30 direct targets of that transcription factor, and their intersection with the 700-gene gold-standard is only 6 genes. Even though the size of this intersection is rather small, it is unlikely to randomly select 30 genes (out of 20,000) with an overlap greater than or equal to 6 genes with a 700-gene gold-standard (p = 0.0005, see Figure 2a). This overlap is statistically significant at the 5% α-level.

We also calculate an enrichment fold change ratio (EFC) for every intersection between a gold-standard and a pathway database. For a given pair of a gold-standard and a pathway database, EFC is equal to the observed number of genes in their intersection, divided by the expected size of the intersection under the null hypothesis (plus machine epsilon, to avoid division by zero). Notice, however, that larger values of EFC may correspond to databases that are highly incomplete and contain only a few relations. For example, consider that for some transcription factor there are 300 direct targets in pathway database #1 and 50 in the experimentally derived list (gold-standard), and their intersection is 30 genes (Figure 2b). If we select at random from a total of 20,000 genes two sets with 300 and 50 genes each, their expected overlap under the null hypothesis will be equal to 0.75. Thus, the EFC ratio will be equal to 40 (= 30/0.75). On the other hand, consider that for pathway database #2 there are 2 direct targets of that transcription factor, and their intersection with the 50-gene gold-standard is only 1 gene. Even though the expected overlap under the null hypothesis will be equal to 0.005 and EFC equal to 200 (5 times bigger than for database #1), the size of this intersection with the gold-standard is 30 times smaller than for database #1 (Figure 2b).
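The EFC arithmetic in the two examples above can be reproduced with a short sketch. The function names are illustrative assumptions; the expected overlap is taken as the product of the two set sizes divided by the total number of genes, which matches the 0.75 and 0.005 values quoted in the text.

```python
import sys


def expected_overlap(total, size_db, size_gold):
    """Expected intersection size of two random gene sets under the null."""
    return size_db * size_gold / total


def efc(observed, total, size_db, size_gold):
    """Enrichment fold change: observed overlap over expected overlap.
    Machine epsilon is added to the denominator to avoid division by zero."""
    expected = expected_overlap(total, size_db, size_gold)
    return observed / (expected + sys.float_info.epsilon)


# Database #1: 300 targets, 50-gene gold-standard, 30 genes in common
efc_db1 = efc(30, 20000, 300, 50)

# Database #2: 2 targets, 50-gene gold-standard, 1 gene in common
efc_db2 = efc(1, 20000, 2, 50)
```

The second database scores five times higher despite sharing a single gene with the gold-standard, which is why EFC is best read alongside the raw overlap size and the significance test rather than on its own.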

Figure 3 Comparison between different pathway databases and experimentally derived gold-standards for all considered transcription factors. Value in a given cell is a number of overlapping genes between a gold-standard and a pathway-derived gene set. Cells are colored according to their values from white (low values) to red (high values). Underlined values in red represent statistically significant intersections. (not shown)

Figure 4 Summary of the pathway databases assessment. Green cells represent statistically significant intersections between experimentally derived gold-standards and transcriptional regulatory pathways. White cells denote results that are not statistically significant. Numbers are the enrichment fold change ratios (EFC) calculated for each intersection. (not shown)

At the core of this study was the creation of gold-standards of transcriptional regulation in humans that can be compared with target genes reported in transcriptional regulatory pathways. We focused on seven well-known transcription factors and obtained gold-standards

  • by integrating genome-wide transcription factor-DNA binding data (from ChIP-chip, ChIP-seq, or ChIP-PET platforms)
  • with functional gene expression microarray and RNA-seq data.

The latter data allow changes in the transcriptome to be surveyed on a genome-wide scale

  • after the inhibition or over-expression of the transcription factor in question.

However, a change in the expression of a particular gene could be caused either by the direct effect of the removal or introduction of a given transcription factor, or by an indirect effect, through the change in expression level of some other gene(s). It is essential

  • to integrate data from these two sources to
  • obtain an accurate list of gene targets that are directly regulated by a transcription factor.
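The integration step can be pictured as a set operation. The sketch below assumes, as a simplification, that a direct target is a gene that is both bound by the factor and differentially expressed after its perturbation; the function name and the gene identifiers are hypothetical placeholders, and the published pipeline involves additional platform-specific processing not shown here.

```python
def classify_targets(bound_genes, de_genes):
    """Split genes into direct targets, indirect candidates, and
    binding events with no detected expression change."""
    direct = bound_genes & de_genes        # bound and differentially expressed
    indirect = de_genes - bound_genes      # expression changed, no binding seen
    bound_only = bound_genes - de_genes    # bound, expression unchanged
    return direct, indirect, bound_only


# Hypothetical toy input for one transcription factor
bound = {"GENE_A", "GENE_B", "GENE_C"}
changed = {"GENE_B", "GENE_C", "GENE_D"}

direct, indirect, bound_only = classify_targets(bound, changed)
```

Only the `direct` set would enter a gold-standard under this assumption; genes in `indirect` may respond through some other regulator whose expression changed.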

It is worth noting that the tested pathway databases typically do not distinguish between cell lines, experimental conditions, and other details relevant to the experimental systems in which data were obtained. These databases in a sense propose a ‘universal’ list of transcriptional targets. However, it is known that

  • transcriptional regulation in a cell is dynamic and works differently for different systems and stimuli.

This accentuates the major limitation of pathway databases and emphasizes

  • the importance of deriving a specific list of transcriptional targets for the current experimental system.

In this study we followed the latter approach by developing gold-standards for specific, well-characterized biological systems and experimental conditions.

The approach used here for building gold-standards of direct mechanistic knowledge has several limitations (see article). Nevertheless, our results suggest that multiple transcription factors can co-operate and control both physiological differentiation and malignant transformation, as demonstrated utilizing combinatorial gene-profiling for

  • NOTCH1, MYC and RELA targets.

These studies might lead us to multi-pathway gene expression “signatures”

  • essential for the prediction of genes that could be targeted in cancer treatments.

In agreement with this hypothesis, several of the genes identified in our analysis have been suggested to be putative therapeutic targets in leukemia, with either preclinical or clinical trials underway (CDK4, CDK6, GSK3b, MYC, LCK, NFkB2, BCL2L1, NOTCH1).

Single-molecule tracking in live cells reveals distinct target-search strategies of transcription factors in the nucleus

I Izeddin†, V Récamier†‡, L Bosanac, II Cissé, L Boudarene, et al.
1Functional Imaging of Transcription, Institut de Biologie de l’Ecole Normale Supérieure (IBENS), Inserm, and CNRS UMR; 2Laboratoire Kastler Brossel, CNRS UMR, Departement de Physique et Institut de Biologie de l’Ecole Normale Supérieure (IBENS), Paris, Fr; 3Transcription Imaging Consortium, Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, US; + more.
Biophysics and structural biology | Cell biology eLife 2014;3:e02230.

Transcription factors are

  • proteins that control the expression of genes in the nucleus, and
  • they do this by binding to other proteins or DNA.

First, however, these regulatory proteins need to overcome the challenge of

  • finding their targets in the nucleus, which is crowded with other proteins and DNA.

Much research to date has focused on measuring how fast proteins can diffuse and spread out throughout the nucleus. However, these measurements only make sense if these proteins have access to the same space within the nucleus.

Now, Izeddin, Récamier et al. have developed a new technique to track

  • single protein molecules in the nucleus of mammalian cells.

A transcription factor called c-Myc and another protein called P-TEFb

  • were tracked and while they diffused at similar rates,
  • they ‘explored’ the space inside the nucleus in very different ways.

Izeddin, Récamier et al. found that c-Myc explores the nucleus in a so-called ‘non-compact’ manner: this means that it

  • can move almost everywhere inside the nucleus, and has an equal chance
  • of reaching any target regardless of its position in this space.

P-TEFb, on the other hand, searches

  • the nucleus in a ‘compact’ way.

This means that it is constrained to follow a specific path

  • through the nucleus and is therefore guided to its potential targets.

Izeddin, Récamier et al. explain that

  • the different ‘search strategies’ used by these two proteins
  • influence how long it takes them to find their targets and
  • how far they can travel in a given time.

These findings, together with information about

  • where and when different proteins interact in the nucleus,

will be essential to understand how the organization of the genome within the nucleus

  • can control the expression of genes.

The next challenge will now be to

  • uncover what determines a
  • protein’s search strategy in the nucleus, as well as
  • the potential ways that this strategy might be regulated.

Mueller et al., 2010; Normanno et al., 2012). These transient interactions are essential to ensure a fine regulation of binding site occupancy—by competition or by altering the TF concentration—but must also be persistent enough to enable the assembly of multicomponent complexes (Dundr, 2002; Darzacq and Singer, 2008; Gorski et al., 2008; Cisse et al., 2013).
In parallel to the experimental evidence of the fast diffusive motion of nuclear factors, our understanding of the intranuclear space has evolved from a homogeneous environment to an organelle where spatial arrangement among genes and regulatory sequences plays an important role in transcriptional control (Heard and Bickmore, 2007). The nucleus of eukaryotes displays a hierarchy of organized structures (Gibcus and Dekker, 2013) and is often referred to as a crowded environment.
How crowding influences transport properties of macromolecules and organelles in the cell is a fundamental question in quantitative molecular biology. While a restriction of the available space for diffusion can slow down transport processes, it can also channel molecules towards their targets increasing their chance to meet interacting partners. A widespread observation in quantitative cell biology is that the diffusion of molecules is anomalous, often attributed to crowding in the nucleoplasm, cytoplasm, or in the membranes of the cell (Höfling and Franosch, 2013). An open debate remains on how to determine whether diffusion is anomalous or normal (Malchus and Weiss, 2009; Saxton, 2012), and the mechanisms behind anomalous diffusion (Saxton, 2007). The answer to these questions bears important consequences for the understanding of the biochemical reactions of the cell.
The problem of diffusing molecules in non-homogeneous media has been investigated in different fields. Following the seminal work of de Gennes (1982a, 1982b) in polymer physics, the study of diffusivity of particles and their reactivity has been generalized to random or disordered media (Kopelman, 1986; Lindenberg et al., 1991). These works have set a framework to interpret the mobility of macromolecular complexes in the cell, and recently in terms of kinetics of biochemical reactions (Condamin et al., 2007). Experimental evidence has also been found, showing the influence of the glass-like properties of the bacterial cytoplasm on the molecular dynamics of intracellular processes (Parry et al., 2014). These studies demonstrate that the geometry of the medium in which diffusion takes place has important repercussions for the search kinetics of molecules. The notion of compact and non-compact exploration was introduced by de Gennes (1982a) in the context of dense polymers and describes two fundamental types of diffusive behavior. While a non-compact explorer leaves a significant number of available sites unvisited, a compact explorer performs a redundant exploration of the space. In chemistry, the influence of compactness is well established to describe dimensional effects on reaction rates (Kopelman, 1986).
In this study, we aim to elucidate the existence of different types of mobility of TFs in the eukaryotic nucleus, as well as the principles governing nuclear exploration of factors relevant to transcriptional control. To this end, we used single-molecule (SM) imaging to address the relationship between the nuclear geometry and the search dynamics of two nuclear factors having distinct functional roles: the proto-oncogene c-Myc and the positive transcription elongation factor (P-TEFb). c-Myc is a basic helix-loop-helix DNA-binding transcription factor that binds to E-Boxes; 18,000 E-boxes are found in the genome, and c-Myc affects the transcription of numerous genes (Gallant and Steiger, 2009).
Recently, c-Myc has been demonstrated to be a general transcriptional activator upregulating transcription of nearly all genes (Lin et al., 2012; Nie et al., 2012). P-TEFb is an essential actor in the transcription regulation driven by RNA Polymerase II. P-TEFb is a cyclin-dependent kinase, comprising a CDK9 and a Cyclin T subunit. It phosphorylates the elongation control factors SPT5 and NELF to allow productive elongation of class II gene transcription (Wada et al., 1998). The carboxy-terminal domain (CTD) of the catalytic subunit RPB1 of polymerase II is also a major target of P-TEFb (Zhou et al., 2012). c-Myc and P-TEFb are therefore two good examples of transcriptional regulators binding to numerous sites in the nucleus; the latter binds to the transcription machinery itself and the former directly to DNA.

Single particle tracking (SPT) constitutes a powerful method to probe the mobility of molecules in living cells (Lord et al., 2010). In the nucleus, SPT was first employed to investigate the dynamics of mRNAs (Fusco et al., 2003; Shav-Tal et al., 2004) or for rheological measurements of the nucleoplasm using inert probes (Bancaud et al., 2009). Recently, the tracking of single nuclear factors has been facilitated by the advent of efficient in situ tagging methods such as Halo tags (Mazza et al., 2012). An alternative approach takes advantage of photoconvertible tags (Lippincott-Schwartz and Patterson, 2009) and photoactivated localization microscopy (PALM) (Betzig et al., 2006; Hess et al., 2006). Single particle tracking PALM (sptPALM) was first used to achieve high-density diffusion maps of membrane proteins (Manley et al., 2008). However, sptPALM experiments have typically been limited to proteins with slow mobility (Manley et al., 2008) or those that undergo restricted motions (Frost et al., 2010; English et al., 2011).

Recently, by inclusion of light-sheet illumination, it has been used to determine the binding characteristics of TFs to DNA (Gebhardt et al., 2013). In this study, we developed a new sptPALM procedure adapted for the recording of individual proteins rapidly diffusing in the nucleus of mammalian cells. We used the photoconvertible fluorophore Dendra2 (Gurskaya et al., 2006) and took advantage of tilted illumination (Tokunaga et al., 2008). A careful control of the photoconversion rate minimized the background signal due to out-of-focus activated molecules, and we could thus follow the motion of individual proteins freely diffusing within the nuclear volume. With this sptPALM technique, we recorded large data sets (on the order of 10^4 single translocations in a single imaging session), which were essential for a proper statistical analysis of the search dynamics.
We applied our technique to several nuclear proteins and found that diffusing factors do not sense a unique nucleoplasmic architecture: c-Myc and P-TEFb adopt different nuclear space-exploration strategies, which drastically change the way they reach their specific targets. The differences observed between the two factors were not due to their diffusive kinetic parameters but to the geometry of their exploration path. c-Myc and our control protein, ‘free’ Dendra2, showed free diffusion in a three-dimensional nuclear space. In contrast, P-TEFb explored the nuclear volume by sampling a space of reduced dimensionality, displaying characteristics of exploration constrained in fractal structures.
The role of the space-sampling mode in the search strategy has long been discussed from a theoretical point of view (de Gennes, 1982a; Kopelman, 1986; Lindenberg et al., 1991). Our experimental results support the notion that it could indeed be a key parameter for diffusion-limited chemical reactions in the closed environment of the nucleus (Bénichou et al., 2010). We discuss the implications of our observations in terms of gene expression control, and its relation to the spatial organization of genes within the nucleus.
