**mRNA data survival analysis**

Curators: Larry H. Bernstein, MD, FCAP and Aviva Lev-Ari, PhD, RN

LPBI

**SURVIV for survival analysis of mRNA isoform variation**

Shihao Shen, Yuanyuan Wang, Chengyang Wang, Ying Nian Wu & Yi Xing

Nature Communications7,Article number:11548 Feb 2016 doi:10.1038/ncomms11548

The rapid accumulation of clinical RNA-seq data sets has provided the opportunity to associate mRNA isoform variations to clinical outcomes. Here we report a statistical method SURVIV (__Surv__ival analysis of mRNA __I__soform __V__ariation), designed for identifying mRNA isoform variation associated with patient survival time. A unique feature and major strength of SURVIV is that it models the measurement uncertainty of mRNA isoform ratio in RNA-seq data. Simulation studies suggest that SURVIV outperforms the conventional Cox regression survival analysis, especially for data sets with modest sequencing depth. We applied SURVIV to TCGA RNA-seq data of invasive ductal carcinoma as well as five additional cancer types. Alternative splicing-based survival predictors consistently outperform gene expression-based survival predictors, and the integration of clinical, gene expression and alternative splicing profiles leads to the best survival prediction. We anticipate that SURVIV will have broad utilities for analysing diverse types of mRNA isoform variation in large-scale clinical RNA-seq projects.

Eukaryotic cells generate remarkable regulatory and functional complexity from a finite set of genes. Production of mRNA isoforms through alternative processing and modification of RNA is essential for generating this complexity. A prevalent mechanism for producing mRNA isoforms is the alternative splicing of precursor mRNA^{1}. Over 95% of the multi-exon human genes undergo alternative splicing^{2, 3}, resulting in an enormous level of plasticity in the regulation of gene function and protein diversity. In the last decade, extensive genomic and functional studies have firmly established the critical role of alternative splicing in cancer^{4, 5, 6}. Alternative splicing is involved in a full spectrum of oncogenic processes including cell proliferation, apoptosis, hypoxia, angiogenesis, immune escape and metastasis^{7, 8}. These cancer-associated alternative splicing patterns are not merely the consequences of disrupted gene regulation in cancer but in numerous instances actively contribute to cancer development and progression. For example, alternative splicing of genes encoding the Bcl-2 family of apoptosis regulators generates both anti-apoptotic and pro-apoptotic protein isoforms^{9}. Alternative splicing of the *pyruvate kinase M* (*PKM*) gene has a significant impact on cancer cell metabolism and tumour growth^{10}. A transcriptome-wide switch of the alternative splicing programme during the epithelial–mesenchymal transition plays an important role in cancer cell invasion and metastasis^{11, 12}.

RNA sequencing (RNA-seq) has become a popular and cost-effective technology to study transcriptome regulation and mRNA isoform variation^{13, 14}. As the cost of RNA-seq continues to decline, it has been widely adopted in large-scale clinical transcriptome projects, especially for profiling transcriptome changes in cancer. For example, as of April 2015 The Cancer Genome Atlas (TCGA) consortium had generated RNA-seq data on over 11,000 cancer patient specimens from 34 different cancer types. Within the TCGA data, breast invasive carcinoma (BRCA) has the largest sample size of RNA-seq data covering over 1,000 patients, and clinical information such as survival times, tumour stages and histological subtypes is available for the majority of the BRCA patients^{15}. Moreover, the median follow-up time of BRCA patients is ~400 days, and 25% of the patients have more than 1,200 days of follow-up. Collectively, the large sample size and long follow-up time of the TCGA BRCA data set allow us to correlate genomic and transcriptomic profiles to clinical outcomes and patient survival times.

To date, systematic analyses have been performed to reveal the association between copy number variation, DNA methylation, gene expression and microRNA expression profiles with cancer patient survival^{16, 17}. By contrast, despite the importance of mRNA isoform variation and alternative splicing, there have been limited efforts in transcriptome-wide survival analysis of alternative splicing in cancer patients. Most RNA-seq studies of alternative splicing in cancer transcriptomes focus on identifying ‘cancer-specific’ alternative splicing events by comparing cancer tissues with normal controls (see refs 18, 19, 20, 21, 22, 23 for examples). A recent analysis of TCGA RNA-seq data identified 163 recurrent differential alternative splicing events between cancer and normal tissues of three cancer types, among which five were found to have suggestive survival signals for breast cancer at a nominal *P*-value cutoff of 0.05 (ref. 21). Some other studies reported a significant survival difference between cancer patient subgroups after stratifying patients with overall mRNA isoform expression profiles^{24, 25}. However, systematic cancer survival analyses of alternative splicing at the individual exon resolution have been lacking. Two main challenges exist for survival analyses of mRNA isoform variation and alternative splicing using RNA-seq data. The first challenge is to account for the estimation uncertainty of mRNA isoform ratios inferred from RNA-seq read counts. The statistical confidence of mRNA isoform ratio estimation depends on the RNA-seq read coverage for the events of interest, with larger read coverage leading to a more reliable estimation^{14}. Modelling the estimation uncertainty of mRNA isoform ratio is an essential component of RNA-seq analyses of alternative splicing, as shown by various statistical algorithms developed for detecting differential alternative splicing from multi-group RNA-seq data^{14, 26, 27, 28,29}. The second challenge, which is a general issue in survival analysis, is to properly model the association of mRNA isoform ratio with survival time, while accounting for missing data in survival time because of censoring, that is, patients still alive at the end of the survival study, whose precise survival time would be uncertain. To date, no algorithm has been developed for survival analyses of mRNA isoform variation that accounts for these sources of uncertainty simultaneously.

Here we introduce SURVIV (__Surv__ival analysis of mRNA __I__soform __V__ariation), a statistical model for identifying mRNA isoform ratios associated with patient survival times in large-scale cancer RNA-seq data sets. SURVIV models the estimation uncertainty of mRNA isoform ratios in RNA-seq data and tests the survival effects of isoform variation in both censored and uncensored survival data. In simulation studies, SURVIV consistently outperforms the conventional Cox regression survival analysis that ignores the measurement uncertainty of mRNA isoform ratio. We used SURVIV to identify alternatively spliced exons whose exon-inclusion levels significantly correlated with the survival times of invasive ductal carcinoma (IDC) patients from the TCGA breast cancer cohort. Survival-associated alternative splicing events are identified in gene pathways associated with apoptosis, oxidative stress and DNA damage repair. Importantly, we show that alternative splicing-based survival predictors outperform gene expression-based survival predictors in the TCGA IDC RNA-seq data set, as well as in TCGA data of five additional cancer types. Moreover, the integration of clinical information, gene expression and alternative splicing profiles leads to the best prediction of survival time.

**SURVIV statistical model**

The statistical model of SURVIV assesses the association between mRNA isoform ratio and patient survival time. While the model is generic for many types of alternative isoform variation, here we use the exon-skipping type of alternative splicing to illustrate the model (Fig. 1a). For each alternative exon involved in exon-skipping, we can use the RNA-seq reads mapping to its exon-inclusion or -skipping isoform to estimate its exon-inclusion level (denoted as *ψ*, or PSI that is Per cent Spliced In^{14}). A key feature of SURVIV is that it models the RNA-seq estimation uncertainty of exon-inclusion level as influenced by the sequencing coverage for the alternative splicing event of interest. This is a critical issue in accurate quantitative analyses of mRNA isoform ratio in large-scale RNA-seq data sets^{14, 26, 27, 28, 29}. Therefore, SURVIV contains two major components: the first to model the association of mRNA isoform ratio with patient survival time across all patients, and the second to model the estimation uncertainty of mRNA isoform ratio in each individual patient (Fig. 1a).

Figure 1: The statistical framework of the SURVIV model.

(**a**) For each patient *k*, the patient’s hazard rate *λ*_{k}(*t*) is associated with the baseline hazard rate *λ*_{0}(*t*) and this patient’s exon-inclusion level *ψ*_{k}. The association of exon-inclusion level with patient survival is estimated by the survival coefficient *β*. The exon-inclusion level *ψ*_{k} is estimated from the read counts for the exon-inclusion isoform *IC*_{k} and the exon-skipping isoform *SC*_{k}. The proportion of the inclusion and skipping reads is adjusted by a normalization function *f* that considers the lengths of the exon-inclusion and -skipping isoforms (see details in Results and Supplementary Methods). (**b**) A hypothetical example to illustrate the association of exon-inclusion level with patient survival probability over time *S*_{k}(*t*), with the survival coefficient *β*=−1 and a constant baseline hazard rate *λ*_{0}(*t*)=1. In this example, patients with higher exon-inclusion levels have lower hazard rates and higher survival probabilities. (**c**) The schematic diagram of an exon-skipping event. The exon-inclusion reads *IC*_{k} are the reads from the upstream splice junction, the alternative exon itself and the downstream splice junction. The exon-skipping reads *SC*_{k} are the reads from the skipping splice junction that directly connects the upstream exon to the downstream exon.

Briefly, for any individual exon-skipping event, the first component of SURVIV uses a proportional hazards model to establish the relationship between patient *k*’s exon-inclusion level *ψ*_{k} and hazard rate *λ*_{k}(*t*).

For each exon, the association between the exon-inclusion level and patient survival time is reflected by the survival coefficient *β*. A positive *β* means increased exon inclusion is associated with higher hazard rate and poorer survival, while a negative *β* means increased exon inclusion is associated with lower hazard rate and better survival. *λ*_{0}(*t*) is the baseline hazard rate estimated from the survival data of all patients (see Supplementary Methods for the detailed estimation procedure). A particular patient’s survival probability over time *S*_{k}(*t*) can be calculated from the patient-specific hazard rate *λ*_{k}(*t*) as . Figure 1b illustrates a simple example with a negative *β*=−1 and a constant baseline hazard rate *λ*_{0}(*t*)=1, where higher exon-inclusion levels are associated with lower hazard rates and higher survival probabilities.

The second component of SURVIV models the exon-inclusion level and its estimation uncertainty in individual patient samples. As illustrated in Fig. 1c, the exon-inclusion level *ψ*_{k} of a given exon in a particular sample can be estimated by the RNA-seq read count specific to the exon inclusion isoform (*IC*_{k}) and the exon-skipping isoform (*SC*_{k}). Other types of alternative splicing and mRNA isoform variation can be similarly modelled by this framework^{29}. Given the effective lengths (that is, the number of unique isoform-specific read positions) of the exon-inclusion isoform (*l*_{I}) and the exon-skipping isoform (*l*_{S}), the exon-inclusion level *ψ*_{k} can be estimated as . Assuming that the exon-inclusion read count *IC*_{k} follows a binomial distribution with the total read count *n*_{k}=*IC*_{k}+*SC*_{k}, we have:

The binomial distribution models the estimation uncertainty of *ψ*_{k} as influenced by the total read count *n*_{k}, in which the parameter *p*_{k} represents the proportion of reads from the exon-inclusion isoform, given the exon-inclusion level *ψ*_{k} adjusted by a length normalization function *f*(*ψ*_{k}) based on the effective lengths of the isoforms. The definitions of effective lengths for all basic types of alternative splicing patterns are described in ref. 29.

Distinct from conventional survival analyses in which predictors do not have estimation uncertainty, the predictors in SURVIV are exon-inclusion levels *ψ*_{k} estimated from RNA-seq count data, and the confidence of *ψ*_{k} estimate for a given exon in a particular sample depends on the RNA-seq read coverage. We use the statistical framework of survival measurement error model^{30} to incorporate the estimation uncertainty of isoform ratio in the proportional hazards model. Using a likelihood ratio test, we test whether the exon-inclusion levels have a significant association with patient survival over the null hypothesis *H*_{0}:*β*=0. The false discovery rate (FDR) is estimated using the Benjamini and Hochberg approach^{31}. Details of the parameter estimation and likelihood ratio test in SURVIV are described in Supplementary Methods.

Figure 2: Simulation studies to assess the performance of SURVIV and the importance of modelling the estimation uncertainty of mRNA isoform ratio.

We compared our SURVIV model with Cox regression using point estimates of exon-inclusion levels, which does not consider the estimation uncertainty of the mRNA isoform ratio. (**a**) To study the effect of RNA-seq depth, we simulated the mean total splice junction read counts equal to 5, 10, 20, 50, 80 and 100 reads. We generated two sets of simulations with and without data-censoring. For each simulation, the true-positive rate (TPR) at 5% false-positive rate is plotted. The inset figure shows the empirical distribution of the mean total splice junction read counts in the TCGA IDC RNA-seq data (*x* axis in the log10 scale). (**b**) To faithfully represent the read count distribution in a real data set, we performed another simulation with read counts directly sampled from the TCGA IDC data. Sampled read counts were then multiplied by different factors ranging from 10 to 300% to simulate data sets with different RNA-seq read depth. Continuous and dashed lines represent the performance of SURVIV and Cox regression, respectively. Red lines represent the area under curve (AUC) of the ROC curve (TPR versus false-positive rate plot). Black lines represent the TPR at 5% false-positive rate.

Using these simulated data, we compared SURVIV with Cox regression in two settings, without or with censoring of the survival time. In the setting without censoring, the death and survival time of each individual is known. In the setting with censoring, certain individuals are still alive at the end of the survival study. Consequently, these patients have unknown death and survival time. Here, in the simulation with censoring, we assumed that 85% of the patients were still alive at the end of the study, similar to the censoring rate of the TCGA IDC data set. In both settings and with different depths of RNA-seq coverage, SURVIV consistently outperformed Cox regression in the true-positive rate at the same false-positive rate of 5% (Fig. 2a). As expected, we observed a more significant improvement in SURVIV over Cox regression when the RNA-seq read coverage was low (Fig. 2a).

To more faithfully recapitulate the read count distribution in a real cancer RNA-seq data set, we performed another simulation study with read counts directly sampled from the TCGA IDC data. To assess the influence of RNA-seq read depth on the performance of SURVIV and Cox regression, sampled read counts were then multiplied by different factors ranging from 10 to 300% to simulate data sets with different RNA-seq read depths (Fig. 2b). The TCGA IDC data set has an average RNA-seq depth of ~60 million paired-end reads per patient. Thus, the read depth of these simulated RNA-seq data sets ranged from ~6 million reads to 180 million reads per patient, representing low-coverage RNA-seq studies designed primarily for gene expression analysis^{32} up to high-coverage RNA-seq studies designed primarily for alternative isoform analysis^{29}. At all levels of RNA-seq depth, SURVIV consistently outperformed Cox regression, as reflected by the area under curve of the receiver operating characteristic (ROC) curve as well as the true-positive rate at 5% false-positive rate (Fig. 2b). The improvement of SURVIV over Cox regression was particularly prominent when the read depth was low. For example, at 10% read depth, SURVIV had 7% improvement in area under curve (68% versus 61%) and 8% improvement in the true-positive rate at 5% false-positive rate (46% versus 38%). Collectively, these simulation results suggest that SURVIV achieves a higher accuracy by accounting for the estimation uncertainty of mRNA isoform ratio in RNA-seq data.

**SURVIV analysis of TCGA IDC breast cancer data**

To illustrate the practical utility of SURVIV, we used it to analyse the overall survival time of 682 IDC patients from the TCGA breast cancer (BRCA) RNA-seq data set (see Methods for details of the data source and processing pipeline). We chose to analyse IDC because it is the most frequent type of breast cancer^{33}, comprising ~70% of patients in the TCGA breast cancer data set. To control for the effects of significant clinical parameters such as tumour stage and subtype and identify alternative splicing events associated with patient outcomes across multiple molecular and clinical subtypes, we followed the procedure of Croce and colleagues in analysing mRNA and microRNA prognostic signature of IDC^{33} and stratified the patients according to their clinical parameters. We then conducted SURVIV analysis in 26 clinical subgroups with at least 50 patients in each subgroup. We identified 229 exon-skipping events associated with patient survival in multiple clinical subgroups that met the criteria of SURVIV *P*-value≤0.01 in at least two subgroups of the same clinical parameter (cancer subtype, stage, lymph node, metastasis, tumour size, oestrogen receptor status, progesterone receptor status, HER2 status and age as shown in Fig. 3). DAVID (Database for Annotation, Visualization and Integrated Discovery) Gene Ontology analyses^{34} of the 229 alternative splicing events suggest an enrichment of genes in cancer-related functional categories such as intracellular signalling, apoptosis, oxidative stress and response to DNA damage (Supplementary Fig. 1). Table 1 shows a few selected examples of survival-associated alternative splicing events in cancer-related genes. Using two-means clustering of each individual exon’s inclusion levels, the 682 IDC patients can be segregated into two subgroups with significantly different survival times as illustrated by the Kaplan–Meier survival plot (Fig. 4). We also carried out hierarchical clustering of IDC patients using 176 survival-associated alternative exons (*P*≤0.01; SURVIV analysis of all IDC patients). Using the exon-inclusion levels of these 176 exons, we clustered IDC patients into three major subgroups, with 95, 194 and 389 patients, respectively. As illustrated by the Kaplan–Meier survival plots, the three subgroups had significantly different survival times (Supplementary Fig. 2).

Figure 3: SURVIV analysis of exon-skipping events in the TCGA IDC RNA-seq data set.

IDC patients are stratified into multiple clinical subgroups based on clinical parameters including cancer subtype, stage, lymph node status, metastasis, tumour size, oestrogen receptor status, progesterone receptor status, HER2 status and age. Only clinical subgroups with at least 50 patients are included in further analyses. Numbers of patients in the subgroups are indicated next to the names of the subgroups. Shown in the heatmap are the log10 SURVIV *P*-values of the 229 exons associated with patient survival (*P*≤0.01) in at least two subgroups of the same class of clinical parameters. Turquoise colour indicates positive correlation that higher exon-inclusion levels are associated with higher survival probabilities. Magenta colour indicates negative correlation that lower exon-inclusion levels are associated with higher survival probabilities.

TABLE 1 (not shown)

Figure 4: Kaplan–Meier survival plots of IDC patients stratified by two-means clustering of the exon-inclusion levels of four survival-associated alternative splicing events.

Clustering was generated for each of the four exons separately. Black lines represent patients with high exon-inclusion levels. Red lines represent patients with low exon-inclusion levels. The *P*-values are from SURVIV analysis of the TCGA IDC RNA-seq data. (**a**) *ATRIP*. (**b**) *BCL2L11*. (**c**) *CD74*. (**d**) *PCBP4*.

Figure 5: Alternative splicing of *STAT5A* exon 5 is significantly associated with IDC patient survival.

(**a**) The gene structure of the *STAT5A* full-length isoform compared to the ΔEx5 isoform skipping the 5th exon. (**b**) Kaplan–Meier survival plot of IDC patients stratified by two-means clustering using exon-inclusion levels of *STAT5A* exon 5. The 420 patients in Group 1 (average exon 5 inclusion level=95%) have significantly higher survival probabilities than the 262 patients in Group 2 (average exon 5 inclusion level=85%) (SURVIV *P*=6.8e−4). (**c**) Exon 5 inclusion levels of IDC patients stratified by two-means clustering using exon 5 inclusion levels. Group 1 has 420 patients with average exon-inclusion level at 95%. Group 2 has 262 patients with average exon-inclusion level at 85%. (**d**) *STAT5A* exon 5 inclusion levels in normal breast tissues versus breast cancer tumour samples. Exon-inclusion levels are extracted from 86 TCGA breast cancer patients with matched normal and tumour samples. Normal breast tissues have average exon 5 inclusion level at 95%, compared to 91% average exon-inclusion level in tumour samples. Error bars represent 95% confidence interval of the mean.

**Network of survival-associated alternative splicing events**

…see http://www.nature.com/ncomms/2016/160609/ncomms11548/full/ncomms11548.html

Figure 6: Splicing factor regulatory network of survival-associated alternative splicing events in IDC.

(**a**–**c**) Kaplan–Meier survival plots of IDC patients stratified by the gene expression levels of three splicing factors: *TRA2B* (**a**, Cox regression *P*=1.8e−4), *HNRNPH1* (**b**, *P*=3.4e−4) and *SFRS3* (**c**, *P*=2.8e−3). Black lines represent patients with high gene expression levels. Red lines represent patients with low gene expression levels. (**d**) The exon-inclusion levels of a *DHX30* alternative exon are negatively correlated with *TRA2B* gene expression levels (robust correlation coefficient *r*=−0.26, correlation *P=*1.2e−17). (**e**) The exon-inclusion levels of a *MAP3K4* alternative exon are positively correlated with*HNRNPH1* gene expression levels (robust correlation coefficient *r*=0.16, correlation *P=*2.6e−06). (**f**) A splicing co-expression network of the three splicing factors and their correlated survival-associated alternative exons. In total, 84 survival-associated alternative exons are significantly correlated with the three splicing factors. The positive/negative correlation between splicing factors and alternative exons is represented by blue/red lines, respectively. Exons whose inclusion levels are positively/negatively correlated with survival times are represented by blue/red dots, respectively. The size of the splicing factor circles is proportional to the number of correlated exons within the network.

…..

**Alternative splicing predictors of cancer patient survival**

see http://www.nature.com/ncomms/2016/160609/ncomms11548/full/ncomms11548.html

Figure 7: Cross-validation of different classes of IDC survival predictors measured by the C-index

A C-index of 1 indicates perfect prediction accuracy and a C-index of 0.5 indicates random guess. The plots indicate the distribution of C-indexes from 100 rounds of cross-validation. The centre value of the box plot is the median C-index from 100 rounds of cross-validation. The notch represents the 95%confidence interval of the median. The box represents the 25 and 75% quantiles. The whiskers extended out from the box represent the 5 and 95% quantiles. Two-sided Wilcoxon test was used to compare different survival predictors. The different classes of predictors are: (a) clinical information (median C-index 0.67). (b) Gene expression (median C-index 0.68). (c) Alternative splicing (median C-index 0.71). (d) Clinical information+gene expression (median C-index 0.69). (e) Clinical information+alternative splicing (median C-index 0.73). (f) Clinical information+gene expression+alternative splicing (median C-index 0.74). Note that ‘Gene’ refers to ‘Gene-level expression’ in these plots.

Next, we carried out the SURVIV analysis in five additional cancer types in TCGA, including GBM (glioblastoma multiforme), KIRC (kidney renal clear cell carcinoma), LGG (lower grade glioma), LUSC (lung squamous cell carcinoma) and OV (ovarian serous cystadenocarcinoma). As expected, the number of significant events at different FDR or *P*-value significance cutoffs varied across cancer types, with LGG having the strongest survival-associated alternative splicing signals with 660 significant exon-skipping events at FDR≤5% (Supplementary Data 3 and 4). Strikingly, regardless of the number of significant events, alternative splicing-based survival predictors outperformed gene expression-based survival predictors across all cancer types (Supplementary Fig. 3), consistent with our initial observation on the IDC data set.

Alternative processing and modification of mRNA, such as alternative splicing, allow cells to generate a large number of mRNA and protein isoforms with diverse regulatory and functional properties. The plasticity of alternative splicing is often exploited by cancer cells to produce isoform switches that promote cancer cell survival, proliferation and metastasis^{7, 8}. The widespread use of RNA-seq in cancer transcriptome studies^{15, 47, 48} has provided the opportunity to comprehensively elucidate the landscape of alternative splicing in cancer tissues. While existing studies of alternative splicing in large-scale cancer transcriptome data largely focused on the comparison of splicing patterns between cancer and normal tissues or between different subtypes of cancer^{18, 21, 49}, additional computational tools are needed to characterize the clinical relevance of alternative splicing using massive RNA-seq data sets, including the association of alternative splicing with phenotypes and patient outcomes.

We have developed SURVIV, a novel statistical model for survival analysis of alternative isoform variation using cancer RNA-seq data. SURVIV uses a survival measurement error model to simultaneously model the estimation uncertainty of mRNA isoform ratio in individual patients and the association of mRNA isoform ratio with survival time across patients. Compared with the conventional Cox regression model that uses each patient’s mRNA isoform ratio as a point estimate, SURVIV achieves a higher accuracy as indicated by simulation studies under a variety of settings. Of note, we observed a particularly marked improvement of SURVIV over Cox regression for low- and moderate-depth RNA-seq data (Fig. 2b). This has important practical value because many clinical RNA-seq data sets have large sample size but relatively modest sequencing depth.

Using the TCGA IDC breast cancer RNA-seq data of 682 patients, SURVIV identified 229 alternative splicing events associated with patient survival time, which met the criteria of SURVIV*P*-values≤0.01 in multiple clinical subgroups. While the statistical threshold seemed loose, several lines of evidence suggest the functional and clinical relevance of these survival-associated alternative splicing events. These alternative splicing events were frequently identified and enriched in the gene functional groups important for cancer development and progression, including apoptosis, DNA damage response and oxidative stress. While some of these events may simply reflect correlation but not causal effect on cancer patient survival, other events may play an active role in regulating cancer cell phenotypes. For example, a survival-associated alternative splicing event involving exon 5 of *STAT5A* is known to regulate the activity of this transcription factor with important roles in epithelial cell growth and apoptosis^{37}. Using a co-expression network analysis of splicing factor to exon correlation across all patients, we identified three splicing factors (TRA2B, HNRNPH1 and SFRS3) as potential hubs of the survival-associated alternative splicing network of IDC. The expression levels of all three splicing factors were negatively associated with patient survival times (Fig. 6a–c), and both TRA2B and HNRNPH1 were previously reported to have an impact on cancer-related molecular pathways^{40, 41, 42, 43, 44, 45}. Finally, despite the limited power in detecting individual events, we show that the survival-associated alternative splicing events can be used to construct a predictor for patient survival, with an accuracy higher than predictors based on clinical parameters or gene expression profiles (Fig. 7). This further demonstrates the potential biological relevance and clinical utility of the identified alternative splicing events.

We performed cross-validation analyses to evaluate and compare the prognostic value of alternative splicing, gene expression and clinical information for predicting patient survival, either independently or in combination. As expected, the combined use of all three types of information led to the best prediction accuracy. Because we used penalized regression to build the prediction model, combining information from multiple layers of data did not necessarily increase the number of predictors in the model. The perhaps more surprising and intriguing result is that alternative splicing-based predictors appear to outperform gene expression-based predictors when used alone and when either type of data was combined with clinical information (Fig. 7). We observed the same trend in five additional cancer types (Supplementary Fig. 3). We note that this finding was consistent with a previous report that cancer subtype classification based on splicing isoform expression performed better than gene expression-based classification^{25}. While this trend seems counterintuitive because accurate estimation of gene expression requires much lower RNA-seq depth than accurate estimation of alternative splicing^{29}, one possible explanation may be the inherent characteristic of isoform ratio data. By definition, mRNA isoform ratio is estimated as the ratio of multiple mRNA isoforms from a single gene. Therefore, mRNA isoform ratio data have a ‘built-in’ internal control that could be more robust against certain artefacts and confounding issues that influence gene expression estimates across large clinical RNA-seq data sets, such as poor sample quality and RNA degradation^{12}. Regardless of the reasons, our data call for further studies to fully explore the utility of mRNA isoform ratio data for various clinical research applications.

The SURVIV source code is available for download at https://github.com/Xinglab/SURVIV. SURVIV is a general statistical model for survival analysis of mRNA isoform ratio using RNA-seq data. The current statistical framework of SURVIV is applicable to RNA-seq based count data for all basic types of alternative splicing patterns involving two isoform choices from an alternatively spliced region, such as exon-skipping, alternative 5′ splice sites, alternative 3′ splice sites, mutually exclusive exons and retained introns, as well as other forms of alternative isoform variation such as RNA editing. With the rapid accumulation of clinical RNA-seq data sets, SURVIV will be a useful tool for elucidating the clinical relevance and potential functional significance of alternative isoform variation in cancer and other diseases.