TY - GEN
T1 - Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework
AU - LaRoche, Dominic
AU - Billheimer, Dean
AU - Michels, Kurt
AU - LaFleur, Bonnie
N1 - Publisher Copyright: © 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
AB - The rapid rise in the use of RNA sequencing technology (RNA-Seq) for scientific discovery has led to its consideration as a clinical diagnostic tool. However, because RNA-Seq is a new technology, its analytical accuracy and reproducibility must be established before it can realize its full clinical utility (SEQC/MAQC-III Consortium, 2014; VanKeuren-Jensen et al. 2014). We respond to the need for reliable diagnostics, quality control metrics, and improved reproducibility of RNA-Seq data by recognizing and capitalizing on the relative frequency nature of RNA-Seq data. Problems with sample quality, library preparation, or sequencing may result in a low number of reads allocated to a given sample within a sequencing run. We propose a method, based on outlier detection of Centered Log-Ratio (CLR) transformed counts, for objectively identifying problematic samples based on the total number of reads allocated to the sample. Normalization and standardization methods for RNA-Seq generally assume that the total number of reads assigned to a sample does not affect the observed relative frequencies of probes within an assay. This assumption, known as Compositional Invariance, is an important property of RNA-Seq data that enables the comparison of samples with differing read depths. Violations of the invariance property can lead to spurious differential expression results, even after normalization. We develop a diagnostic method to identify violations of the Compositional Invariance property. Batch effects arising from differing laboratory conditions or operator differences have been identified as a problem in high-throughput measurement systems (Leek et al. in Genome Biol 15, R29 [14]; Chen et al. in PLoS One 6 [10]). Batch effects are typically identified with a hierarchical clustering (HC) method or principal components analysis (PCA). For both methods, the multivariate distance between the samples is visualized, either in a biplot for PCA or a dendrogram for HC, to check for the existence of clusters of samples related to batch. We show that CLR-transformed RNA-Seq data are appropriate for evaluation in a PCA biplot and improve batch effect detection over current methods. As RNA-Seq makes the transition from the research laboratory to the clinic, there is a need for robust quality control metrics. The realization that RNA-Seq data are compositional opens the door to the existing body of theory and methods developed by Aitchison (The statistical analysis of compositional data, Chapman & Hall Ltd., 1986) and others. We show that the properties of compositional data can be leveraged to develop new metrics and improve existing methods.
KW - Composition
KW - Next generation sequencing
KW - Normalization
KW - Quality control
KW - RNA-Seq
KW - Relative abundance
UR - http://www.scopus.com/inward/record.url?scp=85068176286&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068176286&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-67386-8_21
DO - 10.1007/978-3-319-67386-8_21
M3 - Conference contribution
AN - SCOPUS:85068176286
SN - 9783319673851
T3 - Springer Proceedings in Mathematics and Statistics
SP - 299
EP - 314
BT - Pharmaceutical Statistics - MBSW 39, 2016
A2 - Liu, Ray
A2 - Tsong, Yi
PB - Springer New York LLC
T2 - 39th Annual Midwest Biopharmaceutical Statistics Workshop, MBSW 2016
Y2 - 16 May 2016 through 18 May 2016
ER -