^{1}

^{2}

^{2}

^{1}

^{2}

^{2}

^{1}

^{2}

^{2}

^{*}

^{1}

^{2}

Edited by: Shrikant S. Mantri, National Agri-Food Biotechnology Institute, India

Reviewed by: Shihao Shen, University of California, Los Angeles, United States; Zeeshan Ahmed, University of Connecticut School of Medicine, United States

*Correspondence: Elie Maza

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

RNA-Seq is a widely used technology that allows an efficient genome-wide quantification of gene expressions for, for example, differential expression (DE) analysis. After a brief review of the main issues, methods and tools related to the DE analysis of RNA-Seq data, this article focuses on the impact of both the replicate number and library size in such analyses. While the main drawback of previous relevant studies is the lack of generality, we conducted both an analysis of a two-condition experiment (with eight biological replicates per condition) to compare the results with previous benchmark studies, and a meta-analysis of 17 experiments with up to 18 biological conditions, eight biological replicates and 100 million (M) reads per sample. As a global trend, we concluded that the replicate number has a larger impact than the library size on the power of the DE analysis, except for low-expressed genes, for which both parameters seem to have the same impact. Our study also provides new insights for practitioners aiming to enhance their experimental designs. For instance, by analyzing both the sensitivity and specificity of the DE analysis, we showed that the optimal threshold to control the false discovery rate (FDR) is approximately 2^{−r}, where r is the replicate number. Furthermore, we showed that the false positive rate (FPR) is rather well controlled by all three studied R packages:

Since its first results were published, RNA-Seq technology has been widely perceived as a revolutionary tool for transcriptomics (Wang Z. et al.,

The problem of finding the best model to fit RNA-Seq data has been tackled recently by Gierlinski et al. (

When the RNA-Seq technology was first introduced, Wang H. et al. (

To our knowledge, only a few recent articles have aimed to exclusively and deeply analyze the impact of the replicate number and library size (or depth) on a DE analysis. Three studies conclude that increasing the number of biological replicates is globally a more efficient strategy than increasing the library sizes, in order to enhance the power and the false discovery rate (FDR) of a DE analysis (Ching et al.,

With the rise of the RNA-Seq technology, many methods and tools have appeared for DE analysis (Table

Information on 29 R packages, methods, or pipelines, for DE analysis of RNA-Seq data: the number of citations of the article introducing the method (until October 2017, extracted from All Databases of Web of Science^{a}

Robinson et al., |
430 | 982 | 1,854 | 3,040 | 4,450 | 22.00 | Negative binomial | TMM | No | |

Trapnell et al., |
861 | 1,648 | 2,446 | 3,300 | 4,283 | 21.17 | Poisson | FPKM (geometric) | Yes | |

Anders and Huber, |
607 | 1,395 | 2,299 | 3,167 | 4,157 | 20.55 | Negative binomial | RLE | No | |

Love et al., |
83 | 282 | 1,899 | 9.39 | Negative binomial | RLE | Yes | |||

Ritchie et al., |
1,276 | 6.31 | Gaussian | vst or QN | Yes | |||||

Trapnell et al., |
421 | 699 | 950 | 4.70 | Beta negative binomial | Geometric | No | |||

Wang et al., |
178 | 297 | 458 | 636 | 850 | 4.20 | Poisson | Total count | No | |

Law et al., |
493 | 2.44 | Gaussian | log-CPM | Yes | |||||

Tarazona et al., |
65 | 164 | 263 | 377 | 473 | 2.34 | Non-parametric | CPM | No | |

Hardcastle and Kelly, |
72 | 109 | 178 | 232 | 302 | 1.49 | Negative binomial | Total count | Yes | |

Leng et al., |
5 | 31 | 93 | 170 | 270 | 1.33 | Negative binomial | RLE | Yes | |

Langmead et al., |
57 | 88 | 112 | 117 | 149 | 0.74 | Poisson or Gaussian | 3rd quartile | No | |

Li and Tibshirani, |
0 | 22 | 52 | 91 | 129 | 0.64 | Non-parametric | Trimmed total count | No | |

Feng et al., |
41 | 73 | 93 | 0.46 | Hierarchical Poisson | RLE | Yes | |||

Li et al., |
4 | 19 | 43 | 63 | 88 | 0.44 | Poisson | Trimmed total count | No | |

Wu et al., |
31 | 44 | 61 | 0.30 | Gamma-Poisson | 3rd quartile | Yes | |||

Zhou et al., |
15 | 21 | 30 | 40 | 50 | 0.25 | Beta-binomial | Total count | No | |

Lund et al., |
42 | 0.21 | Negative binomial | 3rd quartile | No | |||||

Auer and Doerge, |
8 | 12 | 16 | 26 | 39 | 0.19 | Two-stage Poisson | Total count | No | |

Wiel et al., |
5 | 14 | 18 | 28 | 33 | 0.16 | Zero-inflated negative binomial | None | Yes | |

Cumbie et al., |
9 | 14 | 20 | 26 | 30 | 0.15 | Negative binomial | Total count | No | |

Di et al., |
11 | 14 | 21 | 23 | 28 | 0.14 | Negative binomial | Total count | No | |

Yu et al., |
27 | 0.13 | Negative binomial | RLE | No | |||||

Burden et al., |
15 | 0.07 | Negative binomial | RLE | No | |||||

Bi and Davuluri, |
0 | 4 | 11 | 12 | 14 | 0.07 | Gamma-Poisson | TMM | Yes | |

Lee et al., |
4 | 5 | 8 | 8 | 10 | 0.05 | Binomial (position-level) | Total count | Yes | |

Lin et al., |
6 | 0.03 | Non-parametric | Trimmed total count | No | |||||

Wan and Sun, |
0 | 1 | 2 | 4 | 5 | 0.02 | Negative binomial | RLE | No | |

van de Wiel et al., |
1 | 3 | 5 | 0.02 | Zero-inflated negative binomial | None | Yes |

Table

Finally, the choice of the methods we used in this article for DE analyses was done by looking at considerations above and comparison studies, but also considering that our

In the present article, we aim to study the impact of the replicate number and library size on the DE analysis of an RNA-Seq experiment involving the tomato fruit model (

Tomato plants (^{−2}·s^{−1} light intensity.

The ovaries (including style and stigma) and the developing young fruits were collected as samples. Ovaries were picked on the first day of flower opening (anthesis stage) and set as 0 days post-anthesis (DPA). Developing young fruits were picked 4 days after this natural pollination stage and set as 4 DPA. Sampling procedures were mainly as described in Wang H. et al. (

Total RNA was isolated from 200 and 500 mg, respectively, of ovary and young fruit powders (

A quality check of the raw sequences was made with ^{1}

We randomly down-sampled the reads to generate data sets of 2.5, 5, 7.5, 10, 15, and 20 M reads using the python script ^{2}

Raw counts were generated on each gene by using

All DE analyses of the Number of DE genes (section Number of DE Genes of the TOGE Data) and Power (section Power Analysis of the TOGE Data) were performed with R software (version 3.2.0) and the dedicated

To analyze the impact of the number of replicates and the library size on the DE analysis, we built 45 data sets for each number of replicates among two, three, four, five, six, and seven replicates, and each library size among 2.5, 5, 7.5, 10, 15, and 20 M reads. Each replicate was randomly chosen without replacement among the eight samples for each condition. We then analyzed 36 combinations of replicate number and library size, from the smallest with two replicates and 2.5 M reads to the largest with seven replicates and 20 M reads. Then, for each combination, we had 45 DE gene lists, and we computed the median of the two studied indicators: the number of DE genes and the estimated power. Obviously, for eight replicates and each library size, we only had one data set and then one indicator. For the calculation of the power, we needed a reference list of DE genes. For this purpose, we chose the DE genes that were found with all available information (i.e., with eight replicates and 20 M reads) and with a very stringent adjusted

Moreover, to calculate the stability of each indicator, we retained, for each combination of replicate number and library size, the DE genes that were common to all 45 data sets. We then calculated both indicators for this new list of DE genes.

Finally, to analyze the impact of the gene expression level on the studied indicators, the gene set was divided into three parts: genes with low counts, genes with medium counts and genes with high counts, i.e., those with a logCPM (counts per million reads) less than the first quartile, between the first and the third quartile, and higher than the third quartile, respectively. Both indicators were then calculated and presented for both low and high expression levels.

We performed an enrichment analysis with the ^{3}

For a given DE analysis method, the sensitivity (or true positive rate, TPR) and the specificity (or true negative rate, TNR) are defined as follows. The TPR is the number of significantly DE genes that are true DE genes, divided by the total number of true DE genes. The TNR is the number of non-significant DE genes that are true non-DE genes, divided by the total number of true non-DE genes. Moreover, we have that specificity = TNR = 1 – FPR (false positive rate). The FPR is then equal to the number of significantly DE genes that are true non-DE genes, divided by the total number of true non-DE genes.

The four DE analysis methods studied here were carried out using the R software environment (version 3.1.3) (R Core Team,

As described above, the calculation of TPR, TNR, and FPR values requires the knowledge of the list of all true DE genes between our two biological conditions, which is obviously not the case in practice. In order to estimate these true DE genes, we performed a prior DE analysis for each method with the whole data set, i.e., eight replicates per condition and all available reads. Moreover, for this prior analysis, we chose a stringent threshold equal to 0.001 to control the FDR (Benjamini and Hochberg,

A DE meta-analysis was performed for all the biological conditions of the

For a given condition, simulated replicates were carried out by a convex linear combination of existing replicates with uniform random coefficients. For this purpose, we used conditions that had two or more replicates. Then, for each simulated replicate, raw counts were randomly carried out with a multinomial distribution with probabilities given by the true observed probabilities of genes, and with library sizes of 5, 10, 15, 20, and 25 M reads. These calculations aim at simulating pseudo-replicates that have almost the same characteristics (means and variances) as the true ones.

The number of significantly DE genes obtained between conditions 0 DPA (flower before pollination) and 4 DPA (flower after pollination) is shown in Figure

Number of DE genes depending on the depth

All the observed curves in Figures

To determine whether the library size or the replicate number has a relatively higher impact on the number of DE genes, we needed to compare combinations of these two parameters that shared the same total amount of reads. This comparison is shown in Figure

The stability of the number of DE genes represented in Figures

Comparing the effects of library size and replicate number on stability (by looking as above at symbols with a black border in Figure

Figure

Power of the DE analyses depending on the depth

Clearly, in the same way as for the number of DE genes discussed in the previous section, both the power and its stability increased with both the library size and the replicate number. Moreover, for all genes, the increase rate diminishes after 10 M reads for all curves of Figure

The large impact of both library size and replicate number on the power for low-expressed genes can be confirmed by looking at black border symbols in Figure

In Figures

Finally, even more than for the number of DE genes discussed above in Figure

Here, we analyze the sensitivity and the specificity of four classical and widely used DE analysis methods: the first one developed in the

Calculations of sensitivity (TPR) and 1−specificity (FPR) depend on the knowledge of the true list of DE genes for the biological conditions in question. This list is obviously not known in practice, and we therefore need to estimate it. In a study of the optimal replicate number, this kind of estimation is classically done by considering that true DE genes can be found with all data information, i.e., all replicates, and a very stringent threshold to control the FDR (see Materials and Methods). We then obtain the following estimated numbers of so-called true DE genes: 15110 with

Figure

An alternative way to estimate the FPR for a given DE analysis method consists of performing the DE analysis between replicates of the same biological condition (Schurch et al.,

In the above section, TPR and FPR were calculated for a fixed value of the threshold controlling the FDR (0.01). We now investigate the impact of this threshold on both TPR and FPR values by calculating them with different threshold values in the interval [0,1]. Figure

ROC curves for 2, 3, 4, 5, 6, and 7 replicates. Each curve is calculated by varying the

More interestingly, we can also see that the optimal threshold values of these ROC curves, i.e., the black-boxed values of the zoomed graph of Figure

Furthermore, almost identical results can be obtained for the other three methods: ^{−r}, where r is the number of replicates: 0.25 for two replicates, 0.12 for three replicates, 0.06 for four replicates, and so on, and finally 0.007 for seven replicates (see Figure

Figure

ROC curves for all four studied methods,

To assess the impact of both the library size and the replicate number on the detection of GO BP categories, we conducted a GO enrichment analysis at each different combination of depth and replicate number using the

Number of true and false positive BP categories from GO analyses (y-axis) depending on the replicate number (x-axis). The eighth bar corresponding to 8 replicates has been chosen as a reference.

A DE meta-analysis has been performed with all the ^{4}

For the first descriptive analysis and for each project, we performed all possible DE analyses of all pairwise biological conditions. We then obtained a total of 604 pairwise comparisons. For each DE analysis, we extracted the following characteristics: the number of DE genes, the rounded mean number of biological replicates per biological condition, the mean library size per biological replicate, the mean absolute distance between two biological condition means, and the mean of all gene variances in both biological conditions. Figure

Number of DE genes for each pair of conditions (y-axis) depending on replicate number per condition (x-axis) are represented (by mean of boxplots) for given levels of distances between conditions and of variances of these conditions (low, medium, and high levels). Blue triangles, orange squares, and red circles represent the median numbers of DE genes for respectively low, medium and high library sizes.

In a second analysis, we performed DE analyses of all pairs of biological conditions, no matter which project they came from. Moreover, for each biological condition, we simulated between two and 21 replicates with library sizes of 5, 10, 15, 20, and 25 M reads (we repeated each simulation three times). The DE analysis was then performed to extract the number of DE genes with a threshold of 0.05 to control the FDR (see Materials and Methods). We finally obtained 5565 pairs of biological conditions × 20 different numbers of replicates × 5 different sizes of libraries × 3 repetitions = 1,752,975 pairwise DE analyses. Boxplots of the number of DE genes are shown in Figure

Number of DE genes for each pair of conditions (y-axis) depending on the replicate number per condition (x-axis) for five library sizes (5, 10, 15, 20, and 25 million reads for, respectively, blue, cyan, green, orange, and red boxplots).

In the present work, we have conducted a thorough analysis of the impact of both replicate number and library size on an RNA-Seq DE analysis. In this discussion, we will compare our results to those obtained by Ching et al. (

As did the three benchmark articles, our study concludes that an increase in the replicate number or the library size increases the number of significantly DE genes and the power. However, Liu et al. (

All three reference studies and ours show that the curves of number of DE genes and power depending on the library size or on the replicate number reach a plateau after a given value. Nevertheless, it appears that this value is different from one study to another, from one data set to another, between 5 and 20 M reads, and between three and 25 replicates. This result shows, as is emphasized by Liu et al. (

As a novelty, we have introduced the notion of stability of the number of DE genes and power. These two indicators are defined, respectively, as the number of DE genes and power calculated with the common list of DE genes obtained with all simulated samples with given parameters. The stability is then a better biological indicator for the number of DE genes or power. From our results, we can observe very little stability of the power for low-expressed genes, which shows that the list of DE genes is highly related to the used samples. For example, with three replicates and 15 M reads, we have a power of about 85% and a stability of the power of about 25%. For stability indicators of both the number of DE genes and the power, we showed that the increase of the replicate number has a higher impact than the increase of the library size for all gene expression levels. This impact is much higher than for the number of DE genes and the power.

We also estimated the FPR, i.e., the probability of falsely declaring a gene as DE, depending on the replicate number, with replicates of both conditions of the TOGE data (as for the power estimation) and with only biological replicates of a given condition (i.e., with, theoretically, no DE genes). All estimations were carried out with a threshold equal to 0.01 to control the FDR. For the former estimation, we pointed out the increase in the FPR with the replicate number, from about 1% with two replicates to 6% with seven replicates. To our knowledge, this drawback linked to the increase in the replicate number has not been underlined before in the literature. On the other hand, the results of the latter estimation show that the FPR is rather well controlled by the four studied methods (

Another striking result that has not been shown yet in the literature, to our knowledge, is the impact of the threshold controlling the FDR on both the TPR and FPR. Indeed, by means of ROC curves depending on the threshold, we have shown that the optimal value for this threshold is almost equal to 2^{−r}, where r is the replicate number. For instance, the optimal threshold is almost equal to 0.25 for two replicates, 0.12 for three replicates, 0.06 for four replicates, and so on. Obviously, as discussed before, this result has only been shown for our TOGE data, but the trend should still remain for other similar data sets. This result was shown for all four DE analysis methods studied. Moreover, we showed that for more than five replicates, the four methods give almost the same results, but, for fewer than five replicates,

We also performed a GO enrichment analysis depending on both the replicate number and library size. Such an analysis gives meaningful biological results in the sense that the measure is directly linked with the underlying biological processes. This analysis showed that the number of enriched categories (both true and false positive categories) increases significantly depending on the replicate number. On the contrary, the increase of depth does not significantly increase the number of enriched categories, but tends to decrease the rate of false positives. This new result is in adequacy with the trade-off between replicate number and library size discussed above for the number of DE genes and power.

As described above, a meaningful result of the present article comes from the meta-analysis that we made with all 17 projects on the tomato fruit extracted from the

As illustrated by the results above, we cannot

Beyond the cost-effectiveness metric to guide the design of large scale RNA-Seq DE studies proposed by Liu et al. (

The results and discussion above will help RNA-Seq practitioners to better understand the impact of both replicate number and library size on a DE analysis, and also the impact of between-condition dispersion, which will help them to better design their experiments. For instance, we learned that choosing a threshold for FDR around 2^{−r} (with r the replicate number) should be optimal to enhance the sensitivity and specificity of the DE analysis. Moreover, for the RNA-Seq practitioners of the tomato community, the meta-analysis carried out in this study shows that at least four replicates and 20 M reads would be required to be almost sure of obtaining about 1000 DE genes, no matter which biological conditions they are interested in.

Ching et al. (

EM and SL planed the research and performed statistical analyses; EM, MB, MZ, and PF designed the experiment of the TOGE project; DL and MZ performed the bioinformatics analyses of the TOGE project; ES performed statistical analyses; GH and PF performed the experiment of the TOGE project; VLB-A and MB advised on the analyses and interpretations of the results; and EM wrote the manuscript with contributions from all authors. All authors revised the work and the manuscript critically. All authors approved the final manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors would like to thank the reviewers for their valuable comments and suggestions aiming to improve the quality of the paper. We would like to thank also Anne Alibert, English teacher at INP-ENSAT, for carefully reading the manuscript and performing necessary spelling and grammar corrections.

The raw sequences of the TOGE project supporting the conclusions of this article are available from the European Nucleotide Archive^{5}

The Supplementary Material for this article can be found online at:

Venn diagram of the number of genes declared as true DE with all four methods:

First species risk curves depending on the depth

Estimations of FPR for the four studied DE analysis methods. A zoomed version of Figure

ROC curves for the

ROC curves for the

ROC curves for the

Optimal values of the FDR parameter for the four studied methods, _{2} optimal values (dependent variable) depending on the number of replicates (explanatory variable).

^{1}

^{2}

^{3}

^{4}

^{5}