RNA-seq: from counts to differentially expressed genes: Key Points

Introduction

RNA-seq measures gene expression across the transcriptome, producing count data per gene per sample.
The main goal of differential expression (DE) analysis is to identify genes whose expression differs between conditions.
Biological replicates are required for valid statistical inference; technical replicates should be combined (e.g. by summing their counts) or otherwise handled carefully, not treated as independent samples.
A “counts table” is the starting point for most RNA-seq analyses in R.
Preparing a DGEList object and sample metadata correctly is essential for downstream analyses with limma and edgeR.
Understanding the data structure and design factors early helps ensure trustworthy DE results later.
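
As a rough illustration of the points above about data structure, the sketch below checks that a counts table and its sample metadata agree before any modelling starts. It is written in plain Python to stay self-contained and is not how edgeR builds a DGEList; the gene names, sample names, and the helper `check_counts_table` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy counts table: one row per gene, one column per sample.
counts = {
    "geneA": {"ctrl_1": 120, "ctrl_2": 98, "treat_1": 310, "treat_2": 295},
    "geneB": {"ctrl_1": 0, "ctrl_2": 3, "treat_1": 1, "treat_2": 0},
}

# Hypothetical sample metadata: the condition each sample belongs to.
samples = {
    "ctrl_1": "control",
    "ctrl_2": "control",
    "treat_1": "treated",
    "treat_2": "treated",
}


def check_counts_table(counts, samples):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    expected = set(samples)
    for gene, row in counts.items():
        # Every gene row must cover exactly the samples described in the metadata.
        if set(row) != expected:
            problems.append(f"{gene}: sample names do not match the metadata")
        if any(not isinstance(v, int) or v < 0 for v in row.values()):
            problems.append(f"{gene}: counts must be non-negative integers")
    # Each condition needs at least two biological replicates for inference.
    for group, n in Counter(samples.values()).items():
        if n < 2:
            problems.append(f"{group}: only {n} sample, no replication")
    return problems


print(check_counts_table(counts, samples))
```

Running a check of this kind first makes mismatched sample names or unreplicated groups visible before they can silently distort a design matrix.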

Filtering lowly expressed genes improves signal-to-noise and reliability of DE results.
Raw counts scale with library size, so they cannot be compared directly across samples; larger libraries generally provide more statistical power.
Visualisation (PCA, boxplots, RLE) is essential to detect technical variation and assess data quality.
Log transformation of counts stabilises variance and makes visual patterns easier to interpret.
Normalisation corrects for library size and compositional biases, allowing fair comparison across samples.
TMM (trimmed mean of M-values) is a common method for normalising RNA-seq data using edgeR.
Good normalisation removes unwanted technical variation while preserving biological signal.
Persistent technical effects (e.g. batch effects) require more advanced correction strategies beyond basic normalisation.
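
To make the filtering and library-size points above concrete, here is a minimal sketch in plain Python. It covers only counts per million (CPM), a log2 transform with a small offset, and a simple "CPM above a cutoff in enough samples" filter. It does not implement TMM, which in edgeR would additionally rescale each library size by a normalisation factor. The counts, library sizes, cutoffs, and function names are hypothetical.

```python
import math

# Hypothetical counts for three genes across four samples.
counts = {
    "geneA": [500, 1000, 520, 980],
    "geneB": [0, 1, 0, 1],
    "geneC": [250, 480, 1200, 2500],
}

# Hypothetical library sizes (total reads per sample); the three genes
# above are assumed to be a small excerpt of each library.
lib_sizes = [1_000_000, 2_000_000, 1_000_000, 2_000_000]


def cpm(row, lib_sizes):
    """Counts per million: scale each count by its sample's library size."""
    return [c / n * 1e6 for c, n in zip(row, lib_sizes)]


def log2_cpm(row, lib_sizes):
    """log2-CPM with a small offset so that zero counts stay finite."""
    return [math.log2((c + 0.5) / (n + 1) * 1e6) for c, n in zip(row, lib_sizes)]


def filter_low_expression(counts, lib_sizes, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM reaches min_cpm in at least min_samples samples."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(v >= min_cpm for v in cpm(row, lib_sizes)) >= min_samples
    }


kept = filter_low_expression(counts, lib_sizes)
for gene, row in kept.items():
    print(gene, [round(v, 2) for v in log2_cpm(row, lib_sizes)])
```

In this toy table geneA has raw counts that differ two-fold between samples yet nearly identical CPM values, which is the sense in which library-size scaling allows fair comparison; geneB never reaches the cutoff and is dropped.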

RNA-seq counts can be modelled with linear models after transformation to the log scale (e.g. log-CPM).
The limma workflow estimates group means and contrasts, then tests for DE.
Empirical Bayes moderation stabilises variances, improving reliability.
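Adjusted p-values control the false discovery rate (FDR) across the many genes tested.
limma-trend and limma-voom give similar results unless library sizes differ greatly (the limma User's Guide suggests limma-trend works well when the largest library is no more than about 3-fold the smallest).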
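
The adjusted p-values mentioned above can be sketched as follows, assuming the Benjamini-Hochberg procedure, which is the default adjustment method in limma's topTable. This is an illustrative plain-Python version rather than limma's own code, and the p-values and the function name `bh_adjust` are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in bh_adjust(raw)])
```

Calling a gene DE when its adjusted p-value is below 0.05 then aims to keep the expected proportion of false discoveries among the called genes at or below 5%.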

The full DE workflow combines filtering, normalisation, modelling, and testing.
limma-trend and limma-voom differ mainly in handling library size variability.
Visualisations such as volcano plots, MD plots, and heatmaps summarise DE results clearly.
Public repositories like GEO and GREIN provide accessible RNA-seq count data.
A reproducible code structure helps keep RNA-seq analyses robust and transparent.
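
As a small example of how DE results are typically summarised for a volcano plot, the sketch below labels each gene as up, down, or not significant from its log fold change and adjusted p-value. The cutoffs (an absolute log2 fold change of at least 1 and an adjusted p-value below 0.05) are common conventions assumed here, not values taken from this lesson, and the result values and the function name `classify_gene` are hypothetical.

```python
from collections import Counter


def classify_gene(log_fc, adj_p, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Label a gene by direction of change, given both cutoffs are met."""
    if adj_p < fdr_cutoff and log_fc >= lfc_cutoff:
        return "up"
    if adj_p < fdr_cutoff and log_fc <= -lfc_cutoff:
        return "down"
    return "not significant"


# Hypothetical results: gene -> (log2 fold change, adjusted p-value).
results = {
    "geneA": (2.3, 0.001),
    "geneB": (-1.8, 0.02),
    "geneC": (0.4, 0.01),   # significant but small effect
    "geneD": (3.0, 0.20),   # large effect but not significant
}

labels = {gene: classify_gene(*values) for gene, values in results.items()}
summary = Counter(labels.values())
print(labels)
print(summary)
```

These labels correspond to the colours usually applied in a volcano or MD plot, and the per-label tally gives a quick numerical summary of the same result.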