Learn Normalization and Differential Expression | Gene Expression Analysis and Reproducible Pipelines


Normalization is a crucial step in gene expression analysis because raw RNA-seq count data can be influenced by technical factors unrelated to the biological differences you want to study. One major factor is differences in sequencing depth: some samples might have more total reads simply due to how the sequencing experiment was performed. Without normalization, comparing gene expression levels across samples would be misleading, as genes in samples with higher sequencing depth would appear artificially more abundant.

To address this, normalization methods adjust the raw counts to account for these differences. Common normalization approaches include:

Counts Per Million (CPM): rescales each sample's counts so that they sum to one million, which corrects for sequencing depth but not for gene length;
Transcripts Per Million (TPM): divides counts by gene length first and then rescales each sample to sum to one million, which makes it useful for comparing the expression of different genes within a sample;
Reads Per Kilobase Million (RPKM): normalizes for both sequencing depth and gene length and is commonly used with single-end RNA-seq data (FPKM is the paired-end equivalent).


import pandas as pd

# Example raw count data for three genes in two samples
counts = pd.DataFrame({
    "Gene": ["GeneA", "GeneB", "GeneC"],
    "Sample1": [500, 300, 200],
    "Sample2": [800, 400, 100]
})

# Calculate library size (total counts per sample)
library_sizes = counts[["Sample1", "Sample2"]].sum()

# Calculate CPM for each gene in each sample
cpm = counts[["Sample1", "Sample2"]].div(library_sizes) * 1e6
cpm.insert(0, "Gene", counts["Gene"])

print(cpm)
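
The two length-aware measures from the list above can be sketched in the same style. This is a minimal illustration rather than a reference implementation: the gene lengths in `gene_length_bp` are hypothetical values invented for the example, and the counts simply mirror the CPM example.

```python
import pandas as pd

# Raw counts mirroring the CPM example, indexed by gene
counts = pd.DataFrame(
    {"Sample1": [500, 300, 200], "Sample2": [800, 400, 100]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Hypothetical gene lengths in base pairs (assumed for illustration only)
gene_length_bp = pd.Series([2000, 1000, 500], index=counts.index)
length_kb = gene_length_bp / 1000

# RPKM: divide by library size in millions, then by gene length in kilobases
library_millions = counts.sum() / 1e6
rpkm = counts.div(library_millions, axis=1).div(length_kb, axis=0)

# TPM: divide by gene length first, then rescale each sample to sum to one million
reads_per_kb = counts.div(length_kb, axis=0)
tpm = reads_per_kb.div(reads_per_kb.sum(), axis=1) * 1e6

print(rpkm.round(1))
print(tpm.round(1))
```

Because TPM rescales after the length correction, every sample's TPM values sum to one million, which is not generally true for RPKM.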

Once counts are normalized, you can begin to explore biological questions such as which genes are expressed differently between experimental conditions. This is the essence of differential expression analysis. The goal is to identify genes whose expression levels change significantly between groups, such as treated versus control samples.

A basic way to detect differential expression is to compare the normalized expression values between two conditions and calculate the fold change. Fold change quantifies how much a gene's expression increases or decreases between conditions.


import pandas as pd

# Example: calculate fold change between two conditions for three genes
# Assume CPM-normalized values for one control and one treatment sample
cpm_values = pd.DataFrame({
    "Gene": ["GeneA", "GeneB", "GeneC"],
    "Control": [1000, 500, 250],
    "Treatment": [2000, 250, 500]
})

# Calculate fold change (Treatment / Control)
cpm_values["FoldChange"] = cpm_values["Treatment"] / cpm_values["Control"]

print(cpm_values[["Gene", "FoldChange"]])
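
Fold changes are often reported on a log2 scale, so that a doubling (+1) and a halving (-1) are symmetric around zero. The sketch below reuses the values from the fold change example; the `pseudocount` of 1 is an assumed choice that keeps genes with zero expression from causing a division by zero, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Same CPM-normalized values as in the fold change example
cpm_values = pd.DataFrame({
    "Gene": ["GeneA", "GeneB", "GeneC"],
    "Control": [1000, 500, 250],
    "Treatment": [2000, 250, 500],
})

# Assumed pseudocount; avoids dividing by zero for unexpressed genes
pseudocount = 1

# log2 fold change: positive means higher in Treatment, negative means lower
cpm_values["Log2FC"] = np.log2(
    (cpm_values["Treatment"] + pseudocount) / (cpm_values["Control"] + pseudocount)
)

print(cpm_values[["Gene", "Log2FC"]])
```

With these values, GeneA and GeneC come out close to +1 and GeneB close to -1; the pseudocount shifts them slightly away from the exact values.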

While simple fold change calculations can highlight large differences in expression, they do not account for biological variability or statistical significance. In real analyses, you need statistical testing to determine whether observed changes are likely due to true biological effects or just random noise. Methods such as t-tests, or more advanced models built for RNA-seq data, help you identify differentially expressed genes with greater confidence.
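
As a rough illustration of the t-test approach, the sketch below runs Welch's t-test per gene with `scipy.stats.ttest_ind`. The replicate values are hypothetical log2-scale expression values invented for this example. A real analysis would also correct the p-values for multiple testing and would usually rely on a model designed for count data.

```python
import pandas as pd
from scipy import stats

# Hypothetical log2-scale expression values: three replicates per condition
log_expr = pd.DataFrame(
    {
        "Control_1": [9.9, 8.9, 8.0],
        "Control_2": [10.1, 9.0, 7.9],
        "Control_3": [10.0, 9.1, 8.1],
        "Treatment_1": [11.0, 9.2, 8.9],
        "Treatment_2": [10.9, 8.8, 9.1],
        "Treatment_3": [11.1, 9.0, 9.0],
    },
    index=["GeneA", "GeneB", "GeneC"],
)

control = log_expr.filter(like="Control")
treatment = log_expr.filter(like="Treatment")

# Welch's t-test for each gene (row); equal_var=False drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(
    treatment.to_numpy(), control.to_numpy(), axis=1, equal_var=False
)

results = pd.DataFrame(
    {
        "Log2FC": treatment.mean(axis=1) - control.mean(axis=1),
        "t": t_stat,
        "p_value": p_value,
    },
    index=log_expr.index,
)

print(results)
```

Here GeneA and GeneC differ by about one log2 unit with little spread between replicates, so their p-values are small, while GeneB has the same mean in both conditions and a p-value near 1.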

Study More

To perform robust differential expression analysis, explore specialized tools such as DESeq2 and edgeR. These packages implement statistical models designed for RNA-seq data and are widely used in the bioinformatics community.

1. What is the purpose of normalizing RNA-seq count data?

2. How is fold change calculated in gene expression analysis?


Section 3. Chapter 3
