Learn Normalization and Differential Expression | Gene Expression Analysis and Reproducible Pipelines
Normalization and Differential Expression

Normalization is a crucial step in gene expression analysis because raw RNA-seq count data can be influenced by technical factors unrelated to the biological differences you want to study. One major factor is differences in sequencing depth: some samples might have more total reads simply due to how the sequencing experiment was performed. Without normalization, comparing gene expression levels across samples would be misleading, as genes in samples with higher sequencing depth would appear artificially more abundant.

To address this, normalization methods adjust the raw counts to account for these differences. Common normalization approaches include:

  • Counts Per Million (CPM): rescales each sample's counts so that they sum to one million, which lets you compare the same gene across samples sequenced to different depths;
  • Transcripts Per Million (TPM): divides counts by gene length before rescaling each sample to one million, so it also supports comparing the expression of different genes within a sample;
  • Reads Per Kilobase Million (RPKM): normalizes for both sequencing depth and gene length; it is commonly used with single-end RNA-seq data, and FPKM is its counterpart for paired-end data.

    import numpy as np
    import pandas as pd

    # Example raw count data for three genes in two samples
    counts = pd.DataFrame({
        "Gene": ["GeneA", "GeneB", "GeneC"],
        "Sample1": [500, 300, 200],
        "Sample2": [800, 400, 100]
    })

    # Calculate library size (total counts per sample)
    library_sizes = counts[["Sample1", "Sample2"]].sum()

    # Calculate CPM for each gene in each sample
    cpm = counts[["Sample1", "Sample2"]].div(library_sizes) * 1e6
    cpm.insert(0, "Gene", counts["Gene"])

    print(cpm)
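
The CPM example handles sequencing depth only. As a minimal sketch of the two length-aware methods from the list, the snippet below computes RPKM and TPM for the same toy counts. The gene lengths are hypothetical values chosen for illustration; in practice they would come from your annotation.

```python
import pandas as pd

# Same toy counts as the CPM example, indexed by gene name
counts = pd.DataFrame(
    {"Sample1": [500, 300, 200], "Sample2": [800, 400, 100]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Hypothetical gene lengths in kilobases (illustrative values only)
length_kb = pd.Series([2.0, 1.0, 0.5], index=counts.index)

# RPKM: scale by library size (in millions of reads), then by gene length
per_million = counts.sum() / 1e6
rpkm = counts.div(per_million, axis="columns").div(length_kb, axis="index")

# TPM: divide by gene length first, then rescale each sample to one million
rate = counts.div(length_kb, axis="index")
tpm = rate.div(rate.sum(), axis="columns") * 1e6

print(rpkm.round(1))
print(tpm.round(1))
```

Because TPM rescales after the length correction, every TPM column sums to one million, whereas RPKM columns generally do not. That is one reason TPM values are often easier to compare between samples.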

Once counts are normalized, you can begin to explore biological questions such as which genes are expressed differently between experimental conditions. This is the essence of differential expression analysis. The goal is to identify genes whose expression levels change significantly between groups, such as treated versus control samples.

A basic way to detect differential expression is to compare the normalized expression values between two conditions and calculate the fold change. Fold change quantifies how much a gene's expression increases or decreases between conditions.

    import pandas as pd

    # Example: calculate fold change between two conditions for three genes
    # Assume CPM-normalized values for two samples
    cpm_values = pd.DataFrame({
        "Gene": ["GeneA", "GeneB", "GeneC"],
        "Control": [1000, 500, 250],
        "Treatment": [2000, 250, 500]
    })

    # Calculate fold change (Treatment / Control)
    cpm_values["FoldChange"] = cpm_values["Treatment"] / cpm_values["Control"]

    print(cpm_values[["Gene", "FoldChange"]])
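
Raw ratios are asymmetric: a doubling gives 2.0 while a halving gives 0.5. Fold change is therefore often reported on a log2 scale, where the two cases become roughly +1 and -1. The sketch below shows one way to do this; the pseudocount of 1.0 is an assumption added here to avoid dividing by zero for unexpressed genes, not a value taken from the lesson.

```python
import numpy as np
import pandas as pd

# Same CPM-normalized toy values as the fold change example
cpm_values = pd.DataFrame({
    "Gene": ["GeneA", "GeneB", "GeneC"],
    "Control": [1000, 500, 250],
    "Treatment": [2000, 250, 500],
})

# Assumed pseudocount; guards against zero counts in either condition
pseudocount = 1.0

# log2 fold change: positive means higher in Treatment, negative means lower
cpm_values["Log2FC"] = np.log2(
    (cpm_values["Treatment"] + pseudocount)
    / (cpm_values["Control"] + pseudocount)
)

print(cpm_values[["Gene", "Log2FC"]].round(3))
```

On this scale, GeneB and GeneC end up with values of equal magnitude and opposite sign, which makes up- and down-regulation directly comparable.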

While simple fold change calculations can highlight large differences in expression, they do not account for biological variability or statistical significance. In real analyses, you need statistical testing to determine whether observed changes are likely due to true biological effects or just random noise. Methods such as t-tests, or more advanced models built for RNA-seq data, help you identify differentially expressed genes with greater confidence.
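
To illustrate the t-test option, here is a minimal sketch that runs Welch's t-test for each gene using `scipy.stats.ttest_ind`. The replicate values are invented for illustration, and the log2 transform with a pseudocount of 1 is an assumption made here so that the values sit closer to a normal distribution.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical CPM values with three replicates per condition (made-up data)
cpm_reps = pd.DataFrame(
    {
        "Control_1": [1000, 500, 250],
        "Control_2": [1100, 480, 400],
        "Control_3": [950, 520, 150],
        "Treatment_1": [2000, 250, 500],
        "Treatment_2": [2200, 260, 200],
        "Treatment_3": [1900, 240, 350],
    },
    index=["GeneA", "GeneB", "GeneC"],
)

control_cols = [c for c in cpm_reps.columns if c.startswith("Control")]
treatment_cols = [c for c in cpm_reps.columns if c.startswith("Treatment")]

# Log-transform (assumed pseudocount of 1) before testing
log_cpm = np.log2(cpm_reps + 1)

# Welch's t-test per gene: axis=1 compares replicates within each row
t_stat, p_value = stats.ttest_ind(
    log_cpm[treatment_cols].to_numpy(),
    log_cpm[control_cols].to_numpy(),
    axis=1,
    equal_var=False,
)

results = pd.DataFrame({"t": t_stat, "p_value": p_value}, index=cpm_reps.index)
print(results)
```

In this toy data, GeneC shows a fold change of 2 in a single pair of samples, but its replicates vary so much that the test does not support a real difference. That is the kind of case a fold change alone would misreport.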

Note
Study More

To perform robust differential expression analysis, explore specialized tools such as DESeq2 and edgeR. These R/Bioconductor packages implement negative binomial models designed for RNA-seq count data and are widely used in the bioinformatics community.

1. What is the purpose of normalizing RNA-seq count data?

2. How is fold change calculated in gene expression analysis?



Section 3. Chapter 3

