Summary  
This chapter covers how to quantify conservation in aligned sequences by calculating percent identity and deriving consensus sequences to identify conserved versus variable positions.  

General domain of usage  
Evolutionary biology

Conserved regions are stretches of **DNA**, **RNA**, or **protein sequences** that remain relatively unchanged across different species or among individuals of a species. These regions are often preserved throughout evolution because they perform essential biological functions. In **evolutionary biology**, conserved regions can indicate shared ancestry and help identify functionally important parts of genes or proteins. In **disease research**, mutations in conserved regions are more likely to have significant effects, making them valuable for understanding genetic disorders and developing targeted therapies.

```
# Example of a multiple sequence alignment
# Aligned DNA sequences from three species

Seq1: ATGCTAGCTAGGCTA
Seq2: ATGCTAGCTAGACTA
Seq3: ATGCTAGCTAGGCTA

# Conserved positions: 1-11, 13-15 (all sequences are identical)
# Variable position: 12 (Seq2 has 'A', others have 'G')
```
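
Conserved and variable positions like these can also be found programmatically instead of by eye. The following is a minimal sketch: `classify_columns` is a hypothetical helper name, not something defined elsewhere in this chapter. It assumes the sequences are already aligned, equal in length, and free of gap characters.

```python
# Hypothetical helper for illustration; assumes equal-length, gap-free aligned sequences
def classify_columns(aligned_seqs):
    """Split 1-based alignment positions into conserved and variable lists."""
    conserved, variable = [], []
    # zip(*seqs) walks the alignment column by column
    for position, column in enumerate(zip(*aligned_seqs), start=1):
        if len(set(column)) == 1:  # every sequence has the same residue
            conserved.append(position)
        else:
            variable.append(position)
    return conserved, variable


alignment = {
    "Seq1": "ATGCTAGCTAGGCTA",
    "Seq2": "ATGCTAGCTAGACTA",
    "Seq3": "ATGCTAGCTAGGCTA",
}

conserved, variable = classify_columns(alignment.values())
print("Conserved positions:", conserved)
print("Variable positions:", variable)
```

Positions are reported 1-based to match the way alignment columns are usually numbered, whereas Python string indices start at 0.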

To quantify conservation in aligned sequences, you can use two simple measures. **Percent identity** is the percentage of alignment positions at which the sequences carry the same residue; the example that follows counts a position as a match only when all sequences agree. The **consensus sequence** records the most common nucleotide or amino acid at each alignment position. Percent identity summarizes overall similarity in a single number, while comparing each column against the consensus shows where the conserved and variable positions lie. Both inform studies of function and evolution.

```python
# Calculate percent identity for aligned DNA sequences
sequences = [
    "ATGCTAGCTAGGCTA",
    "ATGCTAGCTAGACTA",
    "ATGCTAGCTAGGCTA"
]

alignment_length = len(sequences[0])
matches = 0

for i in range(alignment_length):
    column = [seq[i] for seq in sequences]
    if column.count(column[0]) == len(column):
        matches += 1

percent_identity = (matches / alignment_length) * 100
print(f"Percent identity: {percent_identity:.2f}%")
```
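
The consensus sequence mentioned earlier can be derived in a similar column-by-column way. This is a sketch under stated assumptions rather than a definitive implementation: `consensus_sequence` is a hypothetical function name, the input is assumed to be gap-free and equal in length, and ties are resolved by whichever residue appears first in the column, which is a simplification.

```python
from collections import Counter


# Hypothetical helper for illustration; assumes gap-free aligned sequences
def consensus_sequence(aligned_seqs):
    """Return the most common residue at each alignment column."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        # most_common(1) yields [(residue, count)] for the top residue;
        # equal counts keep the order in which residues were first seen
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)


sequences = [
    "ATGCTAGCTAGGCTA",
    "ATGCTAGCTAGACTA",
    "ATGCTAGCTAGGCTA",
]

consensus = consensus_sequence(sequences)
print("Consensus:", consensus)

# Report which input sequences differ from the consensus
for number, seq in enumerate(sequences, start=1):
    status = "matches" if seq == consensus else "differs from"
    print(f"Seq{number} {status} the consensus")
```

Real alignments usually contain gaps and ambiguous residues, so a production version would need an explicit rule for those cases.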

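In the percent identity example, the `.2f` in the format string `{percent_identity:.2f}` formats the value to two decimal places.

Sequence logos are graphical representations of the conservation and variability at each position in a multiple sequence alignment. Each position is drawn as a stack of letters: the height of the stack reflects how conserved the position is, and the height of each letter reflects how often that residue occurs there. A logo therefore gives a visual summary of the **consensus** and shows at a glance which positions are highly conserved or variable, making alignment results and their functional significance easier to interpret.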
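
The per-position conservation that a sequence logo displays is commonly measured as information content in bits. The sketch below computes it for each column as log2(alphabet size) minus the Shannon entropy of the column. It is a simplified illustration: `column_information` is a hypothetical function name, the alphabet size of 4 assumes DNA with no gaps, and the small-sample correction used by logo tools is left out.

```python
import math
from collections import Counter


# Hypothetical helper for illustration; assumes DNA (4 letters) and no gaps
def column_information(column, alphabet_size=4):
    """Information content of one alignment column, in bits."""
    total = len(column)
    counts = Counter(column)
    # Shannon entropy of the residue frequencies in this column
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Maximum possible entropy minus observed entropy
    return math.log2(alphabet_size) - entropy


sequences = [
    "ATGCTAGCTAGGCTA",
    "ATGCTAGCTAGACTA",
    "ATGCTAGCTAGGCTA",
]

information = [column_information(col) for col in zip(*sequences)]

for position, bits in enumerate(information, start=1):
    print(f"Position {position:2d}: {bits:.2f} bits")
```

A fully conserved DNA column reaches the maximum of 2 bits, and the value drops as residues at that position become more mixed.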



Conserved Regions and Sequence Variation

1. Why are conserved regions important in comparative genomics?

2. What does a high percent identity indicate about a set of sequences?