Learn Molecular Clustering | Similarity, Clustering and Drug Discovery

Clustering is a powerful approach in chemoinformatics that allows you to group molecules based on their similarity. This process is essential when analyzing large libraries of compounds, as it helps you identify patterns, reduce redundancy, and select representative molecules for further study. In drug discovery, clustering can be used to organize chemical space, prioritize compounds for screening, and ensure diversity in a chemical library.

Definition

A similarity matrix is a table that shows the pairwise similarity scores between molecules, typically calculated with metrics like Tanimoto similarity on fingerprints.
Clustering is the process of grouping molecules so that those within the same group (cluster) are more similar to each other than to those in other groups.
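
To make the Tanimoto score in the definition concrete, here is a minimal sketch that computes it by hand. It assumes each fingerprint is represented as a plain Python set of on-bit indices; the helper `tanimoto` and the two example fingerprints are hypothetical and are not part of RDKit. The score is the number of shared bits divided by the number of distinct bits set in either fingerprint.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints: indices of the bits that are set
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}

score = tanimoto(fp_x, fp_y)
print(score)  # 2 shared bits / 5 distinct bits = 0.4
```

A score of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.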


from rdkit import Chem, DataStructs
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Input SMILES
smiles_list = [
    "CCO",        # ethanol
    "CCCO",       # 1-propanol
    "CCCCO",      # 1-butanol
    "CCN",        # ethylamine
    "CC(=O)O"     # acetic acid
]

# Convert SMILES to Mol objects
mols = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    mols.append(mol)

# Create Morgan fingerprint generator
gen = GetMorganGenerator(radius=2, fpSize=1024)

# Generate fingerprints
fps = []
for mol in mols:
    fp = gen.GetFingerprint(mol)
    fps.append(fp)

# Compute similarity matrix
n = len(fps)
similarity_matrix = []

for i in range(n):
    row = []
    for j in range(n):
        similarity = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        row.append(similarity)
    similarity_matrix.append(row)

# Print matrix
for row in similarity_matrix:
    print(row)

Clustering relies on the idea that structurally similar molecules share more fingerprint bits and therefore score higher on metrics such as Tanimoto similarity. The similarity matrix you just computed holds the similarity value for every pair of molecules in your set; it is symmetric, and every diagonal entry is 1.0 because each molecule is identical to itself. Clustering algorithms use this matrix to group molecules into clusters, where each cluster contains compounds that are more similar to each other than to those in other clusters. This process is often the first step before selecting representative compounds for further analysis or experimental testing.


# Simple clustering: Group molecules with similarity above a threshold
threshold = 0.7
clusters = []
assigned = set()

for i in range(n):
    if i in assigned:
        continue
    cluster = [i]
    assigned.add(i)
    for j in range(i + 1, n):
        # Skip molecules already placed in an earlier cluster
        if j not in assigned and similarity_matrix[i][j] >= threshold:
            cluster.append(j)
            assigned.add(j)
    clusters.append(cluster)

# Print each cluster as a list of SMILES strings
for idx, cluster in enumerate(clusters):
    print(f"Cluster {idx + 1}: {[smiles_list[i] for i in cluster]}")
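
A greedy threshold grouping like the one above compares every candidate only with the first molecule of each cluster, so the result can depend on the order in which molecules are processed. The sketch below illustrates this on a small made-up matrix. The function `threshold_clusters`, its `order` parameter, and `toy_sim` are hypothetical names introduced for illustration, and the function is a slight variation on the loop above (it scans all unassigned molecules rather than only those with a higher index).

```python
def threshold_clusters(sim, threshold, order=None):
    """Greedy grouping: each unassigned molecule seeds a cluster and
    collects every still-unassigned molecule at or above the threshold."""
    n = len(sim)
    order = list(range(n)) if order is None else list(order)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        cluster = [i]
        assigned.add(i)
        for j in order:
            if j not in assigned and sim[i][j] >= threshold:
                cluster.append(j)
                assigned.add(j)
        clusters.append(cluster)
    return clusters


# Made-up similarities: 0 and 1 are close, 1 and 2 are close, 0 and 2 are not
toy_sim = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.8],
    [0.5, 0.8, 1.0],
]

clusters_from_0 = threshold_clusters(toy_sim, 0.7)                   # seeds with molecule 0
clusters_from_1 = threshold_clusters(toy_sim, 0.7, order=[1, 0, 2])  # seeds with molecule 1

print(clusters_from_0)  # [[0, 1], [2]]
print(clusters_from_1)  # [[1, 0, 2]]
```

Starting from molecule 0 leaves molecule 2 on its own, while starting from molecule 1 pulls all three into one cluster. For real libraries, it is worth keeping this sensitivity in mind when choosing both the threshold and the input order.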

By clustering molecules based on their similarity, you can quickly spot groups of redundant compounds, that is, compounds that are very similar to each other. This helps you avoid screening or analyzing nearly identical molecules, saving both time and resources. At the same time, clustering highlights diverse representatives from your library, which is crucial for exploring new chemical space and increasing the chances of finding novel active compounds.
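
One possible way to pick a representative from each cluster is to take the member with the highest total similarity to the other members of its cluster, sometimes called the medoid. The sketch below assumes a similarity matrix and a list of clusters given as index lists, in the same shape as those built earlier in this lesson; `pick_representatives`, `toy_sim`, and `toy_clusters` are hypothetical names, and the numbers are made up for illustration.

```python
def pick_representatives(sim, clusters):
    """Return, for each cluster, the index of the member with the highest
    summed similarity to all members of that cluster (ties: first wins)."""
    reps = []
    for cluster in clusters:
        best = max(cluster, key=lambda i: sum(sim[i][j] for j in cluster))
        reps.append(best)
    return reps


# Made-up similarity matrix for four molecules
toy_sim = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.9, 0.2],
    [0.6, 0.9, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
toy_clusters = [[0, 1, 2], [3]]

representatives = pick_representatives(toy_sim, toy_clusters)
print(representatives)  # [1, 3]
```

Molecule 1 is chosen for the first cluster because it is the most similar, on aggregate, to the other two members; a singleton cluster simply represents itself.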

1. What is the purpose of clustering molecules?

2. What is a potential application of molecular clustering?
