Learn Molecular Clustering | Similarity, Clustering and Drug Discovery
Python for Chemoinformatics

Molecular Clustering

Clustering is a powerful approach in chemoinformatics that allows you to group molecules based on their similarity. This process is essential when analyzing large libraries of compounds, as it helps you identify patterns, reduce redundancy, and select representative molecules for further study. In drug discovery, clustering can be used to organize chemical space, prioritize compounds for screening, and ensure diversity in a chemical library.

Definition

A similarity matrix is a table that shows the pairwise similarity scores between molecules, typically calculated with metrics like Tanimoto similarity on fingerprints.
Clustering is the process of grouping molecules so that those within the same group (cluster) are more similar to each other than to those in other groups.
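
To make the definition concrete, here is a minimal sketch of Tanimoto similarity in plain Python, treating each fingerprint as the set of its "on" bit positions. The function name `tanimoto`, the three bit sets, and the choice to return 0.0 for two empty fingerprints are assumptions made for illustration; they are not real fingerprints and not RDKit's implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Assumed convention for this sketch: two empty fingerprints score 0.0
    if union == 0:
        return 0.0
    # Shared on-bits divided by all on-bits present in either fingerprint
    return len(a & b) / union


# Hypothetical on-bit sets, made up for illustration (not real fingerprints)
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}
fp_z = {2, 3}

fingerprints = [fp_x, fp_y, fp_z]

# Pairwise similarity matrix: entry [i][j] compares fingerprint i with fingerprint j
toy_matrix = [[tanimoto(p, q) for q in fingerprints] for p in fingerprints]

for row in toy_matrix:
    print([round(value, 2) for value in row])
```

`fp_x` and `fp_y` share 2 of the 5 bits set in either one, so their score is 2/5 = 0.4. The matrix is symmetric and has 1.0 on the diagonal, because every fingerprint is identical to itself.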

```python
from rdkit import Chem, DataStructs
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Input SMILES
smiles_list = [
    "CCO",      # ethanol
    "CCCO",     # 1-propanol
    "CCCCO",    # 1-butanol
    "CCN",      # ethylamine
    "CC(=O)O"   # acetic acid
]

# Convert SMILES to Mol objects
mols = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    mols.append(mol)

# Create Morgan fingerprint generator
gen = GetMorganGenerator(radius=2, fpSize=1024)

# Generate fingerprints
fps = []
for mol in mols:
    fp = gen.GetFingerprint(mol)
    fps.append(fp)

# Compute similarity matrix
n = len(fps)
similarity_matrix = []
for i in range(n):
    row = []
    for j in range(n):
        similarity = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        row.append(similarity)
    similarity_matrix.append(row)

# Print matrix
for row in similarity_matrix:
    print(row)
```

Clustering relies on the idea that structurally similar molecules receive higher similarity scores and often share properties. The similarity matrix computed above holds the pairwise similarity values for every pair of molecules in your set. Clustering algorithms use this matrix to group molecules so that each cluster contains compounds that are more similar to each other than to those in other clusters. This is often the first step before selecting representative compounds for further analysis or experimental testing.

```python
# Simple clustering: group molecules with similarity at or above a threshold
threshold = 0.7
clusters = []
assigned = set()

for i in range(n):
    if i in assigned:
        continue
    cluster = [i]
    assigned.add(i)
    for j in range(i + 1, n):
        # Skip molecules that already belong to an earlier cluster
        if j not in assigned and similarity_matrix[i][j] >= threshold:
            cluster.append(j)
            assigned.add(j)
    clusters.append(cluster)

# Print each cluster as a list of SMILES
for idx, cluster in enumerate(clusters):
    print(f"Cluster {idx + 1}: {[smiles_list[i] for i in cluster]}")
```
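
The simple grouping above compares each candidate only with the first molecule of a cluster. One common alternative is single-linkage grouping, where a molecule joins a cluster if it is similar enough to any existing member. The sketch below illustrates that idea in plain Python; the function name `single_linkage_clusters` and the values in `toy_sim` are hypothetical and chosen only to show the difference.

```python
def single_linkage_clusters(sim, threshold):
    """Group indices so that any pair with sim >= threshold shares a cluster."""
    n = len(sim)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Walk over every molecule reachable through "similar enough" links
        stack = [start]
        seen.add(start)
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Hypothetical similarity values, made up for illustration
toy_sim = [
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.75, 0.2],
    [0.5, 0.75, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

print(single_linkage_clusters(toy_sim, threshold=0.7))  # [[0, 1, 2], [3]]
```

Molecule 2 is below the threshold relative to molecule 0 (0.5) but above it relative to molecule 1 (0.75), so single linkage places all three together. The first-member comparison would leave molecule 2 in its own cluster. Which behavior is preferable depends on how tight the clusters need to be.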

By clustering molecules based on their similarity, you can quickly spot groups of redundant compounds, that is, compounds that are very similar to each other. This helps you avoid screening or analyzing nearly identical molecules, saving both time and resources. At the same time, picking a representative from each cluster gives you a diverse subset of your library, which is crucial for exploring new chemical space and increasing the chances of finding novel active compounds.
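
One way to turn clusters into a reduced, diverse set is to keep a single representative per cluster. The sketch below picks the member with the highest average similarity to the rest of its cluster (a medoid-style choice). The function names, the `toy_sim` values, and the `toy_clusters` assignment are hypothetical; other selection rules, such as taking the first member, are equally valid.

```python
def mean_similarity(sim, i, cluster):
    """Average similarity of molecule i to the other members of its cluster."""
    others = [j for j in cluster if j != i]
    if not others:
        # A singleton cluster has no other members to compare with
        return 0.0
    return sum(sim[i][j] for j in others) / len(others)


def pick_representatives(sim, clusters):
    """Return one index per cluster: the member most similar to its cluster mates."""
    return [
        max(cluster, key=lambda i: mean_similarity(sim, i, cluster))
        for cluster in clusters
    ]


# Hypothetical similarity values and cluster assignment, made up for illustration
toy_sim = [
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.75, 0.2],
    [0.5, 0.75, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
toy_clusters = [[0, 1, 2], [3]]

print(pick_representatives(toy_sim, toy_clusters))  # [1, 3]
```

In the first cluster, molecule 1 has the highest average similarity to its neighbors (0.775, versus 0.65 and 0.625), so it stands in for the group. The singleton cluster is represented by its only member.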

1. What is the purpose of clustering molecules?

2. What is a potential application of molecular clustering?


Section 2. Chapter 3
