Learn QSAR Basics: Feature Selection, Regression & Classification | Similarity, Clustering and Drug Discovery
Python for Chemoinformatics

QSAR Basics: Feature Selection, Regression & Classification

QSAR, or Quantitative Structure-Activity Relationship, is a foundational concept in chemoinformatics that links the structural features of molecules to their biological or chemical activities. In drug discovery, QSAR models help you predict how new molecules might behave (for example, their potency, toxicity, or other properties) based on patterns learned from known compounds. This approach saves time and resources by focusing laboratory testing on the most promising candidates.

Definition

QSAR (Quantitative Structure-Activity Relationship) is a method for predicting a molecule's activity or properties from its chemical structure.

    # Extracting multiple molecular descriptors using RDKit
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    # Example: list of SMILES for small molecules
    smiles_list = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]

    # Prepare a list to hold descriptor values
    descriptor_names = ["MolWt", "NumHDonors", "NumHAcceptors", "MolLogP"]
    results = []

    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol:
            desc_values = [
                Descriptors.MolWt(mol),
                Descriptors.NumHDonors(mol),
                Descriptors.NumHAcceptors(mol),
                Descriptors.MolLogP(mol)
            ]
            results.append([smi] + desc_values)

    # Print the descriptors table
    print(f"{'SMILES':<12}{'MolWt':>8}{'HDonors':>10}{'HAcceptors':>12}{'LogP':>8}")
    for row in results:
        print(f"{row[0]:<12}{row[1]:8.2f}{row[2]:10}{row[3]:12}{row[4]:8.2f}")

When building QSAR models, you often calculate many descriptors for each molecule. However, not all descriptors are equally useful. Feature selection is the process of identifying which descriptors (features) actually help your model make accurate predictions. Selecting relevant features helps you avoid overfitting, reduces computational cost, and can reveal which molecular properties are most important for the activity you are modeling.
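
One simple way to make feature selection concrete is a filter approach: drop descriptors that never vary, then rank the rest by how strongly they correlate with the target. The sketch below is only an illustration under stated assumptions. The descriptor table and target values are made-up toy numbers, the helper names (`variance`, `pearson`, `select_features`) are hypothetical, and the choice of absolute Pearson correlation with `top_k=2` is one option among many rather than a recommended default.

```python
import math

# Toy descriptor table (made-up values, for illustration only).
# Each row is one molecule; columns follow the order in `names`.
names = ["MolWt", "MolLogP", "NumHDonors", "NumAromaticRings"]
X = [
    [100.0, 0.5, 1, 0],
    [120.0, 1.0, 2, 0],
    [140.0, 1.4, 1, 0],
    [160.0, 2.1, 2, 0],
    [180.0, 2.4, 1, 0],
]
y = [1.0, 1.6, 1.9, 2.7, 3.0]  # toy activity values


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    norm = math.sqrt(
        sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b)
    )
    return cov / norm


def select_features(rows, target, feature_names, min_variance=1e-12, top_k=2):
    """Drop (near-)constant columns, then keep the top_k by |correlation|."""
    scores = {}
    for j, name in enumerate(feature_names):
        values = [row[j] for row in rows]
        if variance(values) <= min_variance:
            continue  # a constant descriptor carries no information
        scores[name] = abs(pearson(values, target))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


selected, scores = select_features(X, y, names)
print("Selected descriptors:", selected)
for name, score in scores.items():
    print(f"{name:<18}|r| = {score:.3f}")
```

With these toy values the constant column is dropped and the two descriptors that track the target most closely are kept. A filter like this scores each descriptor on its own, so it does not detect descriptors that are redundant with each other.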

    # Fitting a simple regression model using scikit-learn with molecular descriptors
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from sklearn.linear_model import LinearRegression
    import numpy as np

    # Example: SMILES and their (fake) experimental property values
    smiles_list = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
    y = [0.5, 0.8, 1.2, 0.4]  # Example property values

    # Calculate descriptors: MolWt and MolLogP
    X = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol:
            X.append([
                Descriptors.MolWt(mol),
                Descriptors.MolLogP(mol)
            ])
    X = np.array(X)

    # Fit a regression model
    model = LinearRegression()
    model.fit(X, y)

    # Predict property for a new molecule
    test_smi = "CCCO"
    test_mol = Chem.MolFromSmiles(test_smi)
    test_desc = np.array([[Descriptors.MolWt(test_mol), Descriptors.MolLogP(test_mol)]])
    prediction = model.predict(test_desc)
    print(f"Predicted property for {test_smi}: {prediction[0]:.2f}")

In QSAR, the type of prediction you want to make determines whether you use regression or classification. Regression models predict continuous values, such as solubility or binding affinity. Classification models, on the other hand, predict discrete categories, such as "active" vs "inactive" compounds. Choosing between regression and classification depends on the nature of your target property and the available experimental data.
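
To illustrate the distinction, the sketch below frames the same set of measurements both ways: as continuous values scored with mean absolute error, and as "active"/"inactive" labels scored with accuracy. Everything in it is an assumption made for illustration. The pIC50-like numbers are made up, the cutoff of 6.0 is a hypothetical activity threshold, and the helper names are not from any library.

```python
# Toy pIC50-like values (made up for illustration only).
measured = [5.2, 6.8, 7.4, 4.9, 6.1]    # "experimental" values
predicted = [5.6, 6.5, 7.0, 6.2, 5.8]   # output of some hypothetical model

ACTIVE_CUTOFF = 6.0  # hypothetical threshold: values >= cutoff count as active


def mean_absolute_error(y_true, y_pred):
    """Regression view: average size of the numeric error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def to_labels(values, cutoff):
    """Classification view: bin continuous values into two classes."""
    return ["active" if v >= cutoff else "inactive" for v in values]


def accuracy(true_labels, pred_labels):
    """Fraction of molecules assigned to the correct class."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)


mae = mean_absolute_error(measured, predicted)
true_labels = to_labels(measured, ACTIVE_CUTOFF)
pred_labels = to_labels(predicted, ACTIVE_CUTOFF)
acc = accuracy(true_labels, pred_labels)

print(f"Regression view, mean absolute error: {mae:.2f}")
print(f"Classification view, accuracy: {acc:.2f}")
```

Thresholding a regression output is only one route to class labels. A classifier can also be trained directly on labeled data, which is often the practical choice when experiments only report active or inactive outcomes.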


1. What does QSAR stand for?

2. Why is feature selection important in QSAR modeling?

3. What is the difference between regression and classification?


Section 2. Chapter 5
