Challenge: Predict Employee Attrition

Before diving into a hands-on challenge, it is helpful to recap the typical steps involved in predictive modeling for employee attrition. You usually start by preparing your data, which includes collecting relevant employee features such as age, tenure, satisfaction, and department, and ensuring that the target column (attrition: 1 for left, 0 for stayed) is correctly formatted. The next step is to select and train an appropriate model; logistic regression is commonly used for binary classification problems like attrition prediction. After fitting the model to your data, you evaluate its performance using metrics such as accuracy (the proportion of correct predictions) and recall (the proportion of actual attrition cases correctly identified). Visualizations like confusion matrices and probability plots help you interpret the model’s predictions and understand where it performs well or needs improvement.


import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Example DataFrame
data = {
    'age': [25, 45, 30, 41, 36, 28, 50, 29],
    'tenure': [2, 10, 4, 8, 6, 3, 15, 1],
    'satisfaction': [0.9, 0.4, 0.7, 0.5, 0.6, 0.8, 0.3, 0.95],
    'department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR', 'Sales', 'IT'],
    'attrition': [0, 1, 0, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['department'], drop_first=True)

# Features and target
X = df_encoded.drop('attrition', axis=1)
y = df_encoded['attrition']

# Model setup
model = LogisticRegression()
model.fit(X, y)

# Predict on the training data (this small example has no hold-out set)
y_pred = model.predict(X)
accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)

# Confusion matrix
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Predicted probabilities
probs = model.predict_proba(X)[:, 1]
plt.bar(range(len(probs)), probs)
plt.xlabel('Employee')
plt.ylabel('Predicted Probability of Attrition')
plt.title('Attrition Probability by Employee')
plt.show()

# Print metrics
print("Accuracy:", accuracy)
print("Recall:", recall)

# Summary: 
# The logistic regression model predicts employee attrition using age, tenure, satisfaction, and department.
# Accuracy shows the proportion of correct predictions, while recall indicates how well the model identifies employees who left.
# The confusion matrix and probability plot help visualize model performance and individual risk.
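To make the two metrics in the summary concrete, the sketch below computes accuracy and recall by hand from confusion-matrix counts. The labels and probabilities are hypothetical values chosen for illustration, not the output of the model above, and the `evaluate` helper and the 0.4 threshold are assumptions introduced here.

```python
# Hypothetical true labels and predicted attrition probabilities
# (illustrative values only, not produced by the model above).
y_true = [0, 1, 0, 1, 0, 0, 1, 0]
probs = [0.10, 0.80, 0.42, 0.45, 0.20, 0.15, 0.90, 0.55]


def evaluate(y_true, probs, threshold=0.5):
    """Return confusion-matrix counts, accuracy, and recall at a threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # leavers flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # leavers missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayers flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers cleared
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "accuracy": accuracy, "recall": recall}


default = evaluate(y_true, probs, threshold=0.5)
lenient = evaluate(y_true, probs, threshold=0.4)  # hypothetical lower cutoff

print("threshold 0.5:", default)
print("threshold 0.4:", lenient)
```

With these made-up numbers, accuracy is 0.75 at both thresholds, while recall rises from 2/3 to 1.0 at the lower cutoff because the one missed leaver is now flagged, at the cost of one extra false positive. Two models with the same accuracy can therefore differ in how many departures they catch.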

When interpreting your attrition model results, remember that accuracy alone may not capture the value of your predictions. Recall is especially important if HR wants to minimize missed cases of likely attrition, because every false negative is an employee who left without being flagged. Use the confusion matrix to count false positives and false negatives, and review predicted probabilities to spot employees at high risk. When communicating findings to HR, focus on actionable insights:

Which features are most associated with attrition;
Which employees may benefit from targeted retention strategies based on their predicted risk.
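One possible way to surface both insights is to inspect the fitted coefficients and to rank employees by predicted probability. The sketch below reuses the toy data from the example above. The `max_iter=1000` setting, the `RISK_THRESHOLD` cutoff, and the `attrition_probability` column name are assumptions added for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same toy data as the example above
data = {
    'age': [25, 45, 30, 41, 36, 28, 50, 29],
    'tenure': [2, 10, 4, 8, 6, 3, 15, 1],
    'satisfaction': [0.9, 0.4, 0.7, 0.5, 0.6, 0.8, 0.3, 0.95],
    'department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR', 'Sales', 'IT'],
    'attrition': [0, 1, 0, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)

X = pd.get_dummies(df.drop(columns='attrition'),
                   columns=['department'], drop_first=True)
y = df['attrition']

# max_iter is raised (an assumption) to give the solver room on unscaled features
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Coefficients: a positive value raises the predicted log-odds of attrition.
# Sorted by absolute size; features are unscaled, so sizes are only roughly comparable.
coefficients = pd.Series(model.coef_[0], index=X.columns)
coefficients = coefficients.sort_values(key=abs, ascending=False)
print(coefficients)

# Rank employees by predicted probability of attrition
ranked = df.assign(attrition_probability=model.predict_proba(X)[:, 1])
ranked = ranked.sort_values('attrition_probability', ascending=False)

RISK_THRESHOLD = 0.5  # hypothetical cutoff for "high risk"
high_risk = ranked[ranked['attrition_probability'] >= RISK_THRESHOLD]
print(high_risk[['age', 'tenure', 'satisfaction', 'department',
                 'attrition_probability']])
```

With only eight rows the coefficients are unstable, so treat this as a pattern for reading a model rather than as evidence about real attrition drivers. On real data, standardizing the numeric features first would make coefficient sizes easier to compare.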

Task


Build a Python script that predicts employee attrition using logistic regression. Use the provided DataFrame with employee features and attrition labels. Your script must:

Train a logistic regression model to predict attrition based on age, tenure, satisfaction, and one-hot encoded department.
Predict attrition for all employees in the DataFrame.
Calculate and store the accuracy and recall of the predictions.
Visualize the confusion matrix of actual vs. predicted attrition.
Visualize the predicted probability of attrition for each employee.
Summarize your findings in comments at the end of your script.
