Machine Learning Workflow
Let's look at the workflow you would go through to build a successful machine learning project.
Step 1. Get the Data
Define the problem, choose a performance metric, and decide what qualifies as a good result. Then gather the required data from available sources and bring it into a format ready for Python. If the data already exists in a CSV file, preprocessing can begin immediately.
Example
A hospital compiles patient records and demographics into a CSV file. The goal is to predict readmissions, aiming for over 80% accuracy.
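Getting such a CSV file into Python might look like the following sketch. The file contents, column names, and values here are hypothetical, standing in for the hospital's real records:

```python
import csv
import io

# Hypothetical patient records, standing in for a file such as "patients.csv";
# an empty field represents a missing blood pressure reading
raw = io.StringIO(
    "age,blood_pressure,readmitted\n"
    "64,140,1\n"
    "52,,0\n"
    "71,150,1\n"
)

# csv.DictReader yields one dictionary per patient row
rows = list(csv.DictReader(raw))
print(len(rows))        # 3 records
print(rows[0]["age"])   # "64" (CSV values arrive as strings)
```

With a real file you would pass `open("patients.csv")` instead of the `io.StringIO` stand-in; in practice this loading step is often done with `pandas.read_csv` instead.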
Step 2. Preprocess the Data
This step includes:
- Data cleaning: handling missing values and non-numerical inputs;
- EDA: analyzing and visualizing data to understand relationships and detect issues;
- Feature engineering: selecting or creating features that improve model performance.
Example
Missing values (e.g., blood pressure) are filled, and categorical features (e.g., race) are converted into numerical form.
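Both of these preprocessing operations can be sketched with scikit-learn. The values below are made up for illustration; `np.nan` marks the missing blood pressure reading:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical blood pressure readings with one missing value
bp = np.array([[140.0], [np.nan], [150.0]])

# Fill the missing value with the column mean
bp_filled = SimpleImputer(strategy="mean").fit_transform(bp)
print(bp_filled.ravel())  # the nan becomes 145.0, the mean of 140 and 150

# Hypothetical categorical feature converted into numerical (one-hot) columns
race = np.array([["A"], ["B"], ["A"]])
race_encoded = OneHotEncoder().fit_transform(race).toarray()
print(race_encoded)       # one column per category, 1.0 marking membership
```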
Step 3. Modeling
This stage includes:
- Choosing a model based on problem type and experiments;
- Hyperparameter tuning to improve performance;
- Model evaluation on unseen data.
Hyperparameters are adjustable controls that define how the model trains, such as training duration or model complexity.
Example
A classification model is selected for predicting readmission (yes/no). After tuning, it is evaluated on a validation/test set to assess generalization.
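A minimal sketch of this stage, using the `KNeighborsClassifier` mentioned later in this lesson and synthetic data standing in for real patient features and readmission labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data in place of real patient records
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Tune the n_neighbors hyperparameter with cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the tuned model on unseen (test) data to assess generalization
print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out set
```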
Step 4. Deployment
Once a model performs well, it is deployed to real systems. The model must be monitored, updated with new data, and improved over time, often restarting the cycle from Step 1.
Example
The model is integrated into the hospital system to flag high-risk patients at admission, helping staff act early.
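One common deployment pattern is to serialize the trained model so another system can load it later and score new patients. A minimal sketch, again on synthetic stand-in data, using the standard library's `pickle` (`joblib` is another common choice for scikit-learn models):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Train on synthetic data standing in for historical patient records
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = KNeighborsClassifier().fit(X, y)

# Serialize the trained model; in a real system this blob would be
# written to disk or a model registry
blob = pickle.dumps(model)

# At admission time: deserialize the model and flag the incoming patient
deployed = pickle.loads(blob)
flag = deployed.predict(X[:1])  # binary readmission-risk label
print(int(flag[0]))
```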
Some of the terms mentioned here may sound unfamiliar, but we'll discuss them in more detail later in this course.
Data preprocessing and modeling can be done with scikit-learn. The next chapters introduce preprocessing workflows and pipelines, followed by modeling using k-nearest neighbors (KNeighborsClassifier), including training, tuning, and evaluation.
1. What is the primary purpose of the "Get the data" step in a machine learning project?
2. Which of the following best describes the importance of the "Data preprocessing" step in a machine learning project workflow?