Project: ML Classifier
Build an end-to-end machine learning pipeline with scikit-learn: load the Iris dataset, preprocess it, tune and train a classifier, evaluate its performance, and save the model for reuse.
What You’ll Build
A script that:
- Loads and explores data
- Splits into train/test sets
- Builds a preprocessing + model pipeline
- Trains and evaluates with cross-validation
- Saves the best model to disk
Setup
pip install scikit-learn pandas matplotlib seaborn joblib
Step 1: Load and Explore
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target
df["species_name"] = df["species"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
print(df.head())
print(df["species_name"].value_counts())
print(df.describe())
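Before modeling, it can help to confirm that the data is complete and to see which measurements separate the species. The following is a minimal sketch, not part of the original script; the variable names `missing` and `class_means` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map integer labels to species names by indexing the names array.
df["species_name"] = iris.target_names[iris.target]

# Count missing values per column.
missing = df.isna().sum()
print(missing)

# Mean of each measurement per species; large gaps between rows
# hint at features that separate the classes well.
class_means = df.groupby("species_name").mean()
print(class_means.round(2))
```

If the petal columns differ far more across species than the sepal columns, that is a hint the classifier will lean on them.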
Step 2: Visualize
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(df, hue="species_name")
plt.savefig("iris_pairplot.png")
plt.show()
Step 3: Build Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
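Step 3 imports `cross_val_score` but never calls it. One possible use is a baseline score for the untuned pipeline, so that the tuning step has something to be compared against. This is a sketch under that assumption; `baseline_scores` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validation on the training split only;
# the test split stays untouched until the final evaluation.
baseline_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(f"Baseline CV accuracy: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which avoids leaking validation-fold statistics. Tree-based models such as random forests do not depend on feature scaling, but keeping the scaler makes it easy to swap in a scale-sensitive classifier later.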
Step 4: Hyperparameter Tuning
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
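The best parameters alone do not show how close the other candidates were. As a rough sketch, the fitted search's `cv_results_` attribute can be loaded into a DataFrame and ranked; the `summary` and `columns` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid.cv_results_)
columns = [
    "param_classifier__n_estimators",
    "param_classifier__max_depth",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

On a small dataset like Iris, several combinations may tie or differ by less than one standard deviation, in which case the simpler setting is a reasonable choice.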
Step 5: Evaluate
from sklearn.metrics import classification_report, confusion_matrix
y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
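The raw confusion matrix prints without labels, so it is easy to misread which axis is which. In scikit-learn, rows are true classes and columns are predicted classes. The sketch below labels both axes with pandas and derives per-class recall from the matrix; for brevity it fits an untuned pipeline rather than the grid search, and `cm_df` and `recall_per_class` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Untuned stand-in for grid.best_estimator_ (an assumption of this sketch).
model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
]).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm,
    index=[f"true {name}" for name in iris.target_names],
    columns=[f"pred {name}" for name in iris.target_names],
)
print(cm_df)

# Recall per class: correct predictions (diagonal) divided by true counts (row sums).
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(iris.target_names, recall_per_class.round(2))))
```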
Step 6: Save and Load Model
import joblib
joblib.dump(grid.best_estimator_, "iris_model.pkl")
loaded = joblib.load("iris_model.pkl")
sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = loaded.predict(sample)
print(f"Predicted species: {iris.target_names[prediction[0]]}")
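After saving, it is worth checking that the reloaded model behaves the same as the original. The following sketch does a round-trip check and also prints class probabilities with `predict_proba`. It writes to a temporary directory and trains an untuned stand-in pipeline; both are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
]).fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "iris_model.pkl")
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(model.predict(iris.data), loaded.predict(iris.data))
print(f"Round trip OK: {round_trip_ok}")

# Class probabilities show how confident the model is, not just the top label.
sample = [[5.1, 3.5, 1.4, 0.2]]
proba = loaded.predict_proba(sample)[0]
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")
```

Files written by joblib use pickle under the hood, so they should only be loaded from trusted sources. Loading with a different scikit-learn version than the one used for saving may produce warnings or errors.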
Step 7: Prediction Function
def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    # Reload the saved pipeline; scaling is applied inside it.
    model = joblib.load("iris_model.pkl")
    features = [[sepal_length, sepal_width, petal_length, petal_width]]
    pred = model.predict(features)[0]
    # Relies on the `iris` object loaded in Step 1 for the label names.
    return iris.target_names[pred]

print(predict_species(5.1, 3.5, 1.4, 0.2))  # setosa
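The function above reads the model from disk on every call, which adds up if it is called repeatedly. One possible variant caches the loaded model with `functools.lru_cache`. This is a sketch, not the tutorial's method: the `get_model` helper, the `SPECIES` tuple, and the `model_path` parameter are hypothetical additions, and the demo trains a stand-in model in a temporary directory so the example runs on its own.

```python
import tempfile
from functools import lru_cache
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Label order used by load_iris (hypothetical constant for this sketch).
SPECIES = ("setosa", "versicolor", "virginica")


@lru_cache(maxsize=None)
def get_model(path):
    """Load the model once per path and reuse it on later calls."""
    return joblib.load(path)


def predict_species(sepal_length, sepal_width, petal_length, petal_width,
                    model_path="iris_model.pkl"):
    features = [[sepal_length, sepal_width, petal_length, petal_width]]
    pred = get_model(model_path).predict(features)[0]
    return SPECIES[pred]


# Demo: train and save a stand-in model so the sketch is self-contained.
X, y = load_iris(return_X_y=True)
stand_in = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
]).fit(X, y)
model_path = str(Path(tempfile.mkdtemp()) / "iris_model.pkl")
joblib.dump(stand_in, model_path)

print(predict_species(5.1, 3.5, 1.4, 0.2, model_path=model_path))
```

Hard-coding the species names also removes the dependency on a global `iris` object, at the cost of having to keep the tuple in sync with the training labels.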
Concepts Applied
Bonus Challenges
- Try different algorithms (SVM, Gradient Boosting) and compare
- Use a real-world dataset from Kaggle
- Build a CLI that accepts measurements and returns predictions (CLI Apps)
- Wrap the model in a FastAPI endpoint (REST API Project)
- Add feature importance visualization
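As a starting point for the first challenge, the sketch below cross-validates three classifiers inside the same scaling pipeline. The candidate list and default hyperparameters are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "svm_rbf": SVC(),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, clf in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    # Same folds setting and metric for every candidate keeps the comparison fair.
    cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()

# Print candidates from best to worst mean CV accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On a dataset this small the differences are often within noise, so a tie is a plausible outcome rather than a sign of a bug.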
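This project walks through a standard supervised ML workflow: explore the data, split it, build a preprocessing and model pipeline, tune hyperparameters, evaluate on held-out data, and persist the model.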