Scikit-Learn in Practice
Scikit-Learn is Python's most popular machine learning library. It gives you a consistent API for dozens of algorithms, built-in datasets, preprocessing tools, and evaluation utilities — all in one package.
Installation & Setup
pip install scikit-learn numpy pandas matplotlib
# Verify installation
python -c "import sklearn; print(sklearn.__version__)"
# → 1.4.0

The Consistent API: fit → predict
Every Scikit-Learn model follows the same three-method pattern. Once you learn one model, switching to any other is trivial:
model.fit(X_train, y_train): learn from training data
model.predict(X_test): make predictions
model.score(X_test, y_test): evaluate accuracy
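To see the uniform API in action, here is a minimal sketch that trains two different classifiers with identical fit/score calls; the specific models, dataset, and split are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The same three calls work for any estimator; only the constructor changes
for model in (KNeighborsClassifier(n_neighbors=3),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"{type(model).__name__}: {acc:.2%}")
```

Swapping in any other classifier (e.g. an SVM or a random forest) requires changing only the constructor line.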
Your First ML Model (5 Lines)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Built-In Datasets for Practice
load_iris(): 150 flowers, 4 features, 3 species. The "Hello World" of ML.
fetch_california_housing(): House prices regression. 20,000+ rows, 8 features.
load_breast_cancer(): Binary classification, malignant vs benign. 569 samples, 30 features.
load_digits(): Handwritten digit images (0-9). Classic image classification.

Complete Workflow Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
# 1. Load data
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Build pipeline (prevents data leakage)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# 4. Train
pipeline.fit(X_train, y_train)
# 5. Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Hyperparameter Tuning with GridSearchCV
Hyperparameters are settings you choose before training (e.g., max_depth, n_estimators). GridSearchCV automatically tries all combinations and cross-validates each one.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1',
n_jobs=-1 # use all CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_

GridSearchCV tries every combination. If you have 5 parameters with 4 values each, that's 4⁵ = 1024 models. Use RandomizedSearchCV to randomly sample the grid instead; it is much faster and almost as good.
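As a sketch of the randomized approach (the parameter ranges, `n_iter`, and `cv` values here are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# n_iter=10 samples 10 random combinations instead of all 36
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

Like GridSearchCV, the result exposes `best_params_`, `best_score_`, and `best_estimator_`, so the surrounding code does not change.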
Saving and Loading Models
import joblib
# Save model to disk
joblib.dump(pipeline, 'my_model.pkl')
# Load model later (in production)
loaded_model = joblib.load('my_model.pkl')
predictions = loaded_model.predict(new_data)

Quick Reference: Common Estimators
ensemble.RandomForestClassifier
linear_model.LogisticRegression
svm.SVC
linear_model.LinearRegression
ensemble.GradientBoostingRegressor
cluster.KMeans
decomposition.PCA

Frequently Asked Questions
Is Scikit-Learn good for deep learning?
No. Scikit-Learn is designed for classical ML on structured/tabular data. For deep learning (images, text, sequences), use PyTorch or TensorFlow. Scikit-Learn does have a basic MLPClassifier for simple neural networks, but it's not production-grade for complex tasks.
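For completeness, here is a minimal sketch of MLPClassifier on a small image dataset; the layer size and iteration count are illustrative choices, and scaling the inputs is generally advisable for neural networks:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One hidden layer of 64 units; scaling helps the MLP converge
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)),
])
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2%}")
```

This works for simple problems, but it has no GPU support and none of the architectures (convolutions, transformers) that deep learning frameworks provide.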
What's the difference between fit() and fit_transform()?
fit() learns parameters (e.g., mean/std for a scaler). transform() applies those parameters. fit_transform() does both in one step — convenient for training data. Always use transform() alone for test data to avoid leakage.
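A small sketch of the distinction with StandardScaler; the toy arrays are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train, then scales
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std, no leakage

print(scaler.mean_)   # [2.]
print(X_test_scaled)  # [[0.]] because 2.0 equals the training mean
```

Calling fit_transform() on the test set would recompute the mean/std from test data, silently leaking information the model should never see.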
How do I handle imbalanced classes?
Use class_weight='balanced' in most classifiers. Or use SMOTE (from the imbalanced-learn library) to over-sample the minority class. Also evaluate with F1 or ROC-AUC rather than accuracy.
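As a sketch of the class_weight option on a synthetic imbalanced problem (the 9:1 ratio and logistic regression are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly a 9:1 class imbalance
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression().fit(X_train, y_train)
balanced = LogisticRegression(class_weight='balanced').fit(X_train, y_train)

# Compare minority-class F1, not raw accuracy
print("plain   :", f1_score(y_test, plain.predict(X_test)))
print("balanced:", f1_score(y_test, balanced.predict(X_test)))
```

class_weight='balanced' reweights the loss inversely to class frequency, which typically trades some precision for better minority-class recall.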