Scikit-Learn in Practice

Scikit-Learn is Python's most popular machine learning library. It gives you a consistent API for dozens of algorithms, built-in datasets, preprocessing tools, and evaluation utilities — all in one package.

📖 Covers: Installation · Datasets · fit/predict API · Pipelines · GridSearchCV · Saving Models

Installation & Setup

Terminal
pip install scikit-learn numpy pandas matplotlib

# Verify installation
python -c "import sklearn; print(sklearn.__version__)"
# → 1.4.0

The Consistent API: fit → predict

Every Scikit-Learn model follows the same three-method pattern. Once you learn one model, switching to any other is trivial:

1. model.fit(X_train, y_train): learn from training data
2. model.predict(X_test): make predictions
3. model.score(X_test, y_test): evaluate accuracy
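
To see how interchangeable the estimators are, here is a minimal illustrative sketch (the two model choices are arbitrary) that runs two different classifiers through the exact same three calls:

Python · Same API, Different Models
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The loop body never changes: fit, predict, score
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(type(model).__name__, model.score(X_test, y_test))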

Your First ML Model (5 Lines)

Python · Hello, Machine Learning
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Built-In Datasets for Practice

🌸 Iris (load_iris): 150 flowers, 4 features, 3 species. The "Hello World" of ML.

🏠 California Housing (fetch_california_housing): house-price regression. 20,000+ rows, 8 features.

🎗️ Breast Cancer (load_breast_cancer): binary classification, malignant vs. benign. 569 samples, 30 features.

✍️ Digits (load_digits): 8×8 images of handwritten digits (0-9). A small built-in cousin of MNIST (the full 28×28 MNIST set is available via fetch_openml('mnist_784')).
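
Most loaders also accept as_frame=True, which returns the data as a pandas DataFrame for easier exploration. A small sketch:

Python · Loading a Dataset as a DataFrame
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame           # features plus a 'target' column
print(df.head())
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']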

Complete Workflow Example

Python · Full ML Workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# 1. Load data
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline (prevents data leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
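
Because the pipeline bundles the scaler and classifier into one estimator, it can also be cross-validated as a single unit; the scaler is re-fit inside every fold, which is exactly what keeps the held-out fold leak-free. A minimal sketch continuing the example above:

Python · Cross-Validating the Pipeline
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")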

Hyperparameter Tuning with GridSearchCV

Hyperparameters are settings you choose before training (e.g., max_depth, n_estimators). GridSearchCV automatically tries all combinations and cross-validates each one.

Python · GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1  # use all CPU cores
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_
💡 Try RandomizedSearchCV for large grids

GridSearchCV tries every combination. With 5 parameters of 4 values each, that's 4⁵ = 1024 candidate models, each refit once per CV fold (5,120 fits at cv=5). RandomizedSearchCV instead samples a fixed number of combinations at random: much faster, and usually nearly as good.
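
The call mirrors GridSearchCV; here is a rough sketch reusing the param_grid above, with n_iter capping how many combinations get sampled:

Python · RandomizedSearchCV (sketch)
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,  # plain lists are sampled uniformly
    n_iter=10,                       # fit 10 random combinations out of the 18 in the grid
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")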

Saving and Loading Models

Python · Save & Load
import joblib

# Save the fitted pipeline to disk
joblib.dump(pipeline, 'my_model.pkl')

# Load it later (e.g., in production). Load with the same scikit-learn
# version that saved it; unpickling across versions can warn or break.
loaded_model = joblib.load('my_model.pkl')
# new_data must be a 2D feature array with the same columns as the training data
predictions = loaded_model.predict(new_data)

Quick Reference: Common Estimators

Task             Algorithm            Import
Classification   Random Forest        ensemble.RandomForestClassifier
Classification   Logistic Regression  linear_model.LogisticRegression
Classification   SVM                  svm.SVC
Regression       Linear Regression    linear_model.LinearRegression
Regression       Gradient Boosting    ensemble.GradientBoostingRegressor
Clustering       K-Means              cluster.KMeans
Dim. Reduction   PCA                  decomposition.PCA

Frequently Asked Questions

Is Scikit-Learn good for deep learning?

No. Scikit-Learn is designed for classical ML on structured/tabular data. For deep learning (images, text, sequences), use PyTorch or TensorFlow. Scikit-Learn does have a basic MLPClassifier for simple neural networks, but it's not production-grade for complex tasks.

What's the difference between fit() and fit_transform()?

fit() learns parameters (e.g., mean/std for a scaler). transform() applies those parameters. fit_transform() does both in one step — convenient for training data. Always use transform() alone for test data to avoid leakage.
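
For instance, with a StandardScaler (a sketch assuming X_train and X_test already exist):

Python · fit_transform vs transform
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std; never re-fit on test data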

How do I handle imbalanced classes?

Use class_weight='balanced' in most classifiers. Or use SMOTE (from the imbalanced-learn library) to over-sample the minority class. Also evaluate with F1 or ROC-AUC rather than accuracy.
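
As a quick sketch (class_weight works the same way in most scikit-learn classifiers, and the split is assumed to come from a binary task like the breast cancer example above):

Python · class_weight for Imbalanced Data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
print(f"F1: {f1_score(y_test, clf.predict(X_test)):.3f}")  # prefer F1/ROC-AUC over accuracy here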
