Scikit-Learn in Practice

Scikit-Learn is Python's most popular machine learning library. It gives you a consistent API for dozens of algorithms, built-in datasets, preprocessing tools, and evaluation utilities — all in one package.

📖 Covers: Installation · Datasets · fit/predict API · Pipelines · GridSearchCV · Saving Models

Installation & Setup

Terminal
pip install scikit-learn numpy pandas matplotlib

# Verify installation
python -c "import sklearn; print(sklearn.__version__)"
# → 1.4.0

The Consistent API: fit → predict

Every Scikit-Learn model follows the same three-method pattern. Once you learn one model, switching to any other is trivial:

1. model.fit(X_train, y_train): learn from training data
2. model.predict(X_test): make predictions
3. model.score(X_test, y_test): evaluate accuracy
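
To see how interchangeable the estimators are, here is a minimal illustrative sketch (the two model choices are arbitrary) that runs two different classifiers through the exact same three calls:

Python · Same API, Different Models
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The loop body never changes: fit, predict, score
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(type(model).__name__, model.score(X_test, y_test))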

Your First ML Model (5 Lines)

Python · Hello, Machine Learning
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Built-In Datasets for Practice

🌸 Iris (load_iris): 150 flowers, 4 features, 3 species. The "Hello World" of ML.

🏠 California Housing (fetch_california_housing): house-price regression. 20,000+ rows, 8 features.

🎗️ Breast Cancer (load_breast_cancer): binary classification, malignant vs. benign. 569 samples, 30 features.

✍️ Digits (load_digits): 8×8 images of handwritten digits (0-9). A small built-in cousin of MNIST (the full 28×28 MNIST set is available via fetch_openml('mnist_784')).
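
Most loaders also accept as_frame=True, which returns the data as a pandas DataFrame for easier exploration. A small sketch:

Python · Loading a Dataset as a DataFrame
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame           # features plus a 'target' column
print(df.head())
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']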

Complete Workflow Example

Python · Full ML Workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# 1. Load data
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline (prevents data leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
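
Because the pipeline bundles the scaler and classifier into one estimator, it can also be cross-validated as a single unit; the scaler is re-fit inside every fold, which is exactly what keeps the held-out fold leak-free. A minimal sketch continuing the example above:

Python · Cross-Validating the Pipeline
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")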

Hyperparameter Tuning with GridSearchCV

Hyperparameters are settings you choose before training (e.g., max_depth, n_estimators). GridSearchCV automatically tries all combinations and cross-validates each one.

Python · GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1  # use all CPU cores
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_
💡 Try RandomizedSearchCV for large grids

GridSearchCV tries every combination. With 5 parameters of 4 values each, that's 4⁵ = 1024 candidate models, each refit once per CV fold (5,120 fits at cv=5). RandomizedSearchCV instead samples a fixed number of combinations at random: much faster, and usually nearly as good.
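
The call mirrors GridSearchCV; here is a rough sketch reusing the param_grid above, with n_iter capping how many combinations get sampled:

Python · RandomizedSearchCV (sketch)
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,  # plain lists are sampled uniformly
    n_iter=10,                       # fit 10 random combinations out of the 18 in the grid
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")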

Saving and Loading Models

Python · Save & Load
import joblib

# Save the fitted pipeline to disk
joblib.dump(pipeline, 'my_model.pkl')

# Load it later (e.g., in production). Load with the same scikit-learn
# version that saved it; unpickling across versions can warn or break.
loaded_model = joblib.load('my_model.pkl')
# new_data must be a 2D feature array with the same columns as the training data
predictions = loaded_model.predict(new_data)

Quick Reference: Common Estimators

Task             Algorithm            Import
Classification   Random Forest        ensemble.RandomForestClassifier
Classification   Logistic Regression  linear_model.LogisticRegression
Classification   SVM                  svm.SVC
Regression       Linear Regression    linear_model.LinearRegression
Regression       Gradient Boosting    ensemble.GradientBoostingRegressor
Clustering       K-Means              cluster.KMeans
Dim. Reduction   PCA                  decomposition.PCA

Frequently Asked Questions

Is Scikit-Learn good for deep learning?

No. Scikit-Learn is designed for classical ML on structured/tabular data. For deep learning (images, text, sequences), use PyTorch or TensorFlow. Scikit-Learn does have a basic MLPClassifier for simple neural networks, but it's not production-grade for complex tasks.

What's the difference between fit() and fit_transform()?

fit() learns parameters (e.g., mean/std for a scaler). transform() applies those parameters. fit_transform() does both in one step — convenient for training data. Always use transform() alone for test data to avoid leakage.
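
For instance, with a StandardScaler (a sketch assuming X_train and X_test already exist):

Python · fit_transform vs transform
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std; never re-fit on test data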

How do I handle imbalanced classes?

Use class_weight='balanced' in most classifiers. Or use SMOTE (from the imbalanced-learn library) to over-sample the minority class. Also evaluate with F1 or ROC-AUC rather than accuracy.
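
As a quick sketch (class_weight works the same way in most scikit-learn classifiers, and the split is assumed to come from a binary task like the breast cancer example above):

Python · class_weight for Imbalanced Data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
print(f"F1: {f1_score(y_test, clf.predict(X_test)):.3f}")  # prefer F1/ROC-AUC over accuracy here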
