Feature Engineering
"Garbage in, garbage out." The quality of your features matters more than the algorithm you choose. Feature engineering transforms raw, messy data into clean, informative inputs that make models dramatically more accurate.
"Applied ML is basically feature engineering." — Andrew Ng
Step 1: Handle Missing Values
Real-world data always has gaps. Never delete rows blindly — understand why values are missing first.
Mean / Median Imputation
Fill missing numbers with the column mean (or median for skewed data).
df['age'] = df['age'].fillna(df['age'].median())
Best for: Numerical columns, missing at random
Mode Imputation
Fill missing categories with the most frequent value.
df['city'] = df['city'].fillna(df['city'].mode()[0])
Best for: Categorical columns
Add an "Is Missing" Flag
Create a new binary column indicating whether the value was missing.
df['age_missing'] = df['age'].isna().astype(int)
Best for: When missingness itself is informative
Drop Columns
If a column has >80% missing, it's usually not worth keeping.
df = df.drop(columns=['sparse_col'])
Best for: Columns with too little data to be useful
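A minimal sketch combining these strategies on a small, hypothetical DataFrame (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [34, np.nan, 28, 45, np.nan],
    'city': ['NYC', 'LA', None, 'NYC', 'LA'],
    'sparse_col': [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Flag missingness before imputing, in case it carries signal
df['age_missing'] = df['age'].isna().astype(int)

# Median for the numeric column, mode for the categorical one
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Drop the column that is mostly empty
df = df.drop(columns=['sparse_col'])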
Step 2: Encode Categorical Variables
ML models work with numbers. Text categories must be converted.
Original column:

| Color |
|---|
| Red |
| Blue |
| Green |
| Red |

After one-hot encoding:

| is_Red | is_Blue | is_Green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
Label Encoding
Converts categories to integers: Red=0, Blue=1, Green=2.
⚠ Implies ordering; only use for ordinal categories (Small < Medium < Large)
One-Hot Encoding
Creates a binary column per category. No false ordering.
✅ Use for nominal categories (colors, cities, brands)
Target Encoding
Replaces each category with the mean target value for that category.
✅ Great for high-cardinality columns (zip codes with 1000+ values)
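Target encoding has no one-line pandas helper; a minimal sketch using groupby means (the zip and defaulted columns are hypothetical) looks like this. In practice, compute the category means on the training split only, so the target does not leak into validation.

import pandas as pd

df = pd.DataFrame({'zip': ['10001', '10001', '94105', '94105', '94105'],
                   'defaulted': [1, 0, 0, 1, 0]})

# Replace each zip code with the mean target value observed for it
zip_means = df.groupby('zip')['defaulted'].mean()
df['zip_encoded'] = df['zip'].map(zip_means)

Label/ordinal and one-hot encoding, by contrast, have built-in helpers in pandas and scikit-learn: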
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red'], 'size': ['S', 'M', 'L', 'M']})

# One-hot encoding: one binary column per color
df_encoded = pd.get_dummies(df, columns=['color'])

# Ordinal encoding: map an ordered category (S < M < L) to 0, 1, 2
enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_enc'] = enc.fit_transform(df[['size']])

Step 3: Scale Numerical Features
Many algorithms (like KNN, SVM, neural networks) are sensitive to feature scale. A feature with values 0–100,000 will dominate one with values 0–1 unless you scale them.
Min-Max Scaling
Scales to [0, 1]. Sensitive to outliers.
(x - min) / (max - min)
Use for: Neural networks, image pixels
Standardization (Z-score)
Mean=0, Std=1. Handles outliers better than min-max scaling.
(x - mean) / std
Use for: SVM, logistic regression, PCA
Robust Scaling
Uses the median and IQR. Best for outlier-heavy data.
(x - median) / IQR
Use for: Data with many outliers
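A quick sketch of all three on a toy column with one outlier, using scikit-learn's built-in scalers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

print(MinMaxScaler().fit_transform(X).ravel())    # most values squashed near 0 by the outlier
print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1; the outlier still stretches the scale
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR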
Step 4: Create New Features
Sometimes the most valuable features don't exist in the raw data — you create them.
Date Features
Extract year, month, day of week, is_weekend, days_since_event from timestamps.
Ratio and Interaction Features
Combine two features, e.g. price_per_sqft = price / area.
Binning
Convert continuous age into buckets: 18-25, 26-35, 36-50, 50+.
Text Features
Word count, character count, sentiment score, TF-IDF from text columns.
Geographic Features
Distance to nearest city centre, cluster of coordinates.
Rolling Features
Rolling mean, max, min over a time window for time-series data.
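A few of these in pandas, on a small hypothetical property dataset (all column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    'sold_at': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-02-14']),
    'price': [300000, 450000, 250000],
    'area': [1200, 1500, 900],
    'buyer_age': [23, 41, 67],
})

# Date features
df['month'] = df['sold_at'].dt.month
df['day_of_week'] = df['sold_at'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Ratio feature
df['price_per_sqft'] = df['price'] / df['area']

# Binning a continuous feature
df['age_bucket'] = pd.cut(df['buyer_age'], bins=[17, 25, 35, 50, 120],
                          labels=['18-25', '26-35', '36-50', '50+'])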
Step 5: Feature Selection
More features ≠ better model. Irrelevant features add noise, slow training, and invite overfitting. A common starting point is to rank features with model.feature_importances_ from a trained Random Forest and drop the least useful ones, as sketched below.
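A minimal sketch on a synthetic dataset (the feature names and the top-10 cutoff are arbitrary choices):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))

# Keep only the highest-ranked features (the cutoff is a judgment call)
X_reduced = X[importances.head(10).index]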
Putting It All Together: Scikit-Learn Pipelines
Always wrap preprocessing into a Pipeline to prevent data leakage (accidentally using test data statistics to transform training data) and to make deployment clean.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
numeric_features = ['age', 'income', 'loan_amount']
categorical_features = ['city', 'employment_type']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Assumes X_train, X_test, y_train, y_test come from a train_test_split of your data
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Frequently Asked Questions
Should I scale before or after splitting data?
Always after splitting. Fit the scaler only on training data, then transform both train and test. Fitting on all data before splitting leaks information about the test set — your validation scores will be optimistically biased.
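A minimal sketch of the correct order, assuming X and y are already defined:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (learn mean/std) on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics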
Do tree-based models need feature scaling?
No. Decision Trees, Random Forests, and XGBoost are scale-invariant — they make decisions based on thresholds, not distances. Scaling won't hurt but it also won't help them.
How many features is too many?
The "curse of dimensionality" says performance degrades as features grow without more data. A rough rule: aim for at least 10–50 training examples per feature. For 100 features, aim for 1000–5000 examples minimum.