Feature Engineering
"Garbage in, garbage out." The quality of your features matters more than the algorithm you choose. Feature engineering transforms raw, messy data into clean, informative inputs that make models dramatically more accurate.
"Applied ML is basically feature engineering." — Andrew Ng
Step 1: Handle Missing Values
Real-world data always has gaps. Never delete rows blindly — understand why values are missing first.
Mean / Median Imputation
Fill missing numbers with the column mean (or median for skewed data).
df['age'] = df['age'].fillna(df['age'].median())
Best for: Numerical columns, missing at random
Mode Imputation
Fill missing categories with the most frequent value.
df['city'] = df['city'].fillna(df['city'].mode()[0])
Best for: Categorical columns
Add an "Is Missing" Flag
Create a new binary column indicating whether the value was missing.
df['age_missing'] = df['age'].isna().astype(int)
Best for: When missingness itself is informative
Drop Columns
If a column has >80% missing, it's usually not worth keeping.
df = df.drop(columns=['sparse_col'])
Best for: Columns with too little data to be useful
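A minimal sketch combining these strategies on a small, hypothetical DataFrame (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [34, np.nan, 28, 45, np.nan],
    'city': ['NYC', 'LA', None, 'NYC', 'LA'],
    'sparse_col': [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Flag missingness before imputing, in case it carries signal
df['age_missing'] = df['age'].isna().astype(int)

# Median for the numeric column, mode for the categorical one
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Drop the column that is mostly empty
df = df.drop(columns=['sparse_col'])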
Step 2: Encode Categorical Variables
ML models work with numbers. Text categories must be converted.
Original column:

| Color |
|---|
| Red |
| Blue |
| Green |
| Red |

After one-hot encoding:

| is_Red | is_Blue | is_Green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
Label Encoding
Converts categories to integers: Red=0, Blue=1, Green=2.
⚠ Implies ordering; only use for ordinal categories (Small < Medium < Large)
One-Hot Encoding
Creates a binary column per category. No false ordering.
✅ Use for nominal categories (colors, cities, brands)
Target Encoding
Replaces each category with the mean target value for that category.
✅ Great for high-cardinality columns (zip codes with 1000+ values)
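Target encoding has no one-line pandas helper; a minimal sketch using groupby means (the zip and defaulted columns are hypothetical) looks like this. In practice, compute the category means on the training split only, so the target does not leak into validation.

import pandas as pd

df = pd.DataFrame({'zip': ['10001', '10001', '94105', '94105', '94105'],
                   'defaulted': [1, 0, 0, 1, 0]})

# Replace each zip code with the mean target value observed for it
zip_means = df.groupby('zip')['defaulted'].mean()
df['zip_encoded'] = df['zip'].map(zip_means)

Label/ordinal and one-hot encoding, by contrast, have built-in helpers in pandas and scikit-learn: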
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red'], 'size': ['S', 'M', 'L', 'M']})

# One-hot encoding: one binary column per color
df_encoded = pd.get_dummies(df, columns=['color'])

# Ordinal encoding: map an ordered category (S < M < L) to 0, 1, 2
enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_enc'] = enc.fit_transform(df[['size']])

Step 3: Scale Numerical Features
Many algorithms (like KNN, SVM, neural networks) are sensitive to feature scale. A feature with values 0–100,000 will dominate one with values 0–1 unless you scale them.
Min-Max Scaling
Scales to [0, 1]. Sensitive to outliers.
(x - min) / (max - min)
Use for: Neural networks, image pixels
Standardization (Z-score)
Mean=0, Std=1. Handles outliers better than min-max scaling.
(x - mean) / std
Use for: SVM, logistic regression, PCA
Robust Scaling
Uses the median and IQR. Best for outlier-heavy data.
(x - median) / IQR
Use for: Data with many outliers
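A quick sketch of all three on a toy column with one outlier, using scikit-learn's built-in scalers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

print(MinMaxScaler().fit_transform(X).ravel())    # most values squashed near 0 by the outlier
print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1; the outlier still stretches the scale
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR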
Step 4: Create New Features
Sometimes the most valuable features don't exist in the raw data — you create them.
Date Features
Extract year, month, day of week, is_weekend, days_since_event from timestamps.
Ratio and Interaction Features
Combine two features, e.g. price_per_sqft = price / area.
Binning
Convert continuous age into buckets: 18-25, 26-35, 36-50, 50+.
Text Features
Word count, character count, sentiment score, TF-IDF from text columns.
Geographic Features
Distance to nearest city centre, cluster of coordinates.
Rolling Features
Rolling mean, max, min over a time window for time-series data.
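A few of these in pandas, on a small hypothetical property dataset (all column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    'sold_at': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-02-14']),
    'price': [300000, 450000, 250000],
    'area': [1200, 1500, 900],
    'buyer_age': [23, 41, 67],
})

# Date features
df['month'] = df['sold_at'].dt.month
df['day_of_week'] = df['sold_at'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Ratio feature
df['price_per_sqft'] = df['price'] / df['area']

# Binning a continuous feature
df['age_bucket'] = pd.cut(df['buyer_age'], bins=[17, 25, 35, 50, 120],
                          labels=['18-25', '26-35', '36-50', '50+'])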
Step 5: Feature Selection
More features ≠ better model. Irrelevant features add noise, slow training, and invite overfitting. A common starting point is to rank features with model.feature_importances_ from a trained Random Forest and drop the least useful ones, as sketched below.
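A minimal sketch on a synthetic dataset (the feature names and the top-10 cutoff are arbitrary choices):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))

# Keep only the highest-ranked features (the cutoff is a judgment call)
X_reduced = X[importances.head(10).index]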
Putting It All Together: Scikit-Learn Pipelines
Always wrap preprocessing into a Pipeline to prevent data leakage (accidentally using test data statistics to transform training data) and to make deployment clean.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
numeric_features = ['age', 'income', 'loan_amount']
categorical_features = ['city', 'employment_type']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Assumes X_train, X_test, y_train, y_test come from a train_test_split of your data
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Frequently Asked Questions
Should I scale before or after splitting data?
Always after splitting. Fit the scaler only on training data, then transform both train and test. Fitting on all data before splitting leaks information about the test set — your validation scores will be optimistically biased.
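A minimal sketch of the correct order, assuming X and y are already defined:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (learn mean/std) on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics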
Do tree-based models need feature scaling?
No. Decision Trees, Random Forests, and XGBoost are scale-invariant — they make decisions based on thresholds, not distances. Scaling won't hurt but it also won't help them.
How many features is too many?
The "curse of dimensionality" says performance degrades as features grow without more data. A rough rule: aim for at least 10–50 training examples per feature. For 100 features, aim for 1000–5000 examples minimum.