Feature Stores
Training-serving skew — where features computed at training time differ subtly from those computed at serving time — is one of the most insidious production ML bugs. Feature stores solve this by centralising feature computation, ensuring the exact same logic runs in both training pipelines and real-time serving, and enabling feature reuse across teams and models.
The Problem: Training-Serving Skew
Consider a fraud model that uses "number of transactions in the last hour" as a feature. During training, this is computed from historical batch data. In production, it's computed in real time from a streaming database. Any difference in how "last hour" is computed — timezone handling, null treatment, windowing boundaries — silently degrades model performance.
Training pipeline: Python script reads from S3 → computes features with Pandas. Serving pipeline: Java microservice reads from Redis → computes features differently. Skew is invisible until production accuracy drops.
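The skew described above can be reproduced in a few lines. This is a sketch with hypothetical data for a single customer, where the only difference between the two implementations of "transactions in the last hour" is whether the window boundary is inclusive:

```python
from datetime import datetime, timedelta

# Hypothetical transaction timestamps for one customer.
now = datetime(2024, 1, 1, 12, 0, 0)
tx_times = [now - timedelta(minutes=m) for m in (5, 30, 59, 60, 61)]

def tx_count_training(txs, as_of):
    # Batch pipeline: window is inclusive of the boundary (>= as_of - 1h).
    cutoff = as_of - timedelta(hours=1)
    return sum(1 for t in txs if t >= cutoff)

def tx_count_serving(txs, as_of):
    # Serving pipeline: window is exclusive of the boundary (> as_of - 1h).
    cutoff = as_of - timedelta(hours=1)
    return sum(1 for t in txs if t > cutoff)

print(tx_count_training(tx_times, now))  # 4 (includes the tx exactly 60 min ago)
print(tx_count_serving(tx_times, now))   # 3
```

A one-character difference (`>=` vs `>`) yields different feature values for the same customer at the same moment, and nothing crashes: the model simply sees different inputs in training and serving.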
The feature store fix: one feature definition, written once in Python. Training retrieves historical, point-in-time correct features from the offline store; serving retrieves the same feature from the online store. Same logic, guaranteed consistency.
Feature Store Architecture
Offline Store
Historical feature values for training. Backed by data warehouses (BigQuery, Snowflake, S3 + Parquet). Supports point-in-time correct joins — retrieves the feature value as it existed at the time of each training example (prevents data leakage).
Online Store
Low-latency feature retrieval for real-time inference. Backed by Redis, DynamoDB, or Bigtable. Features are pre-computed and cached. P99 latency <5ms for most use cases.
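A dict-backed sketch of what the online store does (Redis, DynamoDB, or Bigtable in practice): feature values are precomputed and written under entity keys, so serving is a single key lookup rather than a computation. The key format and function name here are illustrative, not a real store's API:

```python
# A dict stands in for Redis/DynamoDB: features are precomputed offline
# and written here keyed by entity id, so serving is one key lookup.
online_store = {
    "customer:42": {"tx_count_1h": 7, "tx_amount_avg": 31.5},
    "customer:43": {"tx_count_1h": 0, "tx_amount_avg": 0.0},
}

def lookup_online_features(customer_id, feature_names):
    row = online_store.get(f"customer:{customer_id}", {})
    # Unknown entities return None per feature rather than failing the request.
    return {name: row.get(name) for name in feature_names}

print(lookup_online_features(42, ["tx_count_1h"]))  # {'tx_count_1h': 7}
```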
Feast: Open-Source Feature Store
Feast is the most widely used open-source feature store. Here's a complete example:
```python
# 1. Define feature views (feature_repo/features.py)
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

customer = Entity(name="customer_id", description="Customer identifier")

customer_stats = FeatureView(
    name="customer_transaction_stats",
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name="tx_count_1h", dtype=Int64),
        Field(name="tx_amount_avg", dtype=Float32),
        Field(name="days_since_join", dtype=Int64),
    ],
    source=FileSource(
        path="data/customer_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)

# 2. Register the definitions with the registry:
#    feast apply

# 3. Training: retrieve historical features
import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")
training_df = store.get_historical_features(
    entity_df=pd.DataFrame({
        "customer_id": [1, 2, 3],
        "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01"]),
    }),
    features=[
        "customer_transaction_stats:tx_count_1h",
        "customer_transaction_stats:tx_amount_avg",
    ],
).to_df()

# 4. Serving: retrieve online features (after materialisation):
#    feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
online_features = store.get_online_features(
    features=["customer_transaction_stats:tx_count_1h"],
    entity_rows=[{"customer_id": 42}],
).to_dict()
```
Point-in-Time Correct Joins
The most critical feature store capability. During training, naive joins use the latest feature value — but this leaks future information. If you're predicting whether a customer will churn in March, you shouldn't use their April transaction count as a feature.
A model trained with future feature values will look great in offline evaluation but fail completely in production. Point-in-time joins ensure each training row only uses feature values available before the prediction timestamp. Always use this for time-series and event-based ML problems.
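Point-in-time correctness can be sketched with pandas' `merge_asof`, which joins each label row to the most recent feature value at or before its timestamp (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical feature history: tx_count as it stood at each timestamp.
features = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "event_timestamp": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-04-01"]),
    "tx_count": [10, 12, 30],
}).sort_values("event_timestamp")

# Training example: we predict churn as of 2024-03-15.
labels = pd.DataFrame({
    "customer_id": [1],
    "event_timestamp": pd.to_datetime(["2024-03-15"]),
    "churned": [True],
}).sort_values("event_timestamp")

# merge_asof (direction="backward" by default) picks the most recent
# feature value at or before each label timestamp, so the April value
# (30) is correctly excluded -- no future leakage.
training_df = pd.merge_asof(
    labels, features, on="event_timestamp", by="customer_id"
)
print(training_df["tx_count"].iloc[0])  # 12
```

A naive `merge` on `customer_id` alone would happily hand the model the April count; feature stores run this as-of logic for you across every feature view.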
When Do You Need a Feature Store?
You Need One When...
- Multiple models use the same features
- You've had training-serving skew bugs
- Feature computation takes >10 minutes
- >3 data scientists sharing features
- Real-time feature latency matters (<10ms)
- Compliance requires feature auditability
Probably Don't Need One If...
- Only 1–2 models in production
- Features are simple (no windowing/aggregation)
- Batch predictions only (no real-time serving)
- Small team with tight code review
- Early-stage project, still iterating
- Infrastructure cost is a concern
Before a full feature store, a shared feature computation library (a Python package that both training and serving import) eliminates most skew. This is 20% of the complexity for 80% of the benefit. Adopt Feast when feature sharing across teams becomes the bottleneck.
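A minimal sketch of that shared-library approach, assuming a hypothetical `tx_count_1h` feature that both the training job and the serving service import from the same module:

```python
# features.py -- hypothetical shared module imported by BOTH the training
# job and the serving service, so the windowing logic exists exactly once.
from datetime import datetime, timedelta
from typing import Iterable

def tx_count_1h(tx_times: Iterable[datetime], as_of: datetime) -> int:
    """Transactions in the hour before `as_of` (inclusive lower bound)."""
    cutoff = as_of - timedelta(hours=1)
    return sum(1 for t in tx_times if cutoff <= t <= as_of)

# Training applies it row-by-row to historical data; serving applies it
# to the live transaction list. Either way, same boundary semantics.
now = datetime(2024, 1, 1, 12, 0)
txs = [now - timedelta(minutes=m) for m in (5, 45, 90)]
print(tx_count_1h(txs, now))  # 2
```

The boundary decision (inclusive vs exclusive) is made once, documented once, and tested once, which is exactly the skew-elimination property a feature store provides, minus the storage layers.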
Frequently Asked Questions
How does a feature store handle real-time (streaming) features?
Streaming features require a stream processor (Kafka + Flink/Spark Streaming) to continuously compute aggregations and write to the online store. Tecton and Hopsworks have native streaming support. With Feast, you handle stream processing separately (e.g., with Flink) and write results to Feast's online store. Real-time features add significant infrastructure complexity — batch features recomputed hourly cover most use cases.
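A toy sketch of what the stream processor maintains, assuming a per-customer sliding one-hour window held in memory (in production this state lives in Flink or Spark Streaming, and each updated count is written to the online store):

```python
from collections import deque
from datetime import datetime, timedelta

class SlidingHourCounter:
    """Toy per-customer stream aggregator: what Flink/Spark Streaming
    would maintain before upserting the count into the online store."""

    def __init__(self):
        self.events = deque()

    def observe(self, ts: datetime) -> int:
        self.events.append(ts)
        # Evict events older than one hour relative to the newest event.
        cutoff = ts - timedelta(hours=1)
        while self.events and self.events[0] < cutoff:
            self.events.popleft()
        return len(self.events)  # value to upsert into the online store

counter = SlidingHourCounter()
base = datetime(2024, 1, 1, 12, 0)
for minutes in (0, 10, 50, 75):
    count = counter.observe(base + timedelta(minutes=minutes))
print(count)  # 2: the events at t=0 and t=10 fell out of the window
```

Real stream processors add what this sketch ignores: out-of-order events, checkpointed state for fault tolerance, and partitioning across workers — the "significant infrastructure complexity" mentioned above.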
What is feature materialisation?
Materialisation is the process of computing and writing feature values into the online store so they can be retrieved at serving time with low latency. You run feast materialize (batch) or feast materialize-incremental on a schedule (e.g., every hour) to keep the online store fresh. Without materialisation, online feature retrieval would require recomputing from raw data on every request — too slow.
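Conceptually, materialisation is "take the latest feature row per entity from the offline store and upsert it into the online store". A sketch with hypothetical data, using a dict as the stand-in online store:

```python
import pandas as pd

# Offline store: full feature history (normally Parquet or a warehouse).
offline = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "tx_count_1h": [4, 7, 2],
})

# Materialisation: keep only the latest row per entity, then upsert it
# into the online store (a dict stands in for Redis here).
latest = offline.sort_values("event_timestamp").groupby("customer_id").last()
online_store = {
    cid: {"tx_count_1h": int(row["tx_count_1h"])}
    for cid, row in latest.iterrows()
}
print(online_store[1]["tx_count_1h"])  # 7 (the 2024-01-02 value)
```

Running this on a schedule is what `feast materialize-incremental` automates, along with tracking which time ranges have already been written.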
Can I use a feature store with LLMs?
Yes, for structured features used alongside LLMs. In a RAG system, you might store pre-computed document embeddings in a feature store / vector database, or use a feature store to provide user context features (subscription tier, language preference, interaction history) that get included in the LLM prompt. The feature store handles the structured data side; the vector database handles semantic retrieval.