Cloud Storage Types Explained
Not all storage is the same. A photo album, a database, and a 10TB training dataset each need a different kind of storage. Picking the wrong type wastes money, hurts performance, and can break applications. Here's how to get it right.
Object Storage — The Cloud's Filing Cabinet
Object storage stores data as discrete "objects" — files with metadata and a unique identifier. You access objects via HTTP APIs (PUT, GET, DELETE). There's no directory hierarchy in the traditional sense — just a flat namespace, though you can simulate folders with prefixes in the key names.
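The flat-namespace-with-prefixes idea can be sketched in a few lines of Python. This is an illustrative in-memory stand-in (a plain dict), not a real S3 client; the point is that "folders" are just characters in the key name:

```python
# Object stores keep a flat key space; "folders" are simulated with key prefixes.
bucket = {
    "photos/2024/beach.jpg": b"...",
    "photos/2024/city.jpg": b"...",
    "models/llama-ckpt-001.bin": b"...",
}

def list_keys(bucket, prefix=""):
    """Mimic an object store's prefix filter: plain string matching, no directories."""
    return sorted(k for k in bucket if k.startswith(prefix))

print(list_keys(bucket, prefix="photos/2024/"))
# Both "folder" levels are just part of the key string.
```

Real object-store list APIs (like S3's ListObjectsV2) work the same way: you pass a prefix, and the service filters keys by string match.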
AWS S3, Google Cloud Storage, Azure Blob Storage
These are the canonical examples. You create a "bucket," upload files to it, and retrieve them via URL. Capacity is effectively unlimited, data is replicated across multiple availability zones by default, and pricing is per GB stored plus per request. S3 alone stores exabytes of data for companies like Netflix and Airbnb.
Object Storage for AI & ML
This is the dominant storage type for AI workloads. Training datasets (image files, text corpora, audio clips), model checkpoints during training, and final model artifacts all live in object storage. A typical LLM training dataset might be 10–100TB on S3. S3 scales request throughput automatically, supporting thousands of requests per second per key prefix, which allows huge numbers of parallel reads — essential for feeding hungry GPU clusters.
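The parallel-read pattern can be sketched with a thread pool. Here `fetch_object` is a hypothetical stand-in for an object GET (e.g., a real client's `get_object` call), so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_object(key: str) -> bytes:
    """Stand-in for an object-store GET; a real pipeline would download bytes here."""
    return f"data-for-{key}".encode()

# Training data is typically sharded into many objects with a common prefix.
keys = [f"dataset/shard-{i:05d}.tar" for i in range(100)]

# Data loaders issue many GETs concurrently so the GPUs never sit idle
# waiting on a single sequential download.
with ThreadPoolExecutor(max_workers=16) as pool:
    shards = list(pool.map(fetch_object, keys))

print(len(shards), "shards fetched")
```

In production, frameworks like PyTorch DataLoader workers or WebDataset apply this same idea: many concurrent readers pulling shards from object storage.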
Block Storage — The Cloud's Hard Drive
Block storage presents raw storage volumes to a VM — like attaching a hard drive. The OS formats it, creates a file system, and uses it exactly like a local disk. It's fast, low-latency, and designed for transactional workloads.
AWS EBS, Google Persistent Disk, Azure Managed Disks
When you launch an EC2 instance, it comes with a root EBS volume (your OS disk). You can attach additional EBS volumes for data. EBS volumes are fixed-size, persistent, and attached to one instance at a time. They range from cheap throughput-optimized HDD volumes (st1, sc1) through general-purpose SSDs (gp3) to provisioned-IOPS SSDs (io2) for databases requiring extreme IOPS.
Block Storage for AI
GPU training jobs often use local NVMe SSDs (instance storage) for the highest throughput during training — data is staged here from S3. For databases behind AI applications (user data, feature stores), EBS or equivalent is standard. Persistent block storage also backs Kubernetes PersistentVolumes for stateful workloads.
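The staging step described above (object storage to local scratch disk) can be sketched as follows. The `files` dict is a hypothetical stand-in for shards already downloaded from S3; a real job would stream them with an S3 client:

```python
import shutil
import tempfile
from pathlib import Path

def stage_to_scratch(files: dict, scratch_dir: Path) -> list:
    """Write dataset shards onto fast local disk before training starts.

    `files` maps shard names to their bytes (stand-in for downloaded objects).
    """
    staged = []
    for name, data in files.items():
        local = scratch_dir / name
        local.write_bytes(data)   # on a real node, this lands on local NVMe
        staged.append(local)
    return staged

# Use a temp dir as a stand-in for an NVMe scratch mount like /mnt/nvme.
scratch = Path(tempfile.mkdtemp(prefix="nvme-scratch-"))
paths = stage_to_scratch({"shard-0.tar": b"...", "shard-1.tar": b"..."}, scratch)
print([p.name for p in paths])
shutil.rmtree(scratch)  # scratch space is ephemeral; clean up after the job
```

The design point: local NVMe is ephemeral (lost when the instance stops), so it holds only re-downloadable copies, never the source of truth, which stays in object storage.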
File Storage — The Cloud's Network Share
File storage is a shared file system that multiple VMs can mount simultaneously. It behaves like a traditional NAS (Network Attached Storage): it appears as a directory, and multiple machines can read and write files to it at the same time.
AWS EFS, Google Filestore, Azure Files
Ideal when multiple servers need shared read-write access to the same files. AWS EFS auto-scales with your data — no need to pre-provision capacity. Azure Files is great for lifting on-premises SMB file shares to the cloud without changing how applications access them.
File Storage for AI
Distributed training jobs that run across multiple nodes and need to read the same checkpoints or share configuration files use file storage. High-performance parallel file systems like Lustre (available as AWS FSx for Lustre) are optimized for exactly this pattern, delivering hundreds of GB/s of aggregate throughput from storage to GPU clusters.
Choosing the Right Storage Type
| Type | Access Method | Latency | Concurrency | Best For |
|---|---|---|---|---|
| Object (S3) | HTTP API | Medium (tens of ms) | Effectively unlimited | Datasets, models, backups |
| Block (EBS) | Mounted disk | Very low (sub-ms) | Single VM | OS disk, databases |
| File (EFS) | NFS mount | Low | Many VMs | Shared workloads |
| Local NVMe | Mounted disk | Ultra-low | Single VM | Scratch space for training |
Frequently Asked Questions
What is a data lake and how does it relate to cloud storage?
A data lake is a central repository that stores raw data in its native format at any scale. In practice, it's usually S3 (or Google Cloud Storage / Azure Data Lake Storage) with a structured folder hierarchy and governance layer on top. For AI, the data lake is where you collect and store all raw training data — logs, user interactions, sensor data — before it's processed into curated datasets for model training.
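The "structured folder hierarchy" mentioned above is, again, just a prefix convention on top of flat object keys. A minimal sketch of one common layout (zone names and Hive-style date partitions here are illustrative conventions, not S3 features):

```python
from datetime import date

def lake_key(zone: str, source: str, day: date, filename: str) -> str:
    """Build a partitioned data-lake object key.

    Zones (raw/processed/curated) and year=/month=/day= partitions are
    naming conventions that query engines like Athena or Spark understand.
    """
    return (f"{zone}/{source}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{filename}")

print(lake_key("raw", "clickstream", date(2024, 5, 7), "events-001.json"))
# raw/clickstream/year=2024/month=05/day=07/events-001.json
```

Partitioning keys by date like this lets downstream processing jobs list and read only the prefixes they need instead of scanning the whole lake.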
How much does cloud storage cost?
S3 Standard costs ~$0.023/GB/month, plus ~$0.0004 per 1,000 GET requests. EBS gp3 costs ~$0.08/GB/month. EFS costs ~$0.30/GB/month. For a 100TB training dataset on S3, you'd pay ~$2,300/month in storage alone (plus egress costs when downloading to GPUs). This is why dataset storage and retrieval efficiency matters enormously in AI infrastructure economics.
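The arithmetic behind the $2,300/month figure is straightforward. A small sketch using the list prices quoted above (which change over time, so treat them as illustrative):

```python
# Approximate list prices quoted in the text; verify against current pricing pages.
S3_STANDARD_PER_GB = 0.023   # USD per GB-month
GET_PER_1000 = 0.0004        # USD per 1,000 GET requests

def s3_monthly_cost(dataset_tb: float, monthly_gets: int = 0) -> float:
    """Rough monthly S3 Standard bill: storage plus GET request charges."""
    storage = dataset_tb * 1024 * S3_STANDARD_PER_GB
    requests = monthly_gets / 1000 * GET_PER_1000
    return storage + requests

print(round(s3_monthly_cost(100)))  # 100 TB of storage -> ~2355 USD/month
```

Note that egress (data transfer out) is billed separately and often dominates when large datasets are repeatedly pulled to GPU clusters outside the region.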
What is S3 Glacier and when would I use it?
S3 Glacier is an archival storage tier — extremely cheap ($0.004/GB/month) but with retrieval times ranging from minutes to hours. Use it for data you almost never access but must retain (regulatory archives, old model checkpoints, audit logs). You can set lifecycle policies to automatically move S3 objects to Glacier after 90 or 180 days, significantly reducing storage costs for older data.
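A lifecycle rule is essentially an age check applied to each object. A minimal sketch of the decision logic (the function and threshold are illustrative; in AWS you'd declare this as a bucket lifecycle configuration rather than write code):

```python
from datetime import date

def storage_class(last_modified: date, today: date, archive_after_days: int = 90) -> str:
    """Mimic a lifecycle rule: transition objects older than the cutoff to Glacier."""
    age_days = (today - last_modified).days
    return "GLACIER" if age_days >= archive_after_days else "STANDARD"

today = date(2024, 6, 1)
print(storage_class(date(2024, 1, 15), today))  # old checkpoint -> GLACIER
print(storage_class(date(2024, 5, 20), today))  # recent data -> STANDARD
```

In practice the equivalent rule is attached to the bucket once, and S3 evaluates and transitions objects automatically with no application code involved.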