fsync & fdatasync
Writing to a file doesn't mean it's on disk. Linux buffers writes in RAM for performance. If power dies before the buffer flushes, your data is gone. fsync() is the system call that forces data to stable storage — the foundation of durable databases and reliable file systems.
The Durability Problem
Why isn't a successful write() call enough?
write() copies data to the kernel's page cache — in RAM. The kernel writes it to disk later (asynchronously). If the system crashes between write() and the actual disk flush, data is lost. For most apps this is fine. For a database storing a financial transaction, it's catastrophic. fsync() bridges the gap.
Typical write path:
write(fd, data, len)
→ data lands in page cache (RAM) ← fast, returns immediately
→ page marked "dirty"
→ kernel writeback (flusher) threads write to disk eventually ← could be 30 seconds later
Power outage here = data lost
With fsync():
write(fd, data, len) ← write to page cache
fsync(fd) ← blocks until data+metadata on disk
return OK ← now safe from power loss
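The two paths above can be sketched in a few lines. This is a minimal illustration (the filename and helper name are made up for the example): `os.write` alone only reaches the page cache; `os.fsync` is the call that blocks until the kernel has pushed data and metadata to stable storage.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and block until it reaches stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)  # data lands in the page cache (RAM), returns fast
        os.fsync(fd)        # blocks until data + metadata are on disk
    finally:
        os.close(fd)

durable_write("example.txt", b"committed\n")
```

Without the `os.fsync` line, a power loss after `durable_write` returns could still discard the data.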
fsync vs fdatasync — What's the Difference?
| | fsync(fd) | fdatasync(fd) |
|---|---|---|
| Flushes file data | Yes | Yes |
| Flushes metadata (mtime, size) | Yes | Only if needed for data recovery |
| Speed | Slower (typically 2 disk writes) | Faster (often 1 disk write) |
| Use case | When metadata correctness matters | When you just need data safe |
| Used by | Filesystem operations | Most databases (PostgreSQL, SQLite) |
When does fdatasync flush metadata?
Only when recovery requires it — specifically, when the file size changed because new data was appended. If you overwrote existing bytes, the file size didn't change, so the metadata update isn't needed to retrieve the data, and fdatasync skips it. That saves one disk round-trip.
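The overwrite-in-place case can be demonstrated directly (filename is illustrative; `os.fdatasync` is available on Linux). The first write extends the file, so a full fsync is appropriate; the second write replaces bytes without changing the size, so a data-only flush suffices for recovery.

```python
import os

fd = os.open("records.db", os.O_RDWR | os.O_CREAT, 0o644)

os.pwrite(fd, b"X" * 64, 0)   # preallocate a 64-byte record: size changes
os.fsync(fd)                  # size changed, so flush data + metadata

os.pwrite(fd, b"Y" * 64, 0)   # overwrite in place: size is unchanged
os.fdatasync(fd)              # data-only flush is enough for recovery

os.close(fd)
```

This preallocate-then-overwrite pattern is one reason databases often reserve file space up front: subsequent syncs can take the cheaper fdatasync path.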
Write Ordering — The Hard Problem
Why do databases write to a WAL (Write-Ahead Log) first?
Crash recovery requires knowing what was intended. A database could be in the middle of updating 10 related pages when power dies. To recover, it needs a record of what to redo or undo. The WAL is that record — write the intent first, fsync it, then apply the change. On crash, replay the WAL. This pattern (WAL + fsync) is how every durable database works.
# The write-ahead logging pattern:
# 1. Write operation to WAL (journal)
# 2. fsync(wal_fd) ← intent is now durable
# 3. Apply change to data file
# 4. fsync(data_fd) ← change is now durable
# 5. Mark WAL entry committed
# If crash between step 2 and 4:
# Recovery: replay WAL entry → apply change → done
# If crash before step 2:
# Recovery: WAL entry incomplete → ignore → consistent state
# PostgreSQL example: pg_wal/ directory = write-ahead log
# SQLite: -wal file = WAL file (in WAL mode)
# MySQL InnoDB: ib_logfile0/ib_logfile1 = redo log
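The five steps above can be sketched as code. This is a toy, not a real storage engine: the file names, record format, and `commit` helper are all invented for illustration, and step 5 is left as a comment because real engines checkpoint and truncate the WAL separately.

```python
import json
import os

def commit(wal_fd: int, data_fd: int, offset: int, payload: bytes) -> None:
    record = json.dumps({"off": offset, "data": payload.hex()}) + "\n"
    os.write(wal_fd, record.encode())    # 1. write intent to the WAL
    os.fdatasync(wal_fd)                 # 2. intent is now durable
    os.pwrite(data_fd, payload, offset)  # 3. apply change to the data file
    os.fdatasync(data_fd)                # 4. change is now durable
    # 5. a real engine would now mark the WAL entry committed / truncate it

wal_fd = os.open("db.wal", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
data_fd = os.open("db.dat", os.O_RDWR | os.O_CREAT, 0o644)
commit(wal_fd, data_fd, 0, b"hello")
os.close(wal_fd)
os.close(data_fd)
```

The ordering is the point: the WAL sync (step 2) must complete before the data file is touched, otherwise a crash could leave a modified data file with no record of the intent.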
All the sync Variants
# fsync(fd) — flush data + metadata for one file
# fdatasync(fd) — flush data (+ metadata only if needed)
# sync() — flush ALL dirty pages systemwide (slow!)
# syncfs(fd) — flush all dirty pages on the filesystem containing fd
# O_SYNC flag — fsync on every write (no separate fsync needed)
open("file", O_WRONLY | O_SYNC)
# Every write() blocks until data is on disk
# High overhead — use only for truly critical single writes
# O_DSYNC — like O_SYNC but data only (like fdatasync per-write)
open("file", O_WRONLY | O_DSYNC)
# From shell — force flush everything:
sync # blocks until all dirty pages written to disk
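The per-write flags map straightforwardly onto `os.open` (filename is illustrative; `os.O_DSYNC` is exposed on Linux). Each `os.write` below blocks until the write is durable, with no separate sync call needed.

```python
import os

# O_DSYNC: every write() behaves like write + fdatasync (data only)
fd = os.open("audit.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
os.write(fd, b"entry 1\n")   # blocks until the data itself is on disk
os.close(fd)

# O_SYNC: every write() behaves like write + fsync (data + metadata)
fd = os.open("audit.log", os.O_WRONLY | os.O_APPEND | os.O_SYNC)
os.write(fd, b"entry 2\n")   # blocks until data AND metadata are on disk
os.close(fd)
```

These flags are convenient for append-only audit logs where every entry must survive a crash, but batching writes and calling fsync/fdatasync once is usually faster.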
Group Commit — Making fsync Efficient
Doesn't calling fsync after every transaction make databases slow?
Yes — each fsync is 5-10ms on an HDD, which limits you to ~100 commits/second. The solution is group commit: batch multiple transactions' WAL writes together, then fsync once for all of them. PostgreSQL, MySQL InnoDB, and SQLite all implement this. One fsync covers hundreds of concurrent commits, making throughput far higher than sequential fsyncs.
# Without group commit:
T1: write WAL → fsync → commit (5ms)
T2: write WAL → fsync → commit (5ms)
T3: write WAL → fsync → commit (5ms)
# Total: 15ms, 3 fsyncs
# With group commit:
T1, T2, T3: all write WAL
→ one fsync for all three (5ms)
→ T1, T2, T3 all commit
# Total: 5ms, 1 fsync — 3x throughput
# PostgreSQL tuning:
# synchronous_commit = on (default, safe)
# synchronous_commit = off (data loss risk, faster)
# commit_delay = 100 (microseconds, enables group commit)
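A toy version of the batching idea (class and file names are invented; real engines run a dedicated flusher and make committers block until their batch's fsync completes): writers queue records, and a single flush covers the whole batch with one fsync.

```python
import os
import threading

class GroupCommitLog:
    """Toy group commit: writers queue records; one fsync covers the batch."""

    def __init__(self, path: str) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()     # writers may be concurrent
        self.pending: list[bytes] = []

    def commit(self, record: bytes) -> None:
        with self.lock:
            self.pending.append(record)  # queue only; durability comes at flush

    def flush(self) -> int:
        """Write and fsync every queued record; returns the batch size."""
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            os.write(self.fd, b"".join(batch))
            os.fsync(self.fd)            # ONE fsync for the whole batch
        return len(batch)

log = GroupCommitLog("group.wal")
for i in range(3):
    log.commit(f"txn {i}\n".encode())
assert log.flush() == 3                  # three commits, one fsync
```

The throughput win is exactly the arithmetic above: the fsync cost is paid once per batch instead of once per transaction.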
The Disk Cache Problem
Does fsync guarantee data is on the physical platter?
Not always. Drives have their own internal cache. As part of fsync, Linux issues a cache-flush command telling the drive to empty that cache to persistent media. But some drives lie — they report success without actually flushing (write cache enabled, no power-loss protection). Enterprise drives and RAID controllers often have battery- or capacitor-backed caches that make this safe; consumer drives may not. This is why enterprise storage for databases matters: you need hardware that actually honors fsync.
# Check if write cache is enabled on a drive
hdparm -W /dev/sda
# /dev/sda:
# write-caching = 1 (on) ← drive has write cache enabled
# Disable write cache (slower but safer without battery backup)
hdparm -W0 /dev/sda
# On SSDs: check if the drive supports FUA (Force Unit Access)
# FUA lets writes bypass cache directly to persistent storage
# NVMe drives typically support this; Linux uses it automatically