fsync & fdatasync

Writing to a file doesn't mean it's on disk. Linux buffers writes in RAM for performance. If power dies before the buffer flushes, your data is gone. fsync() is the system call that forces data to stable storage — the foundation of durable databases and reliable file systems.

The Durability Problem

Why isn't a successful write() call enough? write() copies data to the kernel's page cache — in RAM. The kernel writes it to disk later (asynchronously). If the system crashes between write() and the actual disk flush, data is lost. For most apps this is fine. For a database storing a financial transaction, it's catastrophic. fsync() bridges the gap.
Typical write path:

  write(fd, data, len)
    → data lands in page cache (RAM)                  ← fast, returns immediately
    → page marked "dirty"
    → kernel flusher threads (formerly pdflush)
      write it to disk eventually                     ← could be 30 seconds later
  Power outage here = data lost

With fsync():

  write(fd, data, len)   ← write to page cache
  fsync(fd)              ← blocks until data + metadata are on disk
  return OK              ← now safe from power loss
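As a concrete illustration, here is a minimal C sketch of the safe pattern, assuming a hypothetical log file and ignoring short-write handling for brevity:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* "record.log" is a hypothetical file name for this sketch */
    int fd = open("record.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *data = "transaction: id=42 amount=100\n";
    if (write(fd, data, strlen(data)) < 0) {   /* lands in page cache only */
        perror("write"); return 1;
    }
    if (fsync(fd) < 0) {                       /* blocks until on stable storage */
        perror("fsync"); return 1;
    }
    /* Only now is it safe to tell the caller the data is durable. */
    close(fd);
    return 0;
}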

fsync vs fdatasync — What's the Difference?

                                  fsync(fd)                           fdatasync(fd)
Flushes file data                 Yes                                 Yes
Flushes metadata (mtime, size)    Yes                                 Only if needed for data recovery
Speed                             Slower (typically 2 disk writes)    Faster (often 1 disk write)
Use case                          When metadata correctness matters   When you just need data safe
Used by                           Filesystem operations               Most databases (PostgreSQL, SQLite)
When does fdatasync flush metadata? Only when recovery requires it — specifically when the file size changed (new data was appended). If you overwrote existing bytes, the file size didn't change, metadata update isn't needed for recovery, and fdatasync skips it. This saves one disk round-trip.
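To see why this matters in practice, here is a hedged C sketch: preallocate the file once (paying the metadata flush up front), then overwrite bytes in place so each later fdatasync needs only the data write. The file name and sizes are illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* "fixed.dat" is a hypothetical file name for this sketch */
    int fd = open("fixed.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* One-time setup: reserve the full region and pay the metadata
       flush once, with fsync. */
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    /* Steady state: overwrite existing bytes. The file size never
       changes, so fdatasync can typically skip the metadata write. */
    const char *rec = "balance=100";
    if (pwrite(fd, rec, strlen(rec), 0) < 0) { perror("pwrite"); return 1; }
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}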

Write Ordering — The Hard Problem

Why do databases write to a WAL (Write-Ahead Log) first? Crash recovery requires knowing what was intended. A database could be in the middle of updating 10 related pages when power dies. To recover, it needs a record of what to redo or undo. The WAL is that record — write the intent first, fsync it, then apply the change. On crash, replay the WAL. This pattern (WAL + fsync) is how every durable database works.
# The write-ahead logging pattern:
# 1. Write operation to WAL (journal)
# 2. fsync(wal_fd)   ← intent is now durable
# 3. Apply change to data file
# 4. fsync(data_fd)  ← change is now durable
# 5. Mark WAL entry committed

# If crash between step 2 and 4:
#   Recovery: replay WAL entry → apply change → done
# If crash before step 2:
#   Recovery: WAL entry incomplete → ignore → consistent state

# PostgreSQL example: pg_wal/ directory = write-ahead log
# SQLite: -wal file = WAL file (in WAL mode)
# MySQL InnoDB: ib_logfile0/ib_logfile1 = redo log
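Here is the same pattern as a minimal C sketch. File names like journal.wal and data.db are hypothetical, and a real WAL would frame and checksum each record before trusting it during replay:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void) {
    const char *entry = "SET balance=100\n";          /* the intent record */

    /* Steps 1-2: append the intent to the WAL and make it durable. */
    int wal_fd = open("journal.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (wal_fd < 0) die("open wal");
    if (write(wal_fd, entry, strlen(entry)) < 0) die("write wal");
    if (fsync(wal_fd) < 0) die("fsync wal");          /* intent is durable */

    /* Steps 3-4: apply the change to the data file and make it durable. */
    int data_fd = open("data.db", O_RDWR | O_CREAT, 0644);
    if (data_fd < 0) die("open data");
    if (pwrite(data_fd, "100", 3, 0) < 0) die("write data");
    if (fsync(data_fd) < 0) die("fsync data");        /* change is durable */

    /* Step 5 (not shown): mark the WAL entry committed / truncate the WAL. */
    close(data_fd);
    close(wal_fd);
    return 0;
}

If the process crashes between the two fsyncs, recovery replays the durable WAL entry and redoes the write to data.db, which is exactly the guarantee the pattern exists to provide.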

All the sync Variants

# fsync(fd)     — flush data + metadata for one file
# fdatasync(fd) — flush data (+ metadata only if needed)
# sync()        — flush ALL dirty pages systemwide (slow!)
# syncfs(fd)    — flush all dirty pages on the filesystem containing fd

# O_SYNC flag — fsync on every write (no separate fsync needed)
open("file", O_WRONLY | O_SYNC)
# Every write() blocks until data is on disk
# High overhead — use only for truly critical single writes

# O_DSYNC — like O_SYNC but data only (like fdatasync per-write)
open("file", O_WRONLY | O_DSYNC)

# From shell — force flush everything:
sync   # blocks until all dirty pages written to disk
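syncfs() deserves a quick illustration, since it sits between fsync() (one file) and sync() (the whole system). A small C sketch, assuming Linux with _GNU_SOURCE defined:

#define _GNU_SOURCE   /* syncfs() is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Any fd on the target filesystem works; here, the current directory. */
    int fd = open(".", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Flush all dirty pages on this one filesystem, leaving other
       mounted filesystems untouched (unlike sync()). */
    if (syncfs(fd) < 0) { perror("syncfs"); return 1; }

    close(fd);
    return 0;
}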

Group Commit — Making fsync Efficient

Doesn't calling fsync after every transaction make databases slow? Yes: each fsync takes 5-10ms on an HDD, which caps you at roughly 100-200 commits per second. The solution is group commit: batch multiple transactions' WAL writes together, then fsync once for all of them. PostgreSQL, MySQL InnoDB, and SQLite all implement this. One fsync can cover hundreds of concurrent commits, giving far higher throughput than sequential fsyncs.
# Without group commit:
T1: write WAL → fsync → commit   (5ms)
T2: write WAL → fsync → commit   (5ms)
T3: write WAL → fsync → commit   (5ms)
# Total: 15ms, 3 fsyncs

# With group commit:
T1, T2, T3: all write WAL → one fsync for all three (5ms) → all commit
# Total: 5ms, 1 fsync — 3x throughput

# PostgreSQL tuning:
# synchronous_commit = on   (default, safe)
# synchronous_commit = off  (data loss risk, faster)
# commit_delay = 100        (microseconds; delays the flush to widen the batch)
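To make the mechanism concrete, here is a toy C sketch of leader-based group commit using POSIX threads. This is not PostgreSQL's actual implementation, just the core idea: every committer appends its record, and whichever thread finds no flush in progress fsyncs once on behalf of everyone who has appended so far:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int wal_fd;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flushed_cv = PTHREAD_COND_INITIALIZER;
static long written_seq = 0;        /* records appended so far      */
static long flushed_seq = 0;        /* records made durable so far  */
static int flush_in_progress = 0;

static void commit(const char *record) {
    pthread_mutex_lock(&lock);
    if (write(wal_fd, record, strlen(record)) < 0)
        perror("write");            /* sketch only; real code must abort */
    long my_seq = ++written_seq;

    while (flushed_seq < my_seq) {
        if (!flush_in_progress) {
            /* Become the leader: flush everything appended so far. */
            flush_in_progress = 1;
            long batch_end = written_seq;
            pthread_mutex_unlock(&lock);
            fsync(wal_fd);          /* one fsync covers the whole batch */
            pthread_mutex_lock(&lock);
            flushed_seq = batch_end;
            flush_in_progress = 0;
            pthread_cond_broadcast(&flushed_cv);
        } else {
            /* A leader is already flushing; wait to be covered. */
            pthread_cond_wait(&flushed_cv, &lock);
        }
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg) {
    char rec[64];
    snprintf(rec, sizeof rec, "txn from thread %ld\n", (long)arg);
    commit(rec);
    return NULL;
}

int main(void) {
    /* "group.wal" is a hypothetical file name for this sketch */
    wal_fd = open("group.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (wal_fd < 0) { perror("open"); return 1; }

    pthread_t t[8];
    for (long i = 0; i < 8; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 8; i++)  pthread_join(t[i], NULL);
    close(wal_fd);
    return 0;
}

Compile with -pthread. Each thread blocks until a flush covering its record has completed, so many records share a single fsync.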

The Disk Cache Problem

Does fsync guarantee data is on the physical medium? Not always. Drives have their own internal cache, and fsync causes the kernel to issue a cache-flush command to the drive. But some drives lie: they report success without actually flushing (write cache enabled, no power-loss protection). Enterprise drives and RAID controllers typically have battery- or capacitor-backed caches; consumer drives often don't. This is why enterprise storage for databases matters: you need hardware that actually honors the flush.
# Check if write cache is enabled on a drive
hdparm -W /dev/sda
# /dev/sda:
#  write-caching =  1 (on)   ← drive has write cache enabled

# Disable write cache (slower but safer without battery backup)
hdparm -W0 /dev/sda

# On SSDs: check if the drive supports FUA (Force Unit Access)
# FUA lets writes bypass the cache directly to persistent storage
# NVMe drives typically support this; Linux uses it automatically
