Tute World Team | Updated April 28, 2026

System Calls

Every time your program reads a file, sends a network packet, or asks the operating system for more memory, it makes a system call: a controlled crossing of the boundary between user space and kernel space. Understanding how this boundary works explains a lot about performance and security, and why some operations are inherently slow.

CPU Privilege Levels (Rings)

x86-64 CPU privilege levels:

  Ring 0 = Kernel mode
    - Full hardware access
    - Can execute privileged instructions (in/out, lgdt, cli/sti)
    - Can access all memory
    - Where the Linux kernel runs

  Ring 1, 2 = Not used by Linux
    (some older paravirtualizing hypervisors ran guest kernels in Ring 1)

  Ring 3 = User mode
    - Restricted hardware access
    - Cannot execute privileged instructions
    - Memory access limited to own address space
    - Where your applications run

Crossing from Ring 3 to Ring 0:
  - System call (intentional, controlled)
  - Exception (page fault, divide by zero)
  - Interrupt (hardware event)

  In all cases: user state is saved, execution moves to a kernel stack, and control jumps to kernel code

The syscall Instruction

How does a system call actually cross from user space to kernel space? On x86-64, user space executes the SYSCALL instruction. The CPU saves the return address in RCX and RFLAGS in R11, sets CPL (Current Privilege Level) to 0, and jumps to the kernel's syscall entry point, whose address it reads from the LSTAR MSR (set at boot). SYSCALL itself does not switch stacks or save the other registers; the kernel's entry code does that first. The kernel then reads the syscall number from RAX, looks it up in the syscall table, and calls the handler. On return, SYSRET restores user space state.

# System call mechanism (x86-64):
# 1. User space puts syscall number in RAX
# 2. Arguments go in: RDI, RSI, RDX, R10, R8, R9
# 3. Execute SYSCALL instruction
# 4. CPU:
#    - Saves RIP (return address) in RCX
#    - Saves RFLAGS in R11
#    - Loads kernel RIP from LSTAR MSR
#    - Switches to Ring 0 (the stack pointer still points at the user stack)
# 5. Kernel:
#    - Switches to the kernel stack and saves all registers on it
#    - Looks up RAX in sys_call_table[]
#    - Calls the handler: sys_read, sys_write, etc.
#    - Return value goes in RAX
#    - Restores registers, executes SYSRET
# 6. User space resumes
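The table lookup in step 5 can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `dispatch` function, the `regs` dictionary, and the three handlers are hypothetical stand-ins, and the handlers simply forward to Python's `os` module. Numbers 1 (write) and 3 (close) come from the x86-64 table; 39 (getpid) is assumed here as one more x86-64 number.

```python
import errno
import os

# Argument registers in the order the x86-64 syscall convention uses them.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def sys_write(fd, buf, count, *_unused):
    """Toy handler: write at most `count` bytes of `buf` to `fd`."""
    return os.write(fd, buf[:count])


def sys_close(fd, *_unused):
    """Toy handler: close `fd` and report success as 0."""
    os.close(fd)
    return 0


def sys_getpid(*_unused):
    """Toy handler: return the current process ID."""
    return os.getpid()


# Toy stand-in for sys_call_table[]: syscall number -> handler.
SYS_CALL_TABLE = {1: sys_write, 3: sys_close, 39: sys_getpid}


def dispatch(regs):
    """Model step 5: look up RAX, call the handler, return the value for RAX."""
    handler = SYS_CALL_TABLE.get(regs["rax"])
    if handler is None:
        return -errno.ENOSYS  # unknown syscall numbers yield -ENOSYS
    args = [regs.get(name, 0) for name in ARG_REGISTERS]
    return handler(*args)


if __name__ == "__main__":
    msg = b"hello from the toy dispatcher\n"
    written = dispatch({"rax": 1, "rdi": 1, "rsi": msg, "rdx": len(msg)})
    print("bytes written:", written)
```

The real kernel indexes an array of function pointers rather than a dictionary, but the shape is the same: the number in RAX selects the handler, the argument registers become its parameters, and the result travels back in RAX.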

# Syscall numbers (x86-64):
# 0  = read
# 1  = write
# 2  = open
# 3  = close
# 9  = mmap
# 57 = fork
# 59 = execve
# Full list: /usr/include/asm/unistd_64.h (exact path varies by distribution)
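To see that these numbers are all the kernel needs, you can invoke a syscall by number instead of by name. The sketch below goes through libc's generic `syscall()` wrapper using Python's `ctypes`. It assumes Linux on x86-64 and uses 39 (getpid), a number not shown in the short list above; `raw_getpid` is a hypothetical helper name, and it returns `None` on any other platform rather than guessing.

```python
import ctypes
import os
import platform

# Assumption: 39 is getpid on x86-64 Linux. Other architectures use other numbers.
SYS_GETPID_X86_64 = 39


def raw_getpid():
    """Call getpid by syscall number through libc's syscall() wrapper.

    Returns None where this sketch does not apply (non-Linux or non-x86-64).
    """
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None
    # CDLL(None) exposes the symbols already loaded into this process, libc included.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc places the number in RAX and executes the SYSCALL instruction.
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


if __name__ == "__main__":
    print("raw syscall:", raw_getpid(), "| os.getpid():", os.getpid())
```

On a matching system both values should be the same process ID, since `os.getpid()` ends up making the same request.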

vDSO: Syscalls Without Crossing into the Kernel

Why is gettimeofday() fast even though it seems to need kernel data? The vDSO (virtual Dynamic Shared Object) is a tiny shared library that the kernel maps into every process's address space. It contains kernel-provided code that runs in user space, covering a few syscalls that only read data the kernel can safely expose read-only. gettimeofday() reads a shared memory page that the kernel keeps updated, so no ring transition is needed. This makes it roughly 10x faster than a real syscall.

# vDSO in every process:
grep vdso /proc/self/maps
# 7fff12300000-7fff12301000 r-xp 00000000 00:00 0  [vdso]

# Syscalls accelerated by vDSO:
# clock_gettime   - nanosecond time (most common use)
# gettimeofday    - microsecond time
# getcpu          - which CPU this thread is on
# time            - seconds since epoch

# Without vDSO: each call = SYSCALL instruction + ring switch (~100-200 ns)
# With vDSO: memory read + math (~10-20 ns)

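# Confirm vDSO is used (./myapp stands for any program that reads the clock):
strace ./myapp 2>&1 | grep clock_gettime
# (no output: clock_gettime is handled in user space via vDSO)
strace -e trace=clock_gettime ./myapp
# (no clock_gettime lines: the call never reaches the kernel, so strace
#  cannot see it; the calls do appear if the clock source forces a
#  fallback to the real syscall)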
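A program can also look for its own vDSO mapping. The sketch below parses text in the `/proc/self/maps` format and returns the address range of the `[vdso]` entry. The helper names `find_vdso` and `own_vdso` are hypothetical, and `own_vdso` returns `None` on systems without `/proc`.

```python
def find_vdso(maps_text):
    """Return (start, end) addresses of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


def own_vdso():
    """Look up the vDSO of the current process; None if it cannot be read."""
    try:
        with open("/proc/self/maps") as maps:
            return find_vdso(maps.read())
    except OSError:
        return None


if __name__ == "__main__":
    location = own_vdso()
    if location is None:
        print("no [vdso] mapping found")
    else:
        start, end = location
        print(f"vDSO mapped at {start:#x}-{end:#x} ({end - start} bytes)")
```

The address differs from run to run on most systems because the mapping is placed at a randomized location.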

The Cost of System Calls

# Syscall overhead varies by kernel version and hardware:
# Pre-Spectre/Meltdown patches: ~100 ns
# Post-patches (KPTI, Retpoline): ~200-400 ns

# KPTI (Kernel Page Table Isolation) — Meltdown fix:
# Kernel and user space now have separate page tables
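# Switching between them on every kernel entry and exit costs extra cycles,
# plus a TLB flush on CPUs without PCID
# Can be disabled with the nopti kernel parameter, which removes Meltdown
# protection: only consider it on unaffected CPUs or hosts running trusted code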

# Measuring syscall overhead:
# syscall() in a tight loop:
# getpid(), one of the cheapest syscalls, takes ~200 ns on patched x86 (was ~50 ns pre-Meltdown)
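A rough version of that tight-loop measurement can be sketched in Python. Treat the results as upper bounds: interpreter overhead is added to every call, so the numbers will be higher than those quoted above. The sketch assumes that `os.getpid()` issues a real syscall on each call (true for current glibc) and that `time.monotonic_ns()` is served by the vDSO; `ns_per_call` is a hypothetical helper.

```python
import os
import time


def ns_per_call(fn, iterations=200_000):
    """Average wall-clock nanoseconds per call of fn, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


if __name__ == "__main__":
    real_syscall = ns_per_call(os.getpid)         # assumed: crosses into the kernel
    vdso_call = ns_per_call(time.monotonic_ns)    # assumed: stays in user space
    print(f"os.getpid():         {real_syscall:.0f} ns per call")
    print(f"time.monotonic_ns(): {vdso_call:.0f} ns per call")
```

For numbers closer to the hardware cost, the same loop written in C removes most of the interpreter overhead.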

# Why this matters:
# 1 million short writes (1 byte each): 1M syscalls
# At 200ns each = 200ms just in syscall overhead
# vs. 1 write(fd, bigbuffer, 1MB): 1 syscall = ~200 ns of overhead (plus the copy itself)
# ALWAYS batch small I/O operations!
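The arithmetic above, and the effect of batching, can be checked with a short sketch. `overhead_ms` reproduces the estimate (the 200 ns default is the rough figure used above, not a measurement), and `time_writes` times a series of `os.write` calls to the null device. Both helper names are hypothetical, and the timings include Python's own per-call overhead.

```python
import os
import time


def overhead_ms(n_syscalls, ns_per_syscall=200):
    """Estimated time spent purely on syscall entry and exit, in milliseconds."""
    return n_syscalls * ns_per_syscall / 1e6


def time_writes(chunks):
    """Write each chunk with its own write() call; return elapsed seconds."""
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for chunk in chunks:
            os.write(fd, chunk)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    n = 100_000
    unbatched = time_writes([b"x"] * n)   # n syscalls, 1 byte each
    batched = time_writes([b"x" * n])     # 1 syscall, n bytes
    print(f"estimate for 1M syscalls: {overhead_ms(1_000_000)} ms")
    print(f"{n} one-byte writes: {unbatched * 1e3:.2f} ms")
    print(f"one {n}-byte write:  {batched * 1e3:.2f} ms")
```

Buffered I/O layers, such as C's stdio or Python's default file objects, apply this batching automatically by collecting small writes in user space and flushing them in larger chunks.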
