Boosting Performance: Removing fsync from Local Storage

Performance gains can be substantial when you understand and carefully tune low-level I/O operations.

For decades, fsync() has been the bedrock of data durability in POSIX-compliant systems. It’s the grumpy gatekeeper of your data, ensuring that everything you’ve written isn’t just sitting in some volatile memory buffer, waiting to vanish with the next power hiccup. This unwavering commitment to safety, however, comes at a steep performance cost. For applications where raw speed is paramount, especially within custom storage engines, the question isn’t if fsync() is a bottleneck, but how profoundly it cripples throughput. We’re going to dive deep into the mechanics of bypassing this stubborn API for local storage, understanding the trade-offs, and identifying the specific scenarios where this aggressive optimization is not just viable, but potentially revolutionary.

The “Atomic Commit” Illusion: How fsync Guardrails Your Data (and Your Sanity)

At its core, a transaction within a database or a write operation in any persistent storage system aims for atomicity. You want an operation to either fully complete or not happen at all. This guarantees that in the event of a crash or power loss, your system is left in a consistent, recoverable state. This is where fsync() plays its crucial role.

When an application writes data, it’s typically buffered by the operating system’s page cache. This is a performance optimization; writing to RAM is orders of magnitude faster than writing to disk. However, this data is volatile. fsync() forces the OS to not only write the data from the page cache to the disk’s physical blocks but also to ensure that all metadata related to that write (like file size updates, block allocation information, etc.) is also durably persisted. It then waits for confirmation from the underlying storage device that this flush operation is complete. This “flush and wait” mechanism is what provides the strong durability guarantee.
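
For reference, the conventional durable-write path that the rest of this post tries to avoid looks roughly like the following C sketch (error handling abbreviated): buffer the write through the page cache, then fsync() and wait for the device to acknowledge.

    #include <fcntl.h>
    #include <unistd.h>

    /* Conventional durable append: the write() lands in the kernel page
     * cache; fsync() then flushes data plus metadata and waits for the
     * storage device to acknowledge. The wait is the bottleneck. */
    int durable_append(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }

        if (fsync(fd) != 0) {   /* flush and wait: the expensive part */
            close(fd);
            return -1;
        }
        return close(fd);
    }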

The problem is the “wait.” Modern storage devices, particularly SSDs and NVMe drives, have their own sophisticated caching mechanisms, often with battery backup or supercapacitors for power loss protection (PLP). They can perform writes internally in ways that are much more efficient than simply flushing the OS page cache. By invoking fsync(), we’re often telling the storage device to do something it’s already doing, or to perform a less efficient flush operation to satisfy the OS’s demand for a guaranteed persistent state. This creates a bottleneck, effectively serializing writes and dramatically reducing the achievable IOPS (Input/Output Operations Per Second).

The sentiment in performance tuning circles is often that fsync() is a “wart” on the Linux/Unix API. It’s necessary for general-purpose file system guarantees, but for specialized systems, it’s an impediment to unlocking the true potential of high-performance hardware.

Crafting the Bypass: Engineering for Durability without the fsync() Hangover

Bypassing fsync() isn’t about simply commenting out a line of code. It’s a fundamental re-architecture of how your storage engine interacts with the underlying hardware and the operating system. The key is to shift the responsibility for durability from the generic OS file system layer to your application-specific logic, leveraging modern hardware capabilities.

The core strategies involve:

  1. O_DIRECT Writes: This flag tells the operating system to bypass the kernel page cache entirely. Writes go directly from the application’s buffer to the storage device. This eliminates the overhead of double buffering and page cache management. However, O_DIRECT alone doesn’t guarantee durability; data can still reside in the device’s DRAM cache. A minimal sketch of this, combined with preallocation, follows this list.

  2. Preallocated, Pre-zeroed Files/Extents: Instead of growing files dynamically, which involves metadata updates that fsync() would normally protect, you can preallocate large chunks of storage. Pre-zeroing these extents ensures they are clean and ready for direct writes, avoiding the need for traditional file system metadata updates for data placement.

  3. Journaling Aligned to Hardware Atomic Write Units: This is perhaps the most critical piece. Modern NVMe drives have features like Atomic Write Unit Power Fail (AWUPF). This specifies the smallest power-fail safe unit of data a drive can write atomically. If your journaling mechanism (e.g., a write-ahead log or WAL) writes data in units that are aligned with this hardware guarantee, you can achieve durability by writing to the journal and relying on the drive’s inherent atomicity for those specific writes. The system then only needs to ensure the journal pointer is durably updated, not every single data block. fdatasync() can be used here, as it flushes data and essential metadata, which might be sufficient for a well-designed journal.

  4. Leveraging Enterprise SSD PLP: Enterprise-grade SSDs often include Power Loss Protection (PLP). This means they have capacitors or supercapacitors that provide enough power to flush their internal DRAM cache to non-volatile NAND flash during a power failure. If you are certain your storage hardware has robust PLP, you can rely on the device itself to handle the final data persistence. This significantly reduces the need for explicit fsync() calls from the application.
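
To make strategies 1 and 2 concrete, here is a minimal C sketch of opening a log file with O_DIRECT and preallocating its extent, then writing from an aligned buffer. The 4096-byte alignment, the 1 GiB extent, and the use of Linux’s fallocate() are illustrative assumptions; in practice the required alignment comes from the device’s logical block size, and O_DIRECT semantics vary by filesystem.

    #define _GNU_SOURCE            /* O_DIRECT, fallocate() on Linux */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <string.h>

    #define ALIGN        4096UL        /* assumed logical block size  */
    #define EXTENT_BYTES (1UL << 30)   /* assumed 1 GiB preallocation */

    /* Open a log file for direct I/O and preallocate its extent so that
     * later writes never trigger block-allocation metadata updates. */
    int open_direct_log(const char *path)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return -1;
        if (fallocate(fd, 0, 0, EXTENT_BYTES) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* O_DIRECT requires buffer, offset, and length to be block-aligned. */
    ssize_t direct_write(int fd, const void *data, size_t len, off_t offset)
    {
        size_t padded = (len + ALIGN - 1) & ~(ALIGN - 1);
        void *buf;
        if (posix_memalign(&buf, ALIGN, padded) != 0)
            return -1;
        memset(buf, 0, padded);
        memcpy(buf, data, len);

        ssize_t n = pwrite(fd, buf, padded, offset);  /* bypasses page cache */
        free(buf);
        return n;
    }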

Consider an example scenario where a custom storage engine implements a write-ahead log using O_DIRECT. Instead of write() followed by fsync(), it might perform write() to the log file using O_DIRECT. If the log entries are structured to align with the SSD’s AWUPF, the durability guarantee comes from the hardware’s atomic write capability for that unit. The system only needs to ensure that the pointer to the “end” of the log is updated durably, which can be achieved with a single, aligned fdatasync() on a very small metadata block, or even by structuring the log so that a successful write of a log segment header implicitly signifies the durability of its contents.
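
Here is a sketch of what such a log record might look like, assuming a hypothetical 4 KiB atomic write unit derived from the drive’s reported AWUPF. The layout, the FNV-1a checksum (a stand-in for a real CRC32C), and the function names are illustrative rather than a reference implementation: the point is that each record fills exactly one atomic unit and carries enough information (sequence number, length, checksum) for recovery to tell a fully persisted record from a torn or stale one without any fsync() on the data path.

    #include <stdint.h>
    #include <string.h>

    #define ATOMIC_UNIT 4096u   /* assumed AWUPF-derived atomic write size */

    /* One log record padded to exactly one atomic write unit: if the drive
     * persists the unit atomically, the record is wholly present or wholly
     * absent after a crash. */
    struct wal_record {
        uint64_t sequence;       /* monotonically increasing LSN            */
        uint32_t payload_len;    /* bytes of payload actually used          */
        uint32_t checksum;       /* checksum over payload, mixed with LSN   */
        uint8_t  payload[ATOMIC_UNIT - 16];
    };
    _Static_assert(sizeof(struct wal_record) == ATOMIC_UNIT,
                   "one record per atomic write unit");

    /* FNV-1a hash, an illustrative stand-in for CRC32C. */
    static uint32_t fnv1a(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Build a record; the caller copies it into an O_DIRECT-aligned buffer
     * and pwrite()s it at sequence * ATOMIC_UNIT in the preallocated log. */
    void wal_encode(struct wal_record *rec, uint64_t seq,
                    const void *payload, uint32_t len)
    {
        memset(rec, 0, sizeof(*rec));
        rec->sequence = seq;
        rec->payload_len = len;
        memcpy(rec->payload, payload, len);
        rec->checksum = fnv1a(rec->payload, len) ^ (uint32_t)seq;
    }

    /* A record counts as durable only if its checksum verifies. */
    int wal_record_valid(const struct wal_record *rec)
    {
        if (rec->payload_len > sizeof(rec->payload))
            return 0;
        return rec->checksum ==
               (fnv1a(rec->payload, rec->payload_len) ^ (uint32_t)rec->sequence);
    }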

This approach allows for incredibly high write throughput because the OS cache is bypassed, and the storage device’s internal optimizations are leveraged. Benchmarks show dramatic improvements, with custom engines achieving figures like 190,985 obj/s compared to ~116,041 obj/s when fsync() is used. This isn’t a marginal improvement; it’s roughly a 65% increase in throughput in that comparison.

The Tightrope Walk: When Durability Contracts Narrow

The ability to discard fsync() comes with a profoundly important caveat: you are entering into a narrower durability contract than what POSIX file semantics provide. This is not a decision to be taken lightly, and it fundamentally changes how your system must be designed and deployed.

This strategy is unequivocally suitable only for specialized, single-node storage engines with explicitly defined durability semantics.

Here’s what that means in practice:

  • SSD-Only Deployments: This optimization is predicated on the characteristics of SSDs. Consumer-grade SSDs, in particular, often lack robust PLP. Without it, a power failure can mean that data residing in the SSD’s DRAM cache is irretrievably lost. Hardware PLP is therefore a prerequisite, and enterprise SSDs are a much safer bet.
  • Storage Engine Owns Allocation, Journaling, and Recovery: You cannot simply disable fsync() and expect your existing database code to function correctly. Your storage engine must take complete ownership of:
    • Allocation: How disk space is managed and allocated.
    • Journaling: How writes are recorded before being applied to their final destination, and how that journal is made durable.
    • Recovery: How to rebuild a consistent state from the journal after a crash. This is paramount. If data in volatile caches (kernel or device) is lost, your custom recovery logic must be able to reconstruct what was lost, or accept the potential loss gracefully. A sketch of such a replay loop follows this list.
  • Accepting a Different Durability Contract: You are trading the absolute, OS-guaranteed durability of fsync() for potentially higher performance. This implies that in certain failure scenarios (e.g., power loss before recently written data has reached stable media), you might lose a small, recent window of data. This loss must be manageable and predictable by your application. For instance, some systems might tolerate losing the last few milliseconds of transactions, while others would deem this unacceptable.
  • No General-Purpose Databases: This is not for PostgreSQL, MySQL, or standard file systems. These systems rely on the POSIX durability guarantees for a reason. Their clients expect them to behave like the operating system’s file system in terms of data safety.
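
To make the recovery requirement concrete, the following sketch shows the kind of replay loop such an engine runs at startup. It assumes the illustrative record layout from the earlier WAL sketch (wal_record / wal_record_valid) and a hypothetical apply_to_engine() callback: scan the preallocated log from the start, replay every record whose checksum verifies, and treat the first invalid record as the end of durable history.

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>

    #define ATOMIC_UNIT 4096u

    /* Same illustrative layout as the earlier WAL sketch. */
    struct wal_record {
        uint64_t sequence;
        uint32_t payload_len;
        uint32_t checksum;
        uint8_t  payload[ATOMIC_UNIT - 16];
    };

    int  wal_record_valid(const struct wal_record *rec);  /* earlier sketch  */
    void apply_to_engine(const struct wal_record *rec);   /* engine-specific */

    /* Crash recovery: replay valid records in order; stop at the first one
     * that fails its checksum. Anything past that point is, by the narrowed
     * durability contract, acceptably lost. Returns the last durable LSN. */
    uint64_t wal_recover(const char *log_path)
    {
        struct wal_record rec;
        uint64_t last_seq = 0;

        int fd = open(log_path, O_RDONLY);
        if (fd < 0)
            return 0;

        for (off_t off = 0; ; off += ATOMIC_UNIT) {
            ssize_t n = pread(fd, &rec, sizeof(rec), off);
            if (n != (ssize_t)sizeof(rec) || !wal_record_valid(&rec))
                break;                        /* end of durable history */
            apply_to_engine(&rec);
            last_seq = rec.sequence;
        }

        close(fd);
        return last_seq;
    }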

The risks of miscalculation are severe: data corruption, lost updates, and inconsistent states. Tools like libeatmydata (an LD_PRELOAD shim that turns fsync() into a no-op) demonstrate the extreme performance gains of skipping fsync() altogether, but they explicitly sacrifice crash safety. Projects like MinIO and SeaweedFS ship with per-write fsync() disabled by default and require explicit configuration for strict durability, acknowledging that users must opt in to this performance profile.

The core insight here is that fsync() is a generalized solution for a generalized problem. When you have a very specific problem domain – like a single-node, high-throughput local storage engine for a particular workload – you can build a more tailored, and often faster, solution. This involves deep knowledge of your hardware’s capabilities and a robust, custom-built resilience mechanism within your application. The performance uplift is undeniable, but the engineering discipline required is equally significant. For those who can meet that bar, the performance ceiling of local storage can be dramatically raised.
