Linux Storage Cache

Linux Content Index

File System Architecture – Part I
File System Architecture– Part II
File System Write
Buffer Cache
Linux File Write 

Every system use case involves a combination of software modules talking to each other. But an optimal system is more than set of efficient modules. The overall architectural arrangement of these modules and their data/control interfaces also impacts the performance. In that sense, several useful design methodologies can be observed from examining the storage sub system within Linux.

Application to VFS
Fig 1

Various abstraction layers have their own purpose. For example, VFS exposes a uniform interface to the user applications, while it’s the file system which actually implements the structural representation of files within the device. Eventually, how each storage sector is accessed will depend on the block driver interactions with the hardware. Productivity depends on how quickly a system can service user requests. In other words, the response time is critical. The pure logic implemented for each of the involved modules could be efficient, but if they are merely stacked up vertically then it will be only as good as using a primitive processor with no internal caches or pipelined executions.

*********************************

Inode-Dentry Cache

Book keeping information associated with each file is represented by two kernel structures — Inodes and Dentries. In bare simple terms these constructs hold the information necessary to identify and locate the storage blocks associated with a file. A multitasking environment might have numerous processes accessing the same volume or even the same file. Caching this information at higher levels can reduce contentions associated with the corresponding lower layers. If the repeated file accesses would be picked from the VFS cache, then the file system will be free to quickly service the new requests. Consider a use case where there are multiple processes looking up for a file in different dedicated directories; without a cache, the contention on the file system will scale with the number of processes. But with a good cache mechanism, only the brand new requests would end up accessing the file system module. The idea is quite like a fully associative hierarchical hardware cache.

Inode/Dentry Cache
Fig 2 – Inode/Dentry Cache

*********************************

Data Cache

Inodes and Dentries are accessed by the kernel to eventually locate the file contents, hence the above cache is only a part of the solution while the data cache handles the rest (see below Fig -3). Even here we need frequently accessed file data to be cached and only the new writes or new reads should end up contending for file system module attention. Please note that inside the kernel, the pages associated with file are arranged as a radix tree indexed via file offsets. While the dentry cache is managed by a hash table. Also, each such data cache is eventually associated with the corresponding Inode.

Cache mechanism also determines the program execution speed, it’s closely related to how efficiently the binary sections of the executable are managed within the RAM. For example, the read-ahead heuristics would determine how often the system endures an executable page miss. Also another useful illustration of the utility of a cache is the use case where a program is accessing files from the same volume where its executable is paging from; without a cache the contention on the file system module would be tremendous.

Full Cache
Fig 3 – Full Cache

*********************************

Cache Management

Cache management involves its own complexities. How these buffers are filled, flushed, invalidated or marked as dirty depends on the file system module. Linux kernel merely invokes the file system registered call-backs and expects that the associated cache buffers would be handled accordingly. For example, during a file overwrite the file system should mark the overwritten cached page as dirty, otherwise the flush may never get invoked for that page.

We can also state that the Linux kernel VFS abstracts the pure logic of how data is represented on the storage device from how it’s viewed by kernel on a running system. In this manner the role of the file system is limited to its core functionality of mapping file access to actual storage blocks and it also deals with the transformation between internal file system specific representation of files to the generic VFS Inode, Dentry and data cache constructs. Eventually, how file is created, deleted or modified and how the associated data cache is managed depends solely on the file system module.

For example, a new file creation would involve invoking of the file system module call-backs registered via kernel structure “inode_operations”. The lookup call-backs registered here would be employed to avoid file name duplication and also the obvious “create inode” call would be needed to create the file. An fopen would invoke “file_operations” callback and similarly a data read/write would involve use of “address_space_operations”.

How the registered file system call-backs would eventually manage the inode, dentry and the page data cache buffers would depend on its own inherent logic. So here there is a kind of loop where the VFS calls the file system module call-backs and those modules would in turn call other kernel helper functions to handle specific cache buffer operations like clean, flush, write-back etc. In the end when the file system call-backs return the control back to VFS the associated caches should be already configured and ready for use.

Inode/Page Cache maps
Fig 4 – Inode/Page Cache maps

*********************************

Load Store Analogy

A generic Linux file system can be described as a module which directly services the kernel cache by implementing functions for creating, invalidating or flushing various cache entries and in this manner it only indirectly caters to the POSIX file operations. The methodology here is in a way analogous to how a RISC load-store would differ from a 8051 like CISC design, separation of the slow memory access from the register operations constitute the core attribute of a load-store mechanism. Here the Linux kernel ensures that the operations on the cache can happen in parallel with an ongoing load or store which might involve the file system and the driver modules, the obvious trade-offs are complexity and memory consumption.

*********************************

Pipelines

Caches are effective when there are read ahead mechanisms and background flusher tasks. A pipelined architecture would involve as much parallel execution as possible. Cache memories are filled in advance by the reader threads while the dirty entries are constantly flushed in the background by worker threads. Such sector I/O requests are asynchronously queued and processed by the block driver task which will eventually DMA them into the interfaced device. After the writes, the corresponding cache entries are marked as clean and similarly after the reads the entries are marked as up-to-date. The pipeline stages can be seen in the below Fig 5.

Kernel Storage Pipeline
Fig 5 – Kernel Storage Pipeline

System organization evolves over time when various use cases are analyzed for performance bottlenecks and the system arrangement itself is constantly revisited to remove those identified issues. Matured generic software architectures adapt effectively to all the intended use cases while remaining  within various resource constraints. For a general purpose operating system like Linux, the infrastructure for cache memory utilization is a good illustration of how various abstract software components can seamlessly coordinate with widely varying user requests.

Linux File System Write

Linux Content Index

File System Architecture – Part I
File System Architecture– Part II
File System Write
Buffer Cache
Storage Cache

Linux file write methodology is specific to particular file systems. For example, JFFS2 will sync data within the user process ‘context’ itself (inside write_end call), while file systems like UBIFS, LogFS, Ext & FAT tends to create dirty pages and let the background flusher task manage the sync.

These two methodologies come with their own advantages, but usage of the second method leads to Linux kernel exhibiting certain quirks which might drastically impact the write throughput measurements.

Linux File Write

The above diagram does give an overview of the design, there are in fact two main attributes to this architecture:

COPY : User space data is copied into a kernel page associated with the file and marked as dirty via“write_begin” and “write_end” file system call backs, they are in fact invoked for every “write” system call made by the user process. So while they are indeed kernel mode functions they get executed on behalf of the user space process which in turn simply waits on the the system call. The sequence is illustrated below:

    write_begin()  /* Informs FS about the number of bytes being written */
    mem_copy()    /* Copies data to kernel page cache.*/
    write_end()     /* Copied kernel page marked as dirty and FS internal */
/*  meta information is updated */

&

Write-Back : The Linux kernel will spawn a flusher task to write these above created dirty pages into the storage. For that this newly spawned task would typically invoke file system specific “writepage” or “writepages” call-back.

Some Nitty-Gritty Details

The call-backs mentioned above are specific to the file system and are essentially registered during its initialization with kernel. You might have noticed that the above design almost resemble a classic producer consumer scenario, but there are certain nuances:

Lower Threshold: The flusher task (consumer) is spawned only when a certain percentage of page cache becomes dirty. So there is indeed a lower threshold (configured in /proc/sys/vm/dirty_background_ratio) for spawning this task. Please note that it can also be invoked after an internal time-out. So even if the dirty pages are in small numbers, they are not kept around in volatile memory for a long time.

Higher Threshold: Once a certain high percentage of page cache is dirtied by a user process (producer), this process is forced to sleep to avoid the kernel running out of memory. Such a user process is not unhooked from this involuntary sleep until the flusher task catches up and brings down the dirty page levels to acceptable limits. This way kernel tends to push back on the process which comes with a massive capacity to write data. This higher threshold can be specified by writing to /proc/sys/vm/dirty_ratio.

So between the above mentioned lower and upper thresholds the system tends to have two concurrent tasks, one which dirty the pages with contents and the other which cleans them up with a write back. If there are too many dirty pages being created, then Linux might also spawn multiple flusher tasks. Seems like this design might scale well on SMP architectures with multiple storage disk paths.

We could bring some more clarity with the exposition of code level details. The critical aspect of the above mechanism is implemented in the below mentioned balance_dirty_pages (page-writeback.c) which gets invoked after “write_end” call inside kernel.

The function audits the number of dirty pages in the system and then decides whether the user process should be stalled.

static void balance_dirty_pages (struct address_space *mapping, unsigned long write_chunk)

{
int pause = 1;


……
………
for (;;)

{
….
……         /* Code checks for dirty page status and breaks out of the loop only if the dirty page ratio has gone down */
……..

io_schedule_timeout(pause);

/*
* Increase the delay for each loop, up to our previous
* default of taking a 100ms nap.
*/
pause <<= 1; if (pause > HZ / 10)
pause = HZ / 10;
}
}

The argument ‘pause’ to the function io_schedule_timeout specify the sleep time in terms of jiffies while the big for (;;) loop repeatedly checks the dirty page ratio after each sleep cycle and if there is no ample reduction in that then the sleep time is doubled until it reaches 100mS, which is in fact the maximum time the ‘producer‘ process can be forced to sleep. This algorithm is executed during file writes to ensure that no rogue process goes amok consuming the full page cache memory.

The Gist

We can attribute considerable significance for the above mentioned upper and lower thresholds, if they are not fine tuned for the system specific use cases then the RAM utilization will be far from optimal. The particulars for these configurations will depend on RAM size plus how many processes will concurrently initiate file system writes, how big these writes will be, file sizes being written, typical time intervals between successive file writes etc.

A file system mounted with syncs off is meant to utilize page cache for buffering. Optimal write speeds will mandate that the upper threshold be large enough to buffer the full file while allowing seamless concurrent write backs, otherwise the kernel response time for user process writes will be erratic at best.

Another critical aspect is that the writepage should have minimal interference with write_begin & write_end, only then the flusher task write backs can happen with least friction with regards to the user process which is constantly invoking the latter two functions. If they share some resource, then eventually the user process and the flusher task will end up waiting on each other and the expected quick response time of an asynchronous buffered page cache write will not be materialized.

Linux Buffer Cache and Block Device

Linux Content Index

File System Architecture – Part I
File System Architecture– Part II
File System Write
Buffer Cache
Storage Cache

Working on a flash file system port to Linux kernel mandates a good understanding of its buffer management, sometimes the excavation of internals can be a bit taxing but it’s needed for executing a well integrated port. Hopefully this write up will help someone who is looking to get a good design overview of the kernel buffer/page cache.

VFS work on top of pages and it allocates memory with a minimum granularity of the page size (usually 4K) but a storage device is capable of an I/O size much smaller than that (typically 512 bytes).  A case where the device write granularity is set to page size can result in redundant writes, for example if an application modifies 256 bytes of a file then the corresponding page in cache is marked as dirty & flushing of this file results in a transfer of 4K  bytes because there is no mechanism for the cache to identify the dirty sector within this dirty PAGE.

Buffer Cache bridges the page and the block universes, a page associated with a file  will be divided into buffers & when a page is modified only the corresponding buffer is set as dirty, so while flushing only the dirty buffer gets written. In other words a buffer cache is an abstraction done on top of a page, its simply a different facet to the same RAM memory. The data structure called “buffer_head” is used for managing these buffers, the below diagram is a rather crass depiction of this system.

Buffer Cache can traverse the list of these buffers associated with a page via the “buffer_head” linked list and identify their status; note that the above diagram assumes a buffer size of 1K.

How about streaming writes?

When dealing with huge quantities of data, the method of sequential transfer of small sectors can lead to a pronounced overall read-write head positioning delay and there is also this additional overhead due to multiple set up and tear down cycles of transfers.

Lets consider an example of a 16K buffer write operation consisting of four 4K pages, the memory mapping for these 4 pages are given in the table below. For example, sectors 0 to 3 of a file is located in the RAM address 0x1000 to 0x1FFF, and it’s mapped to storage blocks 10, 11, 12 & 13.

Blindly splitting the 16K into 16 buffers and writing them one by one will result in as many transfer set ups and teardowns, and also an equivalent number of read-write head position delays.

How about we reorder the writing of pages in the following manner?

Page1 >>> Page 3 >>> Page 2 >>> Page 4

The above sequence is the result of sorting the pages in an increasing order of block mappings, ie: from 10 till 25, a closer look will also reveal that the adjacent transfers have addresses running from 0x1000 till 0x4FFF and hence contiguous in RAM memory. This results in an uninterrupted  transfer of 16K bytes, which seems to be the best case optimization.

How does Linux achieve this gathering?

Let’s try to comprehend how bock driver subsystem skews application’s order of writes to achieve the previously described optimization.

Algorithm implemented by the above architecture is explained below

Step 1: A Buffer is encapsulated in a BIO structure.
File system buffers are encapsulated in a BIO structure  (submit_bh function can give the details). At this stage all the 16 buffers in our example are wrapped inside its corresponding BIO struct. We can see the mapping in the below table.

Step 2: A BIO is plugged into a request.
BIOs are sequentially submitted to the below layers and the those mapped to adjacent blocks are clubbed into a single “request“.  With the addition of the request struct we have the below updated mapping table, as you can see “Request 0” combines BIO 0,1,2 & 3.

Prior to adding the requests to the global block driver queue there is an intermediate task_struct queue, this is an obvious optimization aimed at clubbing the adjacent BIOs into one request and to minimize contention for the block driver queue. Yes, a global queuing operation cannot happen in parallel to another queuing or a de-queuing.

Step 3: Merge requests into the global queue.
Merging a request into the global queue will reorder and coalesce the adjacently mapped “requests” into one and in our case we will finally have only one single request which will envelop all the BIOs.

Step 4:Finally we need to identify the contiguous memory segments.
Block device driver will identify consecutive memory segments across adjacent BIOs and club them before initiating a transfer, this will reduces the I/O setup – tear down cycles. So we will have a segment attribute added to our final table which is given below.

The block device driver can also have a request scheduler, this scheduler algorithm will prioritize and pick requests to be de-queued from the global queue for I/O processing. Please note that the steps for read requests also follow the above framework.

Inference

The above bells and whistles are irrelevant when it comes to flash memories, so avoiding all this baggage might make a lot of sense.