Linux Content Index
Working on a flash file system port to Linux kernel mandates a good understanding of its buffer management, sometimes the excavation of internals can be a bit taxing but it’s needed for executing a well integrated port. Hopefully this write up will help someone who is looking to get a good design overview of the kernel buffer/page cache.
VFS work on top of pages and it allocates memory with a minimum granularity of the page size (usually 4K) but a storage device is capable of an I/O size much smaller than that (typically 512 bytes). A case where the device write granularity is set to page size can result in redundant writes, for example if an application modifies 256 bytes of a file then the corresponding page in cache is marked as dirty & flushing of this file results in a transfer of 4K bytes because there is no mechanism for the cache to identify the dirty sector within this dirty PAGE.
Buffer Cache bridges the page and the block universes, a page associated with a file will be divided into buffers & when a page is modified only the corresponding buffer is set as dirty, so while flushing only the dirty buffer gets written. In other words a buffer cache is an abstraction done on top of a page, its simply a different facet to the same RAM memory. The data structure called “buffer_head” is used for managing these buffers, the below diagram is a rather crass depiction of this system.
Buffer Cache can traverse the list of these buffers associated with a page via the “buffer_head” linked list and identify their status; note that the above diagram assumes a buffer size of 1K.
How about streaming writes?
When dealing with huge quantities of data, the method of sequential transfer of small sectors can lead to a pronounced overall read-write head positioning delay and there is also this additional overhead due to multiple set up and tear down cycles of transfers.
Lets consider an example of a 16K buffer write operation consisting of four 4K pages, the memory mapping for these 4 pages are given in the table below. For example, sectors 0 to 3 of a file is located in the RAM address 0x1000 to 0x1FFF, and it’s mapped to storage blocks 10, 11, 12 & 13.
Blindly splitting the 16K into 16 buffers and writing them one by one will result in as many transfer set ups and teardowns, and also an equivalent number of read-write head position delays.
How about we reorder the writing of pages in the following manner?
Page1 >>> Page 3 >>> Page 2 >>> Page 4
The above sequence is the result of sorting the pages in an increasing order of block mappings, ie: from 10 till 25, a closer look will also reveal that the adjacent transfers have addresses running from 0x1000 till 0x4FFF and hence contiguous in RAM memory. This results in an uninterrupted transfer of 16K bytes, which seems to be the best case optimization.
How does Linux achieve this gathering?
Let’s try to comprehend how bock driver subsystem skews application’s order of writes to achieve the previously described optimization.
Step 1: A Buffer is encapsulated in a BIO structure.
File system buffers are encapsulated in a BIO structure (submit_bh function can give the details). At this stage all the 16 buffers in our example are wrapped inside its corresponding BIO struct. We can see the mapping in the below table.
Step 2: A BIO is plugged into a request.
BIOs are sequentially submitted to the below layers and the those mapped to adjacent blocks are clubbed into a single “request“. With the addition of the request struct we have the below updated mapping table, as you can see “Request 0” combines BIO 0,1,2 & 3.
Prior to adding the requests to the global block driver queue there is an intermediate task_struct queue, this is an obvious optimization aimed at clubbing the adjacent BIOs into one request and to minimize contention for the block driver queue. Yes, a global queuing operation cannot happen in parallel to another queuing or a de-queuing.
Step 3: Merge requests into the global queue.
Merging a request into the global queue will reorder and coalesce the adjacently mapped “requests” into one and in our case we will finally have only one single request which will envelop all the BIOs.
Step 4:Finally we need to identify the contiguous memory segments.
Block device driver will identify consecutive memory segments across adjacent BIOs and club them before initiating a transfer, this will reduces the I/O setup – tear down cycles. So we will have a segment attribute added to our final table which is given below.
The block device driver can also have a request scheduler, this scheduler algorithm will prioritize and pick requests to be de-queued from the global queue for I/O processing. Please note that the steps for read requests also follow the above framework.
The above bells and whistles are irrelevant when it comes to flash memories, so avoiding all this baggage might make a lot of sense.