NAND – Embedded Sense

SD Card Speed Class

February 15, 2014June 8, 2016Mahesh Sreekandath1 Comment

All new SDCARDs conform to one of the speed classes defined by the specification, it’s usually stamped on top of the card and they are quite relevant for multiple reasons.

It defines the speed at which you can clock the card, 50MHz for class 10 and otherwise 25MHz.
Also alludes to the RU size, higher class cards pack more internal buffering.
Finally its primary purpose is of conveying the maximum write speed, class 10 being the fastest.

SD Bus Clock

High clocking capability means better internal circuit and more the bill of material$, class 10 cards are said to be capable of high speed (50MHz) mode, at least that is what the spec claims but the scriptures may not always reflect the reality. Running an intensive I/O test showed that in practice even a 44MHz clock tends to choke most of the cards, actually two out of the four popular makes exhibited inconsistencies. Even though card’s CSD register mentioned 50MHz, the general card behavior tends to be intolerant to intensive use cases at higher clocks, sometimes even before switching to high speed mode (via CMD6) the CSD register claim 50MHz compliance which seems contrary to the spec and when practiced bricks or crashes the card. The root cause of the high speed non-compliance is not clear to me but the fact that certain cards are more tolerant than the others should be kept in mind while testing the system.

Buffering

Internal SDCARD buffers expedite I/O by implementing a producer-consumer like scenario, Class 4 cards tend to have an internal buffer (RU) size of 32K while for class 6 it goes up to 64K and for Class 10 a whopping 512K, not surprising that the price of the card also follows a similar pattern, runs from $5 to $10 & to around $30 for high speed cards. Optimal use of cards would be to write in terms of the RU sizes so that the card buffers are utilized to the maximum, writing more than the RU size will make the host unnecessarily wait for the card to flush data to internal flash and transmitting very small packets will only keep the SDCARD firmware unusually idle. Performance depends on the depth and also on the maximum utilization of available computational units within the pipeline. So, with regards to SDCARD also, we need to simultaneously engage the host and the card which eventually leads to better throughput.

Write Speed

Time taken to write 128 MB of data was sampled over a dozen cards, Class 10 PNY card was the fastest at 17 secs while the slowest was 88 secs for the Transcend class 4 card. Below we have a graph which shows how the time to write is inversely proportional to the cost and the RU size and the speed class of SDCARDs. Embedded host used for this exercise was an Embedded Artist EA 3250 board and it runs a 266MHz NXP controller.

Inference

Accurate class differentiation emphasized by the spec gets a tad obscured when it trickle down into the eventual product which seems to vary in behavior and performance capabilities across and within the same class. The challenge with SDCARD host productization is primarily that of interoperability testing, here we have considered only a dozen makes which is popular in the US, undoubtedly once we account for the China market there will be little semblance of sanity left in the above graph.

More on this topic – An account on SDCARD internals.

Linux Buffer Cache and Block Device

November 4, 2012June 11, 2016Mahesh Sreekandath4 Comments

Linux Content Index

— File System Architecture – Part I
–– File System Architecture – Part II
–– File System Write
–– Buffer Cache
–– Storage Cache

Working on a flash file system port to Linux kernel mandates a good understanding of its buffer management, sometimes the excavation of internals can be a bit taxing but it’s needed for executing a well integrated port. Hopefully this write up will help someone who is looking to get a good design overview of the kernel buffer/page cache.

VFS work on top of pages and it allocates memory with a minimum granularity of the page size (usually 4K) but a storage device is capable of an I/O size much smaller than that (typically 512 bytes). A case where the device write granularity is set to page size can result in redundant writes, for example if an application modifies 256 bytes of a file then the corresponding page in cache is marked as dirty & flushing of this file results in a transfer of 4K bytes because there is no mechanism for the cache to identify the dirty sector within this dirty PAGE.

Buffer Cache bridges the page and the block universes, a page associated with a file will be divided into buffers & when a page is modified only the corresponding buffer is set as dirty, so while flushing only the dirty buffer gets written. In other words a buffer cache is an abstraction done on top of a page, its simply a different facet to the same RAM memory. The data structure called “buffer_head” is used for managing these buffers, the below diagram is a rather crass depiction of this system.

Buffer Cache can traverse the list of these buffers associated with a page via the “buffer_head” linked list and identify their status; note that the above diagram assumes a buffer size of 1K.

How about streaming writes?

When dealing with huge quantities of data, the method of sequential transfer of small sectors can lead to a pronounced overall read-write head positioning delay and there is also this additional overhead due to multiple set up and tear down cycles of transfers.

Lets consider an example of a 16K buffer write operation consisting of four 4K pages, the memory mapping for these 4 pages are given in the table below. For example, sectors 0 to 3 of a file is located in the RAM address 0x1000 to 0x1FFF, and it’s mapped to storage blocks 10, 11, 12 & 13.

Blindly splitting the 16K into 16 buffers and writing them one by one will result in as many transfer set ups and teardowns, and also an equivalent number of read-write head position delays.

How about we reorder the writing of pages in the following manner?

Page1 >>> Page 3 >>> Page 2 >>> Page 4

The above sequence is the result of sorting the pages in an increasing order of block mappings, ie: from 10 till 25, a closer look will also reveal that the adjacent transfers have addresses running from 0x1000 till 0x4FFF and hence contiguous in RAM memory. This results in an uninterrupted transfer of 16K bytes, which seems to be the best case optimization.

How does Linux achieve this gathering?

Let’s try to comprehend how bock driver subsystem skews application’s order of writes to achieve the previously described optimization.

Algorithm implemented by the above architecture is explained below

Step 1: A Buffer is encapsulated in a BIO structure.
File system buffers are encapsulated in a BIO structure (submit_bh function can give the details). At this stage all the 16 buffers in our example are wrapped inside its corresponding BIO struct. We can see the mapping in the below table.

Step 2: A BIO is plugged into a request.
BIOs are sequentially submitted to the below layers and the those mapped to adjacent blocks are clubbed into a single “request“. With the addition of the request struct we have the below updated mapping table, as you can see “Request 0” combines BIO 0,1,2 & 3.

Prior to adding the requests to the global block driver queue there is an intermediate task_struct queue, this is an obvious optimization aimed at clubbing the adjacent BIOs into one request and to minimize contention for the block driver queue. Yes, a global queuing operation cannot happen in parallel to another queuing or a de-queuing.

Step 3: Merge requests into the global queue.
Merging a request into the global queue will reorder and coalesce the adjacently mapped “requests” into one and in our case we will finally have only one single request which will envelop all the BIOs.

Step 4:Finally we need to identify the contiguous memory segments.
Block device driver will identify consecutive memory segments across adjacent BIOs and club them before initiating a transfer, this will reduces the I/O setup – tear down cycles. So we will have a segment attribute added to our final table which is given below.

The block device driver can also have a request scheduler, this scheduler algorithm will prioritize and pick requests to be de-queued from the global queue for I/O processing. Please note that the steps for read requests also follow the above framework.

Inference

The above bells and whistles are irrelevant when it comes to flash memories, so avoiding all this baggage might make a lot of sense.

FUSE File System Performance on Embedded Linux

May 22, 2012July 11, 2016Mahesh Sreekandath1 Comment

We ported and benchmarked a flash file system to Linux running on an ARM board. Porting was done via FUSE, a user space file system mechanism where the file system module itself runs as a process inside Linux. The file I/O calls from other processes are eventually routed to the FUSE process via inter process communication. This IPC is enabled by a low level FUSE driver running in the kernel.

http://en.wikipedia.org/wiki/Filesystem_in_Userspace

The above diagram provides an overview of FUSE architecture. The ported file system was proprietary and was not meant to be open sourced, from this perspective file system as a user space library made a lot of sense.

Primary bottleneck with FUSE is its performance. The control path timing for a 2K byte file read use-case is elaborated below. Please note that the 2K corresponds to NAND page size.

1. User space app to kernel FUSE driver switch. – 15 uS

2. Kernel space FUSE to user space FUSE library process context switch. – 1 to15 mS

3. Switch back into kernel mode for flash device driver access – NAND MTD driver overhead without including device delay is in uS.

4. Kernel to FUSE with the data read from flash – 350uS (NAND dependent) + 15uS + 15uS (Kernel to user mode switch and back)

5. From FUSE library back to FUSE kernel driver process context switch. – 1 to 15mS

6. Finally from FUSE kernel driver to the application with the data – 15 uS

As you can see, the two process context switches takes time in terms of Milliseconds, which kills the whole idea. If performance is a crucial, then profile the context switch overhead of an operating system before attempting a FUSE port. Seems loadable kernel module approach would be the best alternative.

NAND flash musings

July 5, 2010June 6, 2016Mahesh Sreekandath1 Comment

Been quite some time since the last post, life has been busy, thanks to the NAND flash chips from Toshiba & Samsung. Ironic enough their seemingly naive data sheets introduce NAND as an angelic technology. Simple protocols, even more simple hardware interface. A totally reasonable requirement placed on driver to fix one bit errors and detect two bit errors (which is not supposed to happen but still for some unknown reason vendors mention this requirement too, would be ecstatic to know why). A touch of complexity is felt only when bad blocks are encountered, which is totally fair considering the cost effectiveness of NANDs.

My initial impression of NAND being a fairly simple fixed hassle free storage media was progressively crushed to shreds during the last one year of NAND torments. Have worked only on SLC NANDs from Toshiba & Samsung, they are extensively used on mobile handset platforms. So, MLC is an unknown inferno to me. Hopefully the below mentioned points might help the posterity from enduring the crisis. Always remember to religiously follow the Data sheet (henceforth referred to as “the book”) for NAND salvation.

Keep innovative operation sequences for hobby projects.

Do not try stuff like NAND reset command during NAND busy unless the book clearly explains the behavior of its effect on read, program and erase operation with a CLEAR timing diagram.
Do NOT use read back check to detect bad blocks unless that is mentioned as one of the methods in the book
MORAL : Follow ONLY what is written in the book, do not infer or even worse assume.

Read wear leveling cannot prevent bit errors nor can erase refresh solve bit errors.

I have managed to induce bit errors on Samsung NAND flash when partial page writes are executed beyond the maximum number specified for a page and also by executing multiple partial page reads. Interestingly, even after continuous block erases, the single bit read errors refused to disappear.
Any deviation from the strict protocol mentioned in the book can result in manifestation of strange symptoms.
BTW: A deterministic read wear count is a myth unless it is mentioned in the book.
MORAL : Symptoms and root causes never have 1:1 ratio.

Never go back and correct mistakes within a block

Samsung NAND flashes “prohibits” going back to a lesser numbered page in a block and reprogramming it (for Eg: Do not program page 10 after programming page 20 within a block) the effect of such an operation is not documented so you do not know the symptoms which can incarnate in any form.
Go ahead and question the logic of any file system which does random page programming in a block to mark dirty pages!.
MORAL:Do not question what the book says, just blindly follow it.