All new SDCARDs conform to one of the speed classes defined by the specification, it’s usually stamped on top of the card and they are quite relevant for multiple reasons.
It defines the speed at which you can clock the card, 50MHz for class 10 and otherwise 25MHz.
Also alludes to the RU size, higher class cards pack more internal buffering.
Finally its primary purpose is of conveying the maximum write speed, class 10 being the fastest.
SD Bus Clock
High clocking capability means better internal circuit and more the bill of material$, class 10 cards are said to be capable of high speed (50MHz) mode, at least that is what the spec claims but the scriptures may not always reflect the reality. Running an intensive I/O test showed that in practice even a 44MHz clock tends to choke most of the cards, actually two out of the four popular makes exhibited inconsistencies. Even though card’s CSD register mentioned 50MHz, the general card behavior tends to be intolerant to intensive use cases at higher clocks, sometimes even before switching to high speed mode (via CMD6) the CSD register claim 50MHz compliance which seems contrary to the spec and when practiced bricks or crashes the card. The root cause of the high speed non-compliance is not clear to me but the fact that certain cards are more tolerant than the others should be kept in mind while testing the system.
Internal SDCARD buffers expedite I/O by implementing a producer-consumer like scenario, Class 4 cards tend to have an internal buffer (RU) size of 32K while for class 6 it goes up to 64K and for Class 10 a whopping 512K, not surprising that the price of the card also follows a similar pattern, runs from $5 to $10 & to around $30 for high speed cards. Optimal use of cards would be to write in terms of the RU sizes so that the card buffers are utilized to the maximum, writing more than the RU size will make the host unnecessarily wait for the card to flush data to internal flash and transmitting very small packets will only keep the SDCARD firmware unusually idle. Performance depends on the depth and also on the maximum utilization of available computational units within the pipeline. So, with regards to SDCARD also, we need to simultaneously engage the host and the card which eventually leads to better throughput.
Time taken to write 128 MB of data was sampled over a dozen cards, Class 10 PNY card was the fastest at 17 secs while the slowest was 88 secs for the Transcend class 4 card. Below we have a graph which shows how the time to write is inversely proportional to the cost and the RU size and the speed class of SDCARDs. Embedded host used for this exercise was an Embedded Artist EA 3250 board and it runs a 266MHz NXP controller.
Accurate class differentiation emphasized by the spec gets a tad obscured when it trickle down into the eventual product which seems to vary in behavior and performance capabilities across and within the same class. The challenge with SDCARD host productization is primarily that of interoperability testing, here we have considered only a dozen makes which is popular in the US, undoubtedly once we account for the China market there will be little semblance of sanity left in the above graph.
SD CARD protocol itself provides a rough blueprint of the internal design, but sometimes you can implement particular use cases hinging on the hints provided by the spec to expose certain limitations and strengths of the card machinery. It’s well-known that an SD CARD runs on flash, usually a NAND MLC or an SLC but superficially it behaves nothing like a flash and the user need not fret about block erases, bit flips or wear leveling. All the complexities of a flash seemingly abstracted inside a black box, but we could still attempt to unearth certain aspects of this chaos.
Once we supply the voltage, card internally figures out whether it’s being used with SPI or SD bus by sampling the card detect pin (pin 1), which is pulled high only for SD mode. Protocol implemented for these interfaces differ significantly, so their software stacks also differ and of course there also exist a small micro-controller driving this SD firmware.
The boot code checks pin 1, identifies the interface and jumps to the corresponding software interface stack, now lets consider that the card is connected over SD bus. The boot code ensures that the voltage supplied by the host is appropriate by using CMD8 & ACMD41, this boot process is driven by the host and we can imagine that the card start-up is completed only when the software enters what the spec defines as the transfer state.
The hardware pins are shared across both the modes, so there has to be an internal multiplexer which will configure the controller to drive the pins either with SD or with the SPI physical layer protocol. We can imagine that the chip set used will include the corresponding hardware IPs or at least some form of software implementation of the SD, SPI & multiplexer logic.
Boot up is completed once we enter the transfer mode, here the SD state machine is primarily meant to service I/O requests like sector reads, writes and erases. A DMA engine is most probably utilized for ensuring that the internal flash transfers happen in parallel with the SD I/Os, in other words the card might be transferring NAND pages to the flash while its receiving more data from the host via the SD interface. The read and write operations are pipe-lined and buffered to ensure the maximum throughput, especially for higher grade class 10 cards.
Flash contents cannot be over-written without intermediate erases. So, for any SDCARD I/O operation there exist a ‘translation’ from virtual to physical address. For example, lets consider that we wrote to the sector 1024 on the SDCARD, and this address got initially mapped to physical sector 1024 itself on flash. Later we overwrite the contents, but this time write cannot go to the same physical address because it needs to be erased before programmed again. Now the translation layer will have to remap the virtual address 1024 to another erased and clean physical address, say 2048. So, now the virtual sector 1024 translates to a different physical address. Similarly, every write to virtual address gets mapped to a physical address and every subsequent read of the location is translated to the very same address. These mappings are stored separately by the firmware and this way a translation layer will abstract the flash complexities from the card user who remains oblivious to the actual physical address locations. Obviously, the erase of dirty pages need to happen before they can be reused.
Every translation layer also requires a size for its maps, spec confirms that a card complying with speed class specification (class 2, 4, 6 & 10) will implement internal map size equal to its ‘Allocation Unit” (AU) size, this can be read from one of the card registers. This means that the virtual to physical maps will be of AU size, which is typically 4MB.
Flash Translation Layer Algorithm
Reads are straightforward, all SDCARD needs to do is access its mapping information and read the corresponding physical AU. Writes can be tricky depending on the use case, lets consider two types.
a. Contiguous writes
Writes are sequential — and cross AU boundaries only after the previous physical AU was completely written. Here the card translation layer will select a fully erased physical AU for the write and then flush the contents continuously. After each AU write, the older physical AU is marked dirty and queued for an erase, while the translation is remapped to the newly written AU. An illustration is given below.
Consider virtual AU 10 was mapped to physical AU 30 (PAU 30) before write, and when the IO starts the map algorithm selects (Physical) AU40 as the new physical map for virtual AU10 (VAU 10).
Write starts at PAU40. After 4MB write ends, the VAU 10 presently mapped to PAU 30 gets remapped to PAU 40.
PAU 30 is marked as dirty and added to the erase queue.
Quite easily the most simple and the fastest use case.
b. Random writes across AUs Here we write at block addressed which are located at different AUs, quite similar to the use case where the FAT table and directory entry contents need to be updated in between a file write. Consider the following illustration:
Step one: 1MB write of file contents to VAU 10. Step two: 512 byte FAT table write to VAU 1.
& Step three: 512 byte directory entry write to VAU 8.
The first step will write 1 MB at VAU10 and then the file system moves on to write the FAT table located on VAU1, consider that before being written the VAU10 was previously fully used and contained 3MB of valid and 1MB of dirty data (3 + 1 = 4MB).
The steps executed inside the FTL with regards to the above file system operations are illustrated below.
File clusters inside VAU10 was initially mapped to PAU20, when the write started a fully erased PAU 35 was selected as the new map (writes cannot happen on dirty flash blocks!).
1MB of file contents were written to PAU35. Now 3MB of VAU 10 contents reside in PAU 20 and rest of the newly written 1MB is in PAU35.
One VAU cannot be mapped to two physical AUs, so we have two options:
a. Either move the 3MB contents from PAU20 to PAU35 and then remap VAU10 to PAU35.
b. Do a partial erase of PAU 20 and move the new 1 MB from PAU35 to that location and maintain the same old map.
SDCARDs might be using either one of the above two options, in the first case map information will be updated (VAU10->PAU20 modified to VAU10->PAU35) while the second case maintains the same map (VAU10 -> PAU20). Both the cases incur an overhead and this weighs heavily on the SDCARD write performance.
The above details alludes to the fact that an SDCARD tends to select an AU to which we write as the “active one”, so any writes to this ‘Active AU’ will have minimal overhead but as soon as we switch the write location outside of this ‘Active AU’ the map maintenance kicks in — essentially the garbage collection. Some of the class 10 cards can accommodate two active AUs and hence the overhead happens only during the above Step Three of the FAT file system write illustration given above, while the switch to the Step Two happens without a glitch.
Above graph illustrates the write times for the following three use cases:
1. Interleaved writes switch across VAUs continuously, first 4K bytes are written to an address on VAU1, second 4K to an address on VAU2 and so on.
2. Random writes switch VAUs for every fifth write, so 4K writes happen to four locations within VAU1 before it moves to VAU2 and so on. Here the number of times we switch VAUs are less that the first case.
3. Streaming 4K write is contiguous, AU switch happens only after the previous one is completely written, so there is no merge overhead and the number of times we switch AUs are least here.
Block Associative Sector Translation
Locality of reference is integral for achieving optimal write throughput. Quite visible from the graph that writes are faster when its contiguous, while its far from optimal when they are done across AU boundaries. Most probably SD CARDs should be employing a variant of block associative sector translation.