SD CARD protocol itself provides a rough blueprint of the internal design, but sometimes you can implement particular use cases hinging on the hints provided by the spec to expose certain limitations and strengths of the card machinery. It’s well-known that an SD CARD runs on flash, usually a NAND MLC or an SLC but superficially it behaves nothing like a flash and the user need not fret about block erases, bit flips or wear leveling. All the complexities of a flash seemingly abstracted inside a black box, but we could still attempt to unearth certain aspects of this chaos.
Boot Code
Once we supply the voltage, card internally figures out whether it’s being used with SPI or SD bus by sampling the card detect pin (pin 1), which is pulled high only for SD mode. Protocol implemented for these interfaces differ significantly, so their software stacks also differ and of course there also exist a small micro-controller driving this SD firmware.
The boot code checks pin 1, identifies the interface and jumps to the corresponding software interface stack, now lets consider that the card is connected over SD bus. The boot code ensures that the voltage supplied by the host is appropriate by using CMD8 & ACMD41, this boot process is driven by the host and we can imagine that the card start-up is completed only when the software enters what the spec defines as the transfer state.
The hardware pins are shared across both the modes, so there has to be an internal multiplexer which will configure the controller to drive the pins either with SD or with the SPI physical layer protocol. We can imagine that the chip set used will include the corresponding hardware IPs or at least some form of software implementation of the SD, SPI & multiplexer logic.
Transfer Mode
Boot up is completed once we enter the transfer mode, here the SD state machine is primarily meant to service I/O requests like sector reads, writes and erases. A DMA engine is most probably utilized for ensuring that the internal flash transfers happen in parallel with the SD I/Os, in other words the card might be transferring NAND pages to the flash while its receiving more data from the host via the SD interface. The read and write operations are pipe-lined and buffered to ensure the maximum throughput, especially for higher grade class 10 cards.
Translation Layer
Flash contents cannot be over-written without intermediate erases. So, for any SDCARD I/O operation there exist a ‘translation’ from virtual to physical address. For example, lets consider that we wrote to the sector 1024 on the SDCARD, and this address got initially mapped to physical sector 1024 itself on flash. Later we overwrite the contents, but this time write cannot go to the same physical address because it needs to be erased before programmed again. Now the translation layer will have to remap the virtual address 1024 to another erased and clean physical address, say 2048. So, now the virtual sector 1024 translates to a different physical address. Similarly, every write to virtual address gets mapped to a physical address and every subsequent read of the location is translated to the very same address. These mappings are stored separately by the firmware and this way a translation layer will abstract the flash complexities from the card user who remains oblivious to the actual physical address locations. Obviously, the erase of dirty pages need to happen before they can be reused.
Every translation layer also requires a size for its maps, spec confirms that a card complying with speed class specification (class 2, 4, 6 & 10) will implement internal map size equal to its ‘Allocation Unit” (AU) size, this can be read from one of the card registers. This means that the virtual to physical maps will be of AU size, which is typically 4MB.

Flash Translation Layer Algorithm
Reads are straightforward, all SDCARD needs to do is access its mapping information and read the corresponding physical AU. Writes can be tricky depending on the use case, lets consider two types.
a. Contiguous writes
Writes are sequential — and cross AU boundaries only after the previous physical AU was completely written. Here the card translation layer will select a fully erased physical AU for the write and then flush the contents continuously. After each AU write, the older physical AU is marked dirty and queued for an erase, while the translation is remapped to the newly written AU. An illustration is given below.
Consider virtual AU 10 was mapped to physical AU 30 (PAU 30) before write, and when the IO starts the map algorithm selects (Physical) AU40 as the new physical map for virtual AU10 (VAU 10).

Write starts at PAU40. After 4MB write ends, the VAU 10 presently mapped to PAU 30 gets remapped to PAU 40.

PAU 30 is marked as dirty and added to the erase queue.
Quite easily the most simple and the fastest use case.
b. Random writes across AUs
Here we write at block addressed which are located at different AUs, quite similar to the use case where the FAT table and directory entry contents need to be updated in between a file write. Consider the following illustration:
Step one: 1MB write of file contents to VAU 10.
Step two: 512 byte FAT table write to VAU 1.
&
Step three: 512 byte directory entry write to VAU 8.
The first step will write 1 MB at VAU10 and then the file system moves on to write the FAT table located on VAU1, consider that before being written the VAU10 was previously fully used and contained 3MB of valid and 1MB of dirty data (3 + 1 = 4MB).
The steps executed inside the FTL with regards to the above file system operations are illustrated below.
File clusters inside VAU10 was initially mapped to PAU20, when the write started a fully erased PAU 35 was selected as the new map (writes cannot happen on dirty flash blocks!).

1MB of file contents were written to PAU35. Now 3MB of VAU 10 contents reside in PAU 20 and rest of the newly written 1MB is in PAU35.

One VAU cannot be mapped to two physical AUs, so we have two options:
a. Either move the 3MB contents from PAU20 to PAU35 and then remap VAU10 to PAU35.

or
b. Do a partial erase of PAU 20 and move the new 1 MB from PAU35 to that location and maintain the same old map.

SDCARDs might be using either one of the above two options, in the first case map information will be updated (VAU10->PAU20 modified to VAU10->PAU35) while the second case maintains the same map (VAU10 -> PAU20). Both the cases incur an overhead and this weighs heavily on the SDCARD write performance.
The above details alludes to the fact that an SDCARD tends to select an AU to which we write as the “active one”, so any writes to this ‘Active AU’ will have minimal overhead but as soon as we switch the write location outside of this ‘Active AU’ the map maintenance kicks in — essentially the garbage collection. Some of the class 10 cards can accommodate two active AUs and hence the overhead happens only during the above Step Three of the FAT file system write illustration given above, while the switch to the Step Two happens without a glitch.

Above graph illustrates the write times for the following three use cases:
1. Interleaved writes switch across VAUs continuously, first 4K bytes are written to an address on VAU1, second 4K to an address on VAU2 and so on.
2. Random writes switch VAUs for every fifth write, so 4K writes happen to four locations within VAU1 before it moves to VAU2 and so on. Here the number of times we switch VAUs are less that the first case.
3. Streaming 4K write is contiguous, AU switch happens only after the previous one is completely written, so there is no merge overhead and the number of times we switch AUs are least here.
Block Associative Sector Translation
Locality of reference is integral for achieving optimal write throughput. Quite visible from the graph that writes are faster when its contiguous, while its far from optimal when they are done across AU boundaries. Most probably SD CARDs should be employing a variant of block associative sector translation.
More on this topic – SDCARD Speed Class