System Optimization: From Feature Phones to Data Lakes

Introduction

What motivates system optimization? Often, it is the arrival of new customer demands: a higher throughput requirement, a tighter latency target, or a new use case that pushes a system beyond its original specifications. In short, the need for optimization begins when an existing system is unable to adapt to new expectations. The task can then be split into two parts: first, to comprehend the current design constraints; and second, to determine what it would take to adapt.

Step 1 : Comprehension – The Why

Why is the current system unable to satisfy the performance specification? The first step is to analyze the design through the lens of requirements — to understand how various use cases impact the utilization of resources such as CPU cycles, storage bandwidth, and memory footprint. Such analysis provides insights and hypotheses about possible systemic imbalances. By modeling use cases and capturing performance metrics, these hypotheses can be validated or refined. Depending on the scale and complexity of the system, isolating the bottleneck may require several iterations. Still, measurement is vital to confirming assumptions and deriving solutions.

Step 2 : Solutions – The How

The details of the bottleneck identified through measurements form the input to the next step — identifying a solution. Captured data should highlight which resources are heavily utilized (i.e., expensive) and which remain relatively idle. The goal is to target these imbalances at the architectural, algorithmic, or design level. In other words, every subsystem should be efficiently utilized throughout the execution of a use case. This balancing act typically involves either reducing contention or introducing greater concurrency. Contention can be reduced by optimizing algorithms, adding caches, batching operations, or improving prioritization. Concurrency, on the other hand, often requires reworking the pipeline.

Practicing What We Preach

The following real-world industry examples illustrate how this methodology can be applied in practice. They also show the challenges that arise when theory meets practice in projects of different scales and complexities.

Asynchronous Raw NAND Writes

Similar to how branch prediction and multi-issue execution keep CPU pipelines busy, concurrency can be applied at every layer of a computing system to ensure efficient resource usage. One such example came from a NAND flash file system throughput improvement, where the customer demanded a 50-60% improvement in write throughput. The existing design was simple and effective, until Samsung determined it was inadequate for its next generation of feature phones.

Step 1 : Original Design

As shown below (Fig 1), all components executed sequentially.

Fig 1 : Existing Stack

In a resource-constrained RTOS environment, this simplicity had its benefits. But with a new performance specification, it became the limitation.

End-to-end latency was around 400μs, of which 250μs was spent by the CPU waiting on NAND programming. In effect, the CPU — running the application and file system — was idle the majority of the time. Measurements identified the bottleneck — Step 1 completed.

Step 2 : Improved Design – Concurrency

Fig 2 below shows the new architecture. Since the bottleneck was obvious, the solution was to introduce concurrency between the CPU and raw NAND by making the lowest-level programming interface asynchronous. Alternative solutions like asynchronous I/O at the higher file system or application level would have been more expensive and likely much less effective, because they can all create resource contention further down the stack.

Fig 2 : Improved Stack with concurrency between CPU and NAND

A Cache

Introducing a simple cache allowed NAND programming to be decoupled from the rest of the sequence. The file system now just updates the cache, while the actual programming of NAND happens in the background. This allowed both the CPU and the NAND to execute concurrently.

  • Throughput improvement: >70%
  • Trade-offs: additional memory usage and an acceptable increase in system complexity

While conceptually simple, the implementation within a memory-constrained RTOS raised interesting challenges. Cache memory management had to be carefully tuned for efficiency, and NAND bad block management had to be performed asynchronously. These intricacies led to an interesting debate on patentability and an eventual grant.
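A minimal sketch of the idea is shown below, assuming a hypothetical RTOS environment. The names (wcache_entry, nand_program_page, rtos_wait/rtos_signal) are illustrative placeholders rather than the shipped implementation, and bad block handling and error paths are omitted.

[code lang="cpp"]
#include <string.h>

/* Minimal sketch of the cached, asynchronous write path in a
** hypothetical RTOS. wcache_entry, nand_program_page and the rtos_*
** primitives are illustrative placeholders, not the shipped code. */

#define WCACHE_DEPTH    8
#define PAGE_SIZE       2048

#define NAND_WORK_EVENT 1
#define NAND_DONE_EVENT 2

extern void rtos_wait(int event);
extern void rtos_signal(int event);
extern void nand_program_page(unsigned int page_addr,
                              const unsigned char *buf);

struct wcache_entry
{
    unsigned int  page_addr;            /* target NAND page   */
    unsigned char data[PAGE_SIZE];      /* buffered page data */
};

static struct wcache_entry wcache[WCACHE_DEPTH];
static volatile int w_head, w_tail;     /* ring buffer indices */

/* File system write path: copy into the cache and return. It blocks
** only when the ring is full, i.e. when NAND truly is the limiter. */
int fs_write_page(unsigned int page_addr, const unsigned char *buf)
{
    while (((w_head + 1) % WCACHE_DEPTH) == w_tail)
        rtos_wait(NAND_DONE_EVENT);     /* back-pressure */

    wcache[w_head].page_addr = page_addr;
    memcpy(wcache[w_head].data, buf, PAGE_SIZE);
    w_head = (w_head + 1) % WCACHE_DEPTH;

    rtos_signal(NAND_WORK_EVENT);       /* wake the background task */
    return 0;
}

/* Background task: drains the cache while the CPU runs ahead. */
void nand_flush_task(void)
{
    for (;;)
    {
        while (w_tail == w_head)
            rtos_wait(NAND_WORK_EVENT);

        nand_program_page(wcache[w_tail].page_addr, wcache[w_tail].data);
        w_tail = (w_tail + 1) % WCACHE_DEPTH;
        rtos_signal(NAND_DONE_EVENT);
    }
}
[/code]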

The Verdict

This example illustrates how the two-step methodology delivered results and even intellectual property. The next example applies the same principles within a more complex environment, with similar results.

Linux File System Optimization

The requirement here was to maximize streaming read/write throughput for a proprietary file system — ideally outperforming native solutions like Ext4 and UBIFS. This project required running a benchmark such as IOzone and iteratively isolating bottlenecks across the storage stack. The simplified architecture diagram (Fig 3) highlights the critical paths uncovered by measurement. (See this link for more design details about the Linux storage stack.)

Step 1 : Design Resource Contentions

Fig 3 : Existing Stack with two levels of contention

Two major bottlenecks were identified through design review and measurements:

  1. File System Contention — Multiple applications accessing the file system introduced latencies in the tens of microseconds.

  2. Block Driver Contention — Concurrent access from the file system and page cache to the block device caused latencies on the order of single-digit seconds.

The page cache is essentially a data cache from the file system user's perspective, while the block driver is the storage device management layer that abstracts disk I/O functions.

Step 2: Caches

File System Contention Resolution

During streaming I/O, the primary role of a file system in Linux is to map file offsets to storage block addresses. Concurrent access from multiple processes to the file system for this mapping operation became a source of contention. Optimizing the file system mapping algorithm itself would have been expensive, so the solution was to prefetch and cache the mapping data at the higher VFS layer.

To minimize overhead, prefetching was triggered only when streaming I/O was detected — essentially repeated reads or writes to contiguous offsets. This cache effectively reduced latency from tens of microseconds to ~5 μs on average.
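The sketch below illustrates the idea under simplifying assumptions. The names (extent_cache, fs_map_extent, vfs_map_cached) are hypothetical stand-ins, not actual Linux VFS interfaces, and offsets are assumed to be expressed in blocks.

[code lang="cpp"]
#include <stdint.h>

/* Hypothetical extent cache kept above the file system mapping code.
** None of these names are real kernel interfaces; offsets and lengths
** are assumed to be in units of blocks for simplicity. */
struct extent_cache
{
    uint64_t file_off;      /* first cached file offset (blocks) */
    uint64_t block_addr;    /* storage block backing that offset */
    uint64_t len;           /* length of the contiguous extent   */
};

/* Slow path: the file system's own (contended) mapping lookup,
** returning the whole contiguous extent starting at file_off. */
extern void fs_map_extent(int fd, uint64_t file_off,
                          uint64_t *block_addr, uint64_t *len);

/* Resolve a file offset to a block address, hitting the per-file
** cache for sequential (streaming) access patterns. */
uint64_t vfs_map_cached(int fd, uint64_t file_off, struct extent_cache *ec)
{
    /* Fast path: offset falls inside the prefetched extent */
    if (ec->len && file_off >= ec->file_off &&
        file_off < ec->file_off + ec->len)
        return ec->block_addr + (file_off - ec->file_off);

    /* Miss: consult the file system once and cache the whole extent,
    ** so subsequent streaming requests avoid the contended path. */
    ec->file_off = file_off;
    fs_map_extent(fd, file_off, &ec->block_addr, &ec->len);
    return ec->block_addr;
}
[/code]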

Block Driver Contention Resolution

Several strategies could have resolved this — optimizing the file system or application, increasing page cache memory, or adding dedicated storage paths for these relatively independent access streams. But the solution employed was much simpler and more targeted:

  1. Increase the in-memory cache for file system metadata.
  2. Disable automatic file system metadata syncs, and instead rely on a background sync that executes only when the file system is idle.

Fig 4 : Improved Stack

Now, during peak I/O activity, the block driver was practically dedicated to servicing page cache requests. Metadata syncs were also deferred to idle time windows. This combination of hierarchical caching and careful prioritization resolved the block driver bottleneck.
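A minimal sketch of the idle-time metadata sync policy is shown below; the helper names and the threshold are assumptions for illustration, not the actual implementation.

[code lang="cpp"]
#include <stdbool.h>

/* Idle-time metadata sync policy; all names and thresholds here are
** illustrative assumptions, not the actual implementation. */
#define IDLE_WINDOW_MS  500             /* how long "idle" must last */

extern bool     fs_is_idle_for(unsigned int ms);  /* no pending I/O?  */
extern unsigned fs_dirty_metadata(void);          /* dirty page count */
extern void     fs_sync_metadata(void);           /* actual flush     */
extern void     sleep_ms(unsigned int ms);

/* Background thread: metadata is flushed only when the file system
** has been idle long enough, so syncs never compete with streaming
** I/O for the block driver. */
void metadata_sync_thread(void)
{
    for (;;)
    {
        if (fs_dirty_metadata() && fs_is_idle_for(IDLE_WINDOW_MS))
            fs_sync_metadata();
        sleep_ms(IDLE_WINDOW_MS);
    }
}
[/code]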

The trade-offs were cache management overhead, an increase in memory footprint, and some risk of data loss due to the deferred background syncs. The file system itself was power-fail safe, so it was guaranteed to remain consistent across power failures, but file contents written after the last metadata sync could be lost. All of this was acceptable from the customer's use case perspective.

The Verdict

Identifying the bottlenecks was possible only by layering metrics across the stack — from the application level down to the block driver — then executing customer workloads and analyzing correlated spikes over time.

The solutions were then targeted to resolve specific resource contentions with minimal impact on other use cases. Even within a relatively complex stack like the Linux kernel, the two-step methodology of comprehension followed by problem solving was effective.

While the Linux FS project tackled performance inside a single device, the next example scales up to cloud infrastructure, where data pipelines process gigabytes of metrics across distributed systems.

AWS Data Pipeline

The requirement here was to collect gigabytes of metrics from a test network, then process and monitor them for key performance indicators. These metrics were generated by the verification target — a cloud storage transport protocol (SRD) which was still under development.  The project goal was to design a system to extract, transform, and load end-to-end test metrics from this storage stack.

Because the storage stack under test was experimental, the requirements for data transformation were open-ended. Debugging relied on running SQL queries and looking for correlations in collected metrics to isolate bottlenecks. The expectation was to support a wide range of queries with “reasonable” latency, despite little clarity on the specific use cases.

Open-Ended Requirements

The experimental nature of this project imposed three constraints:

  1. Quick project execution due to tight timelines
  2. A simple, adaptable architecture to handle unknown future use cases.
  3. Low latency access to stored data for interactive debug.

Step 1 : Data Lake and S3 Measurements

After evaluating multiple solutions, AWS S3, an unstructured object store, was chosen as the database backend. Compared to traditional databases, it offered lower maintenance overhead and schema flexibility. Even though it is a simple object store, when used to save structured files like JSON or CSV it can act as a database backend for services like AWS Athena and Redshift Spectrum. The drawback was performance: throughput fell below 1 MB/sec when accessing large numbers of small files. Measurements identified this as the primary bottleneck.

Step 2 : Partitions and ETL Chains

Partitioning

Partitioning was the first part of the solution. As AWS notes:

By partitioning your data, you can restrict the amount of data scanned by downstream analytics engines, thereby improving performance and reducing the cost for queries. 

Files corresponding to a single day's worth of data are placed under a prefix such as
s3://my_bucket/logs/year=2022/month=06/day=01/. We can then use a WHERE clause to query the data, for example: "SELECT * FROM table WHERE year=2022 AND month=06 AND day=01".
The preceding query reads only the data inside the partition folder year=2022/month=06/day=01 instead of scanning the files under all partitions.

With partitioning and by limiting SQL query ranges, access complexity dropped from O(n) to O(Log n) (see Fig 5 below).

Fig 5 : O(Log n) partitioned access

Data Lake ETL Chains

Initially, data was stored with year/month/day partitions. It was later extracted and transformed using AWS Athena, and the filtered output was redirected in CSV form into another partitioned S3 bucket for further queries and debugging.

For example:

The output of a query like "SELECT * FROM table WHERE year=2022 AND month=06 AND day=01 AND server_type='storage'" was redirected to the location s3://my_bucket/logs/server_type=storage/.

This created chained, tailored datasets for detailed analysis — in the above case, storage server metrics.

These ETL chains were developed as new requirements emerged, which allowed for a simple, scalable organization of large-scale data.

The Verdict

The two-step methodology applied here was:

  1. Identify — Small file access on S3 was measured and confirmed as the bottleneck.

  2. Resolve — Partitioning and ETL chaining were introduced to limit query scope and improve throughput.

Despite relying on a low-cost S3 backend, the system achieved acceptable query latencies for debugging, while remaining flexible enough to handle evolving requirements. This combination of simplicity and scalability allowed gigabytes of test data to be managed efficiently under aggressive development timelines.

This example proved that the two-step approach works at gigabyte scale; the next illustration dials it down to the kilobyte level and shows how the same principles can work on a wearable embedded device.

Noisy Neighbor

The problem here was data loss on shared diagnostic channels running on a resource-constrained wearable device (in this case, the now discontinued Amazon Halo). Diagnostics in this case involved logs, metrics, and other critical information required to assess the general health of the device. Several components used these channels, and it was difficult to pinpoint the “noisy neighbor” causing the drops. The usual solution of enforcing a per-component quota would have been too expensive in terms of complexity and overhead.

Following the two-step method, the first step in this case would be to instrument metrics to monitor bandwidth usage and gain a clear understanding of the problem.

Step 1 : Instrument Usage Metrics

Instrumented metrics tracked:

  • Minimum/maximum/average bandwidth utilization
  • Dropped byte count

The first three metrics monitored overall trends, while the dropped byte count highlighted actual data loss. Plotting dropped bytes over time allowed easy identification of the relevant time windows.

Fig 6 : Dropped bytes

Correlation & Cross-correlation

Focusing on logs or metrics captured during dropped-byte intervals like 12:05 or 12:20 helped narrow down:

  • What was happening (the active use case)

  • Who was involved (the specific component)

For example, correlating a spike at 12:05 with the actual logs pointed to eMMC or FAT32 as the likely culprits. Investigating the code path related to eMMC error handling helped further isolate the cause.

  • [12:01][1233][INFO][FAT32]File opened
  • [12:02][1234][INFO][FAT32]Reading LBA 20
  • [12:02][1235][INFO][eMMC]Device read initiated
  • [12:03][1236][ERROR][eMMC] Error interrupt
  • [12:03][1237][ERROR][eMMC] CRC error detected
  • [12:03][1238][ERROR][eMMC] Unrecoverable error returned
  • [12:07][1249][ERROR][FAT32] Directory entry read failed

When multiple diagnostic channels exist — for example, logs and metrics — cross-correlation can be even more powerful. By examining related metrics during a log drop window (or vice versa), you can triangulate the root cause with higher confidence.

Step 2 : Resolution through Prioritization

Once the noisy culprits were identified, the responsibility was on the component or use case designer to prioritize and reduce their usage.

The final question was — why is the dropped_bytes metric itself never dropped?

  • First, it is a low-frequency, aggregated metric.
  • Second, it is reported through the low-priority idle task. This prioritization ensures bandwidth is always available for it.

Unlike timing-sensitive inline metrics such as errors or warnings, these aggregate values can safely be given a lower priority. This design guarantees accurate reporting of overall system health without impacting inline, real-time diagnostics.
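A small sketch of this pattern is shown below, with hypothetical names; the point is that the inline path only bumps a counter, while the reporting happens from the idle task when bandwidth is guaranteed to be free.

[code lang="cpp"]
#include <stdint.h>

/* Aggregated drop counter -- names are illustrative placeholders. */
static volatile uint32_t dropped_bytes;

extern void diag_emit_metric(const char *name, uint32_t value); /* hypothetical */

/* Inline fast path: called whenever a diagnostic write cannot be
** queued. Only a counter update, never a channel write. */
void diag_on_drop(uint32_t len)
{
    dropped_bytes += len;
}

/* Runs from the low-priority idle task: by the time this executes,
** higher-priority producers are quiet, so channel bandwidth is
** available for the report itself. A real implementation would make
** the read-and-clear below atomic. */
void idle_task_report(void)
{
    uint32_t snapshot = dropped_bytes;

    if (snapshot)
    {
        dropped_bytes = 0;
        diag_emit_metric("dropped_bytes", snapshot);
    }
}
[/code]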

The Verdict

Once again, the two-step methodology proved effective:

  1. Identify by measuring and correlating metrics to isolate the noisy neighbor.

  2. Resolve by fixing the noisy element and by introducing prioritization within the telemetry.

While this example tackled a noisy neighbor within an embedded device, the next illustration explains how the same principles can be applied at a slightly bigger scale for memory optimization.

Dynamic Memory Allocation

The goal here was to reduce the Linux DRAM footprint on Amazon Alexa. As expected, initial measurements indicated the kernel footprint was relatively small compared to user space. With approximately 200 runtime processes, getting the overall measurements seemed like a complex task in itself.

Step 1 : The Haystack – Capturing data

With a high number of concurrent processes, the measurement itself demanded a methodical approach. The approach used was a combination of Linux pmap and Python scripting.

PMAP and Python Pandas

Linux provides a useful tool named pmap, which reports the memory map of a process. With 200 processes, there are 200 such maps, each with detailed information like size, permissions, and segment type (e.g., library, stack, or [anon] for dynamically allocated memory).

Example below:

[code]
12345:   /usr/bin/my_application
Address           Kbytes     RSS   Dirty Mode   Mapping
0000555a297b6000      12       8       0 r-x--  /usr/bin/my_application
0000555a297b9000       4       4       4 r----  /usr/bin/my_application
0000555a297ba000       4       4       4 rw---  /usr/bin/my_application
0000555a297bb000      64      64      64 rw---  [ anon ]
00007f9c87d46000    1864    1024       0 r-x--  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f18000     512      64       0 r----  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f98000      16      16      16 rw---  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f9c000      20      20      20 rw---  [ anon ]
00007f9c87fa1000     144     144       0 r-x--  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc5000       8       8       8 r----  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc7000       4       4       4 rw---  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc8000       4       4       4 rw---  [ anon ]
00007f9c87fca000    8192    8192    8192 rw---  [ anon ]
00007ffc12345000     132      12       0 rw---  [ stack ]
[/code]

The output of pmap roughly resembles a space-delimited CSV, so some simple scripting could clean it up, transform it into comma-delimited form, and then load it into a Pandas DataFrame.

A union of 200 such tables, one per process, provided a unified view of the whole system. Some SQL-style queries to slice, filter, and aggregate then provided different insights into the system-wide memory distribution.

Step 2 : Finding The Needle

Grouping rows in the unified table based on the different types of memory maps indicated [anon] (i.e., dynamic allocation) as the bottleneck. Studying the output stats of the Android native memory allocator showed some internal fragmentation. Diving a bit deeper and investigating the available memory configuration led to the discovery of the MALLOC_SVELTE option. Setting this to true enables a low-memory configuration.

MALLOC_SVELTE := true

As with the usual CPU/memory trade-off cases, enabling this can impact allocation latency. But it is also expected to improve memory utilization and is meant for memory-constrained systems. Measurements indicated a ~30% reduction in memory footprint, with some hit in latency for a subset of products.

The Verdict

Measurements led to identifying the exact bottleneck, and with that, a clear targeted solution with the highest return on investment was discovered. This one-line change to add an option reduced the memory footprint by roughly 30%, and the system was successfully reconfigured for a memory-constrained environment. In this particular case, measurement itself was the challenge, and it required a bit of a cross-disciplinary approach — applying some big data concepts to Linux system optimization.

The Finale

Whether it is an extinct feature phone or a modern data lake, the same two-step methodology delivers results. It works because it mirrors the scientific method: observe, hypothesize, test, refine. Applying this to real-world systems demands some creativity and a degree of patience. And even when it doesn’t fully deliver results, it leaves us with defensible measurements that explain why — and a clearer picture of where to look next.

Raspberry Pi 2 ARM-GPU IPC

Raspberry Pi 2 is powered by a Broadcom BCM2836 chipset. Along with typical peripheral IP blocks like UART, SD, etc., the BCM2836 ASIC integrates four ARM Cortex-A7 cores and a graphics processing engine. The primary intent of this post is to elaborate on the ARM-GPU inter-processor communication (IPC) mechanism, essentially to explain the hardware and software blocks involved in establishing a messaging channel between these two processors.

The basic infrastructure is illustrated by the diagram below: we have the ARM processor and the VideoCore engine with a shared SDRAM. Note that the SDRAM is an off-chip module, so there is automatically an external bus interface involved here. With these multiple processors and a shared SDRAM, the obvious IPC mechanism is shared memory. But before getting into more details, let’s have a brief overview of the RPi 2 boot.

System Overview

Boot Process

The BCM2836 boot binaries are located on the FAT-formatted micro SD card; the binaries for both the GPU and ARM are in the FAT root directory. Once powered up, the very first boot step is the execution of the BCM2836 on-chip boot ROM, which fetches the second-level bootloader, which in turn boots the GPU binary from the SD card. It is the GPU binary that is responsible for loading the ARM executable. More details about the boot process can be read here, but for now we can crudely summarize the steps in the following manner.

  1. Power ON.
  2. BCM2836 BOOT ROM fetches second level bootloader.
  3. Second level boot loader fetches GPU binary.
  4. GPU binary loads the ARM image and transfers control.

So in this post we are looking to explain certain aspects of the ARM IPC code which are essential for communicating with the GPU. Please note that the second-level bootloader and GPU binary images can be fetched from the GitHub repository. Now, let’s attempt to establish the fundamentals of this ARM-GPU IPC.

Shared Memory

To drive the graphics framebuffer, or to accomplish any such task involving both the ARM and GPU processing units, we require some meaningful message exchanges between these cores. For that we use the shared SDRAM. But note that the ARM-GPU combination is an asymmetric multiprocessing environment, so first we need to establish the processor-level split of this 1GB SDRAM. Basically, this defines how much RAM is dedicated to each of these cores. The division is defined in the “config.txt” configuration file located on the SD card; the relevant configuration parameter is mentioned below.

# GPU – CPU Split
gpu_mem=64

Note that “config.txt” is read by GPU during the boot process.

According to the documentation, the above “gpu_mem=64” setting should ideally allocate the top 64MB (addresses 0x3C000000 to 0x3FFFFFFF) of the SDRAM to the GPU and the rest to ARM. But in reality this does not seem to be the case: the top 80MB is actually taken by the GPU and the bottom 944MB goes to ARM. I had to figure this out the hard way by messaging the GPU for its memory configuration; more on this later. The final high-level split is illustrated below:

SDRAM Split

 

So now the binaries running on both ARM and GPU know their own boundaries, which is great; we cannot afford any form of memory cannibalism. Now we can introduce the next level of IPC synchronization — the Mailbox mechanism. This peripheral helps ARM and GPU communicate the exact shared memory address where the larger shared message structures can be read or written.

Mailbox

 

The basic idea is simple; one Mailbox is for reading and the other for writing! Mailbox-0 is read by the Cortex-A7 and written by the GPU; similarly, the GPU reads Mailbox-1, which in turn is written by the A7. These 32-bit Mailbox registers are ideal for communicating the shared SDRAM address meant for ARM-GPU message exchanges.

Mailboxes

The least significant four bits (nibble) of the 32-bit address communicated through the mailbox are reserved for identifying the Mailbox channel. The channel gives an indication of how to interpret the data structure at the transmitted address location. Note that each channel tends to have its own message structure; this implements a form of multiplexing over the same mailbox interface.
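As a small illustration, this nibble encoding can be captured with a few helper macros. This is just a sketch of the convention described above; the mask values simply follow from reserving the low four bits for the channel.

[code lang="cpp"]
/* Channel encoding helpers -- a sketch of the convention described
** above, where the low nibble of the 32-bit value carries the channel
** and the upper 28 bits carry the (16-byte aligned) buffer address. */
#define MBOX_CHANNEL_MASK   0xFu

#define MBOX_COMPOSE(addr, ch)  (((addr) & ~MBOX_CHANNEL_MASK) | \
                                 ((ch) & MBOX_CHANNEL_MASK))
#define MBOX_CHANNEL(val)       ((val) & MBOX_CHANNEL_MASK)
#define MBOX_ADDR(val)          ((val) & ~MBOX_CHANNEL_MASK)
[/code]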

More details regarding these channels can be read here. Note that reserving the last 4 bits for the channel number also means that the transmitted address always has to be aligned to a 16-byte boundary. But this is more or less insignificant, because if the ARM Cortex-A7 L1 cache is enabled, then to manage coherency we would anyway want buffers aligned to 64-byte cache line boundaries. Also, such shared buffers cached in L1 should be flushed before transmission and invalidated after they are written by the GPU. Of course, all the shared memory addresses exchanged have to be physical bus address values, not virtual ones; otherwise they would make no sense.

Mailbox Register Encoding

 

An example of a 32-bit address transmitted over Mailbox channel 8 is illustrated above. Note that this encoding of the channel within the least significant nibble is a software mechanism; to the hardware these are mere 32-bit values. Also, channel number 8, which communicates property tags, is the one we use here for illustrating the higher-level data structures of the IPC. Now for some grisly register map details.

[code lang="cpp"]

/* Register access helper */
#define REG(x) (*(volatile unsigned int *)(x))

/* Mailbox 0 base address (Read by ARM) */
#define MBOX0_BASE 0x3F00B880

/* Mailbox 1 base address (Read by GPU) */
#define MBOX1_BASE 0x3F00B8A0

/*
** Mailbox Controller Registers
*/
/* I/O register is at the base address */
#define MBOX_RDWR(x) REG((x))

/* Status register at the offset 0x18 from base */
#define MBOX_STATUS(x) REG((x) + 0x18)

/* Status register value when mailbox is full/empty */
#define MBOX_FULL 0x80000000
#define MBOX_EMPTY 0x40000000

/* Interrupt configuration register */
#define MBOX_CONFIG(x) REG((x) + 0x1C)

/* Configuration register mask to enable mailbox data IRQ */
#define MBOX_DATAIRQEN 0x00000001

[/code]

For a successful mailbox transaction we only need the above three hardware registers and the associated bit masks.

  1. MBOX_RDWR : I/O register to transfer the 32-bit value
  2. MBOX_STATUS : Status register to synchronize the communication
  3. MBOX_CONFIG: Configuration register to enable interrupt generation

Below we have a write sequence transmitting a 32-bit value from ARM to GPU :

[code lang="cpp"]

/* 1. Enable Mailbox0 interrupt to catch the GPU response,
this is optional if the read is going to happen in polled
mode. */
MBOX_CONFIG(MBOX0_BASE) |= MBOX_DATAIRQEN;

/* 2. Before writing, loop until there is space
available in Mailbox1. This step may be optional
if the code always waits for a full transaction to complete. */
while (MBOX_STATUS(MBOX1_BASE) & MBOX_FULL);

/* 3. Write the 32 bit address into the Mailbox1 I/O register */
MBOX_RDWR(MBOX1_BASE) = ui32_address | channel;
[/code]

Once ARM transmits the 32-bit value over Mailbox1, the next step is to wait for a response on Mailbox0, written by the GPU. Reading this 32-bit GPU response requires us to first wait for the interrupt and then read the MBOX_RDWR register.

[code lang="cpp"]

/* 1. Now that the interrupt has fired, disable it. */
MBOX_CONFIG(MBOX0_BASE) &= ~MBOX_DATAIRQEN;

/* 2. Read the Mailbox 0 I/O register */
ui32_value = MBOX_RDWR(MBOX0_BASE);

[/code]

Instead of an interrupt, it is also possible to poll the Mailbox status register: just loop until the MBOX_EMPTY bit is cleared.

[code lang="cpp"]

/* 1. Poll the status */
while (MBOX_STATUS(MBOX0_BASE) & MBOX_EMPTY);

/* 2. Read the Mailbox 0 I/O register */
ui32_value = MBOX_RDWR(MBOX0_BASE);

[/code]

Additionally, we should add a check to validate that the response is indeed for the expected channel.

[code lang="cpp"]

do {
/* 1. Poll the status */
while (MBOX_STATUS(MBOX0_BASE) & MBOX_EMPTY);

/* 2. Read the Mailbox 0 I/O register */
ui32_value = MBOX_RDWR(MBOX0_BASE);
} while ((ui32_value & 0xF) != channel);

[/code]

Now we have established a Mailbox-based synchronization mechanism for the two cores to communicate shared memory addresses.

Data Structure

The Mailbox merely enables the processors to transmit buffer pointers. The structure of the buffer contents depends on the channel. The detailed specification of the shared memory data structures used for Mailbox channel 8 is linked here. Below we have a top-level overview of that channel 8 packet:

[code lang="cpp"]

/* Packet HEADER */
struct header
{
ui32 packet_length; // size in bytes (including header and values )

/*
** Request codes:
** 0x00000000: process request
** All other values reserved
** Response codes:
** 0x80000000: request successful
** 0x80000001: error parsing request buffer
** (partial response)
*/
ui32 req_resp_code;
};

/* Packet CONTENTS */
struct tags
{
ui32 id; // id
ui32 sz; // value size in bytes
ui32 req_resp; // request/response code
ui32 value[BUF_LEN]; // value length varies for each tag
};

/* Channel 8 Packet */
struct packet
{
struct header pheader;
struct tags ptags[NUMS];
};

[/code]

Please note that the above illustration is merely to bring clarity to the high-level structure of this packet. In reality it is best to implement the packet generation using a ui32 array instead of the above structures.

The “struct packet” is essentially a header plus a sequence of tags; each tag is associated with a particular request like “Get VC memory” or “Get ARM memory”. In fact, these two tags were used to figure out the previously mentioned discrepancy of 64MB v/s 80MB in the ARM-GPU SDRAM split.

As mentioned in the link, these tag structures are populated with requests by ARM, and the GPU overwrites them with responses. For example, an “Allocate buffer” tag request makes the GPU return a framebuffer address. After that, a valid frame written to this location will also get displayed on an attached HDMI monitor. We can even have multiple frame buffers and switch across them using the “Set virtual offset” tag. The response time for this operation seems to be deterministic at around 300μs, which is good; otherwise we would end up with frame lags.
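To tie the pieces together, below is a minimal sketch of the “Get ARM memory” request laid out as a ui32 array, as suggested above. The tag ID and buffer layout are taken from the Mailbox Property Interface documentation in the references; mbox_write() and mbox_read() are assumed wrappers around the write and read sequences shown earlier, and the flat address cast glosses over the bus-address conversion and cache maintenance discussed above.

[code lang="cpp"]
/* "Get ARM memory" request on property channel 8, laid out as a plain
** ui32 array. The tag ID (0x00010005) and buffer layout follow the
** Mailbox Property Interface documentation in the references;
** mbox_write()/mbox_read() stand in for the sequences shown above. */

#define PROPERTY_CHANNEL 8

extern void mbox_write(unsigned int value);
extern unsigned int mbox_read(unsigned int channel);

static volatile unsigned int __attribute__((aligned(16))) msg[8] =
{
    8 * 4,          /* 0: total buffer size in bytes            */
    0x00000000,     /* 1: request code (process request)        */
    0x00010005,     /* 2: tag -- Get ARM memory                 */
    8,              /* 3: value buffer size in bytes            */
    0,              /* 4: tag request code                      */
    0,              /* 5: response -- base address (GPU fills)  */
    0,              /* 6: response -- size in bytes (GPU fills) */
    0               /* 7: end tag                               */
};

/* The cast assumes a flat 32-bit mapping; a real driver would convert
** to a bus address and handle the cache maintenance discussed above. */
void get_arm_memory(unsigned int *base, unsigned int *size)
{
    mbox_write((unsigned int)msg | PROPERTY_CHANNEL);
    mbox_read(PROPERTY_CHANNEL);

    *base = msg[5];
    *size = msg[6];
}
[/code]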

 

References:

  1. Linux Mailbox Driver
  2. Valvers : Graphics Basic
  3. GIT Hub GPU/Bootloader Repository
  4. Valvers : Bare Metal Programming in C Pt1
  5. Rpi2 config.txt documentation.
  6. Broadcom BCM2835 Peripheral Reference Manual
  7. Mailbox Property Interface 
  8. VideoCore® IV 3D Architecture Reference Guide
  9. BCM2835 Errata

 

Edit Distance

Edit distance is quite useful to illustrate how we can optimize backtracking via memoization. Here we have two words word1 and word2, and the problem is about finding the minimum number of operations required to convert word1 to word2. Note that only three operations are permitted on the word —  1) Insert a character  2) Delete a character 3) Replace a character.

For example, converting “tea” to “set” would require at least 2 operations.

  1. “tea” -> Replace ‘t’ with ‘s’
  2. “sea” -> Replace ‘a’ with ‘t’
  3. “set”

Calculating this minimum number of operations would mandate that we attempt all the three possible conversion methods at every relevant location. Essentially, whenever there is a character mismatch we attempt to either delete, insert or replace. The whole approach can be visualized as a tree.

Backtracking Tree

As you can see, the red connections are character delete operations, while green is replace and brown is insert. The algorithm basically runs a depth-first search on a tree like the one above. Once we reach a leaf node, the total number of operations required for that sequence is known. A recursive C function implementing the idea is given below.

[code lang="cpp"]
int ml(char* w1, int o1, int l1, char* w2, int o2, int l2)
{
/* od/oi default to INT_MAX (<limits.h>) so they are ignored by
MIN_INT() when the characters match and no delete/insert is tried */
int od = INT_MAX, oi = INT_MAX, os, r = 0;

/* End of the string, return the number of remaining characters
in the other string */
if (o1 == l1 || o2 == l2)
return (o1 == l1) ? l2 - o2 : l1 - o1;

/* Find minimum distance for strings at [o1..l1] & [o2..l2].
If characters are unequal, try insert and delete */
if (w1[o1] != w2[o2])
{
od = ml(w1, o1 + 1, l1, w2, o2, l2) + 1;
oi = ml(w1, o1, l1, w2, o2 + 1, l2) + 1;
r++;
}

/* Explore the case where character is replaced or
where they are equal. Finally, pick the minimum. */
os = ml(w1, o1 + 1, l1, w2, o2 + 1, l2) + r;
return MIN_INT(os, od, oi);
}

int minDistance(char *word1, char * word2)
{
int l1 = strlen(word1), l2 = strlen(word2), md = 0;
md = ml(word1, 0, l1, word2, 0, l2);
return md;
}
[/code]

Please note that the nodes illustrate only the remaining part of the string which needs to be evaluated. For example, after deletion of ‘t’, we have [“ea”, “set”]. It is also quite evident that there are several nodes in the above tree which contain the same pair of strings; for example, [“ea”, “et”] is repeated thrice. An obvious optimization would be to avoid repeatedly traversing these similar sub-trees by using memoization. Basically, we cache the result of the first traversal and reuse it during subsequent calls. Now with this optimization we have a much sparser tree structure.

Edit Distance Tree with Memoization

Now the nodes are non-repeating and the implementation with this optimization is given below. Note that the memoization array is indexed based on the combination of the string offsets. For example, the number of operations required to convert sub-string “a” (w1[2]) into “et” (w2[1]) can be stored and later accessed from location dp[2][1].

[code lang="cpp"]
void ml(char* w1, int o1, int l1, char* w2, int o2, int l2,
unsigned char *dp)
{
int offst = o1 * (l2 + 1) + o2, od = (o1 + 1) * (l2 + 1) + o2;
int oi = o1 * (l2 + 1) + o2 + 1, r = 0;

/* End of the string, return the number of remaining characters
in the other string */
if (o1 == l1 || o2 == l2)
dp[offst] = (o1 == l1) ? l2 - o2 : l1 - o1;

/* If the minimum distance for strings at [o1..l1] & [o2..l2] is
not already calculated, then figure out the same */
else if (dp[offst] == MAX_CHAR)
{
/* Characters are unequal, try insert and delete */
if (w1[o1] != w2[o2])
{
ml(w1, o1 + 1, l1, w2, o2, l2, dp);
ml(w1, o1, l1, w2, o2 + 1, l2, dp);
r++;
}

/* Explore the case where character is replaced or
where they are equal. Finally, pick the minimum. */
ml(w1, o1 + 1, l1, w2, o2 + 1, l2, dp);
dp[offst] = (r > 0) ? (MIN_INT(dp[od + 1] + r, dp[od] + r,
dp[oi] + r)) : dp[od + 1];
}
}

int minDistance(char *word1, char * word2)
{
int l1 = strlen(word1), l2 = strlen(word2), md = 0;
unsigned char *dp = malloc((l1 + 1) * (l2 + 1) * sizeof(unsigned char));

/* Validate */
if (!dp) return 0;

/* Initialize buffer */
memset(dp, MAX_CHAR, (l1 + 1) * (l2 + 1));

/* Get the minimum distance */
ml(word1, 0, l1, word2, 0, l2, dp);
md = dp[0];
free(dp);
return md;
}
[/code]
Full source code can be accessed at the Bitbucket link.

Substring with Concatenation of All Words

Bioinformatics could be one of those areas where problems like ‘substring with concatenation of all words’ might be relevant. Here we are actually looking for a sequence within a larger string, and this sequence has to be a concatenation of a given set of words, all of equal length. For example, for the string “wordgoodgoodgoodbestword” and the given set of words [“word”,”good”,”best”,”good”], the solution should return index [8]. Note that this concatenation of words can happen in any order. Even though the previous example had only one index, the solution should identify all instances of such sequences. For example, given “aaaaaaaa” and the word list [“aa”,”aa”,”aa”], the solution should return indices [0, 1, 2].

 

Substring Sequence

 

A brute force approach would involve scanning every offset of the string, looking for matching words from the list. So for a string of length n and a word list of size m containing words of length k, the complexity would be O(n x m x m x k). An improvement over the brute force approach would be to sort the word array; the time complexity then becomes O(n x m x Log(m) x k). So the primary problem is quickly identifying whether there is a matching word corresponding to the substring at some offset. A data structure like a trie seems ideal for such an operation.

 

Trie for [“word”,”good”,”best”,”good”]

A trie for the list of words [“word”,”good”,”best”,”good”] is illustrated above; we can use it to improve the speed of the sequential scan operation. From the data structure it is evident that the time complexity for a single word lookup is O(k). Note that such a tree also needs to handle duplicate words, hence the variable “count”, which keeps track of the number of instances of that word within the list. In the case of the above example, “good” happens to have a count equal to 2. Use of a trie can further reduce the overall time complexity to O(n x m x k). But the obvious trade-off is the overhead of creating this tree.

Structure of a trie node used in this algorithm is illustrated below.

[code language="cpp"]
/********************************/
/* Trie Node */
/********************************/
struct tnode
{
struct tnode *tc[TRIE_SZ]; // child nodes
int tcount; // total number of instances of a word
int count; // decremented during searching
struct tnode *next; // linked list node
};

[/code]

 

Each trie node can have up to 26 children; note that here we assume the comparison is not case sensitive, otherwise we would need 52. The variable ‘tcount’ records the total number of instances of a word in the list, and the variable ‘count’ is used for keeping track of the number of times a word is encountered in the main string. It is initialized with the same value as ‘tcount’ but decremented during the scan whenever the associated word is found in the main string.

Consider scanning the string “wordgoodgoodgoodbestword”: the ‘count’ for the word “good” would be decremented to 1 as soon as it is encountered for the first time, at offset 4. When ‘count’ reaches zero, it means that a valid sub-string cannot contain any more instances of that word. Here, as soon as the algorithm encounters the third instance of the word “good”, it realizes that the sub-string starting at 0 is invalid, because even though “good” is present in the list, its ‘count’ is now zero. So now we want to restart the scanning from the next offset, 1.

Before restarting this scan we also need to revert the variable ‘count’ to ‘tcount’. This means we have to keep track of the nodes decremented during the last scan. Here we can utilize the ‘next’ pointer field.

 

Trie Next Pointer

 

During the previously failed scan, which started at offset 0, the algorithm would have used the next pointer to generate a linked list of encountered words. The above example would generate a link from “word” to “good”. Please note that during the scan the word “good” was encountered after “word”, hence the link in that direction. Now we can run through this linked list and quickly reset the counters which were decremented. Obviously, this linked list is dismantled as soon as the reset is done; it is merely an optimization to quickly revert the trie to its original pristine state.

Code used for creating and searching a trie is given below:

[code language="cpp"]
/***********************************************************************/
/* add: Add a new string */
/* */
/***********************************************************************/
struct tnode *add(struct tnode *node, char *str, struct mstack *m)
{
int index = CHAR_TO_OFFST(*str);

/* If the node does not exist, then allocate the same */
if (!node)
{
/* Allocate */
node = m_alloc(m, sizeof(struct tnode));
if (!node)
return NULL;
}

/* Continue populating the trie */
if (*str != '\0')
node->tc[index] = add(node->tc[index], str + 1, m);

/* If this is the last character, then set the count */
else
node->tcount = ++node->count;

/* Return the node */
return node;
}

/***********************************************************************/
/* search: Search for a string */
/* */
/***********************************************************************/
struct tnode *search(struct tnode *node, char *str)
{
int index = CHAR_TO_OFFST(*str);

/* If this is the last character of the string,
then return node */
if (*str == '\0')
{
if (node->tcount > 0)
return node;
else
return NULL;
}

/* Ensure the character is valid and it’s present in the trie */
if (node->tc[index] == NULL)
return NULL;

/* Continue the search */
return search(node->tc[index], str + 1);
}

/***********************************************************************/
/* create_trie: Create a trie */
/* */
/***********************************************************************/
struct tnode *create_trie(char **ch, int wlen, struct mstack *m)
{
struct tnode *r = NULL;
int i;

/* Allocate the trie */
for (i = 0; i < wlen; ++i)
{
/* If the root is not set, then initialize the same */
if (!r)
r = add(NULL, ch[i], m);

/* Else continue the allocation */
else if (!add(r, ch[i], m))
return NULL;
}

/* Return the root */
return r;
}
[/code]

 

The function which implements the main logic is given below.

[code language="cpp"]
/***********************************************************************/
/* You are given a string, s, and a list of words, words, that are all */
/* of the same length. Find all starting indices of substring(s) in s */
/* that is a concatenation of each word in words exactly once and */
/* without any intervening characters. */
/* For example, given: */
/* s: "barfoothefoobarman" */
/* words: ["foo", "bar"] */
/* */
/* You should return the indices: [0,9]. */
/* (order does not matter). */
/* */
/***********************************************************************/
int* findSubstring(char* s, char** words, int wordsSize, int *returnSize)
{
struct tnode *root = NULL, *head = NULL, *node = NULL;
int wlen, slen, i, j, tw = 0, wsize, s_end, arr_off = 0;
int *ret_arr = NULL;
struct mstack m = {NULL};
struct tnode **m_arr = NULL;
struct tnode dummy;

/* Maintain Sanity */
if (!s || !words || !wordsSize || !returnSize)
return NULL;
*returnSize = 0;

/* Calculate the input array size */
wlen = strlen(words[0]);
slen = strlen(s);

/* Initialize variables */
wsize = wlen * wordsSize;
s_end = slen - wsize;

/* Initialize dummy to zero */
dummy.count = dummy.tcount = 0;

/* Allocate a trie for the array */
if (s_end >= 0)
{
/* Allocate memory for simple allocator */
if (!m_stack_alloc(&m, SEGMENT_COUNT, SEGMENT_SIZE))
goto subs_exit;

/* Memoization Array */
m_arr = calloc(slen, sizeof(struct tnode *));
if (!m_arr)
goto subs_exit;

/* Create trie */
root = create_trie(words, wordsSize, &m);
if (!root)
goto subs_exit;
}

/* Loop as long as there is a possibility for a match */
for (i = 0; i <= s_end; ++i)
{
/* Loop checking whether the substring at this location is a
concatenation of the words */
for (j = i; j <= slen - wlen; j += wlen)
{
char c = s[j + wlen];
struct tnode *tn = m_arr[j];

/* If there is no hit, then search the trie */
if (!tn)
{
/* Create a NULL terminating condition */
s[j + wlen] = '\0';

/* If the word is not found then, set the value to
dummy */
if ((tn = search(root, &s[j])) == NULL)
tn = &dummy;

/* Replace the character */
s[j + wlen] = c;

/* Assign the pointer to the memoization array */
m_arr[j] = tn;
}

/* If it’s not found, then break */
if (!tn->count)
break;

/* Decrement the count */
--tn->count;

/* Initiate the linked list head */
if (!head)
node = head = tn;

/* Add the new node only if it’s not already a part of the
list */
else if ((!tn->next) && (node != tn))
{
node->next = tn;
node = tn;
}

/* Increment the hit count */
tw++;

/* If all the words were found, then break*/
if (tw == wordsSize)
{
/* Make the necessary dynamic allocation */
if ((!ret_arr) && ((ret_arr = malloc(sizeof(int) * slen)) == NULL))
goto subs_exit;

/* Save the offset, increment the index and break */
ret_arr[arr_off] = i;
arr_off++;
break;
}
}

/* Reset the list */
if (head)
{
reset_list(head);
head = NULL;
tw = 0;
}
}

/* Free the trie, memoization array, assign the return size */
subs_exit:
m_stack_free(&m);
if (m_arr)
free(m_arr);
*returnSize = arr_off;
return ret_arr;
}
[/code]

As you can see, the algorithm also implements a memoization-related optimization. Other dynamic memory allocation changes and the rest of the code can be accessed from the Bitbucket repo.

Largest Rectangle in Histogram

Largest rectangle in a histogram is an interesting puzzle and can be solved in several different ways. One of the most popular, an O(n) method, involves the use of a stack; another employs memoization to improve on the brute force approach; and there is also a slightly slower O(nLog(n)) method employing a segment tree. Evidently this problem is quite useful for illustrating some data structure and programming concepts.

The diagram below illustrates such a histogram, represented by an array with heights {2, 1, 5, 6, 2, 3}.

Largest Rectangle in Histogram

The Problem

We are essentially seeking the largest rectangle within a histogram. The two variables determining the area of such a rectangle are the minimum height among the included bars and the total number of bars. For example, in the above histogram the maximum area is formed by the two bars at offsets 2 and 3, with a minimum height equal to 5, so 5 x 2 = 10.

Area = Minimum_Height x Number of bars

The search for that elusive maximum area requires us to scan all the possible rectangles in the histogram. This can happen either by picking a bar and then seeking the maximum range where its own height is the minimum, or by picking a range and then finding the minimum height. In other words, we pick a value for one variable in the above equation and then proceed to find a valid measurement for the other one.

O(n) Approach

If we sequentially pick each element and then locate the first smaller bar to its left and right, then we have the width of the rectangle corresponding to each height/bar. In this manner we can eventually figure out the largest rectangle. A brute force O(n^2) approach would simply involve two nested loops: an outer loop to sequentially pick each element and an inner loop to scan the values to the left and right of that selected bar.

Maximum area measurements for each bar

The above diagram illustrates the basic idea behind this algorithm. Performance would eventually depend on how quickly we can identify this width corresponding to each rectangle. We can explore how the usage of stack and memoization can essentially improve the time taken for this scan.

Stack: The idea is to push array elements into a stack as long as the present top of the stack is shorter than the element being pushed. If the stack top is larger than the element, then the present element provides the right-hand-side offset for the width, so we can now calculate the area of the rectangle with the top element as the height. Note that the left offset can be identified by checking the stack element right below the top. After calculating the area, pop the stack, and repeat this operation as long as the top of the stack is larger than the present element. So at any instant the stack contains only those elements whose left offset is identified but not the right.

Stack Illustration 1

 

Stack Illustration 2

The above diagrams illustrate how the stack operations would scan the array { 2, 1, 5, 6, 2, 3}. As you can see, space complexity would also be O(n).

[code language="cpp"]
/* Push the first element into the stack */
spush(&s, &a[0]);
b = a[0];

/* Scan all the elements */
while (i < n)
{
/* The bar is higher than stack top, push
into the stack */
if (a[i] >= b)
{
spush(&s, &a[i]);
b = a[i];
}

/* Else we need to calculate the area spanned by the
bar on the top of the stack. */
else
{
/* Peek while the stack top is less than the present
bar */
while ((speek(&s, &offst, &b) == 0) && (b > a[i]))
{
int loffst, ht;

/* Pop and then peek into the stack to get the
left offset */
spop (&s, &offst, &b);
speek(&s, &loffst, &ht);

/* Update the max rectangle */
if (max_rectangle < (b * ((i - loffst - 1))))
max_rectangle = b * (i - loffst - 1);
}
/* Push the bar into the stack */
spush(&s, &a[i]);
b = a[i];
}
/* Move to the next bar */
++i;
}

/* Pop the remaining bars */
while (spop(&s, &offst, &b) == 0)
{
int loffst, ht;

/* Update the area */
speek(&s, &loffst, &ht);
if (max_rectangle < (b * (i - loffst - 1)))
max_rectangle = b * (i - loffst - 1);
}
/* Return maximum area */
free(s.s);
return max_rectangle;
[/code]

The relevant piece of code is provided above; for more details please refer to the source code repository link.

Memoization: A brute force approach to this problem would require us to scan the left and right sub-arrays for each element, looking for the relevant shorter height. For example, the bar with height 1 in the above example would force scanning of the whole array. In that sense, if all the bars were of equal height, then the brute force algorithm would always force n^2 comparisons. Memoization involves caching the already scanned information for subsequent use, and that approach seems like an ideal candidate here.

Memoization

The above diagram illustrates how the relevant smaller right offset for each element can be recorded by scanning the array in the reverse order. Please note that the curved arrow pointers signify comparisons within the inner loop. The crucial aspect is that by using the already recorded offsets we can leapfrog certain locations which are not relevant for that element.

For example, with this new method, generation of the offset values up to location 2 (having value 5) does not see any reduction in the number of comparisons. But the calculation of the right offset for value 1 at array location 1 required only about half the comparisons because of the already cached offset values. How? We know that the value 5 is greater than 1, and we also know the offset of the value which is smaller than 5 (i.e., 2 at location 4). So, we can conveniently jump to location 4 and check whether the value at that offset (i.e., 2) is less than 1; the search ends if it is smaller, else we merely jump to the next recorded offset (i.e., 6, the end of the list).

Similarly the smaller left offset values can also be generated by scanning the array, but this time in the order from the start. With the allocated additional space for memoization, we manage to short-circuit the comparison sequence. Once we have the left and right offsets for all the elements, we can simply run through the array and calculate the maximum area.

[code language="cpp"]
/* First element has no left smaller element, last element has no
right smaller */
smaller_left[0] = -1;
smaller_right[n - 1] = n;
for (i = 1; i < n; ++i)
{
/* Best case */
if (a[i] > a[i - 1])
smaller_left[i] = i - 1;

/* Else scan the array */
else
{
int x = i - 1;
/* Look for the next smaller element towards the left */
while ((x >= 0) && (a[x] >= a[i]))
x = smaller_left[x];

/* Found the offset for the next smaller element */
smaller_left[i] = x;
}
}

/* Generate the offsets for smaller elements to the right */
for (i = n - 2; i >= 0; --i)
{
/* Best case */
if (a[i + 1] < a[i])
smaller_right[i] = i + 1;

/* Else scan the array */
else
{
int x = i + 1;

/* Loop as long as the values at the offsets are greater
than a[i] */
while ((x < n) && (a[x] >= a[i]))
x = smaller_right[x];

/* Smaller to the right */
smaller_right[i] = x;
}
}

/* O(n) scan to get the maximum area */
for (i = 0; i < n; i++)
{
int mi = a[i] * (smaller_right[i] - smaller_left[i] - 1);
if (max_rectangle < mi)
max_rectangle = mi;
}

/* Return maximum area */
return max_rectangle;
[/code]

The essential piece of code is given above; for more details please refer to the source code repository link.

O(nLog(n))

The third method is slightly more elaborate: it involves a segment tree. Since this data structure can be used to quickly find the minimum value within a sub-array, we can use that property to implement a divide and conquer algorithm. In other words, armed with a segment tree hammer, this problem starts to look like a nail. The point is that by using a segment tree, the minimum height within any range can be found in logarithmic time. So here our approach involves dividing the histogram while calculating the minimum height within each such division.

So first we start with the whole histogram: basically we pick the maximum width and then find the corresponding minimum height. Then we divide the histogram at the offset of the bar with this minimum height and recursively repeat the same operation on each sub-division. The idea is illustrated in the diagram below.

 

Divide and Conquer

[code language="cpp"]
int seg_tree(int *a, int *arr, int s, int e, int len)
{
int min, lmax, rmax, mmax;

/* Recursion termination */
if (s > e)
return 0;

/* Get the minimum value offset */
min = search_sg_tree(arr, a, 0, s, e, 0, len - 1);

/* Calculate the area with the minimum value within this offset */
mmax = a[min] * (e - s + 1);

/* Go to the left subtree */
lmax = seg_tree(a, arr, s, min - 1, len);

/* Go to the right subtree */
rmax = seg_tree(a, arr, min + 1, e, len);

/* Return the max */
return (((mmax > lmax) ? mmax : lmax) > rmax) ?
((mmax > lmax) ? mmax : lmax) : rmax;
}
[/code]

As you can see, the depth of the recursive calls in the above divide and conquer method eventually depends on how the array actually gets divided during each call. If all the values are equal, it leads to the worst-case scenario where the call stack depth is equivalent to n. In other words, the call stack memory usage complexity is O(n).

The segment tree used for the above algorithm is given below:

Segment Tree

 

We can implement such a segment tree using an array data structure; the offsets of the left and right children can be easily generated using the formulas (Parent_Offset * 2 + 1) and (Parent_Offset * 2 + 2), quite like a heap. The figure below illustrates the above segment tree when it is structured like an array.

Segment Tree Array

As you can see, some of the locations remain unused. A tree of depth 4 can hold up to 16 elements, but this array requires only 11 locations. Generation of a segment tree is also quite similar to how an array would be converted to a binary tree. Basically, we recursively generate the sub-trees while allowing the smaller child to rise to the top. Within such a tree, the smallest sub-trees are the leaf nodes, which hold the array values, and that also determines the terminating condition for the recursion.

[code language="cpp"]
/***********************************************************************/
/* create_sg_tree: Create a segment tree from an array */
/* */
/***********************************************************************/
int create_sg_tree(int *arr, int soff, int s, int e, int *a, int len)
{
int mid;

/* We are at the leaf node */
if (s == e)
return s;

/* Calculate the middle of the array */
mid = s + ((e - s + 1) / 2) - 1;

/* Set the left subtree root node */
arr[(soff * 2) + 1] = create_sg_tree(arr, (soff * 2) + 1, s,
mid, a, len);

/* Set the right subtree root node */
arr[(soff * 2) + 2] = create_sg_tree(arr, (soff * 2) + 2,
mid + 1, e, a, len);

/* Allow the smaller value to rise to the top */
return a[arr[(soff * 2) + 1]] > a[arr[(soff * 2) + 2]] ?
arr[(soff * 2) + 2] : arr[(soff * 2) + 1];
}
[/code]

For more details please refer to the source code repository.

Generation of the segment tree is O(n). We then sequentially pick each of the n elements and find the smallest within the relevant range with O(Log(n)) complexity per query, giving an overall O(nLog(n)).

DMA Architecture

DMA hardware is integral to most ARM SoC designs. It is extensively used for driving various peripheral controllers related to flash memory, LCD displays, USB, SD, etc. Here we elaborate on the functioning of a typical ARM SoC DMA controller and briefly delve into its internal mechanisms. The intent is to explain the underlying architecture of a typical DMA module and then eventually tie those elementary structural features to ARM’s PrimeCell® DMA Controller, which might initially come across as a more sophisticated piece of hardware.

Direct memory access is simply meant to enable the processor to offload memory transfer operations. In this manner a system is able to dedicate its valuable computational resources to more apt purposes. So, in fact, we can state that a DMA controller is like a processor module that is capable of only executing load-store operations, while a full-blown pipelined processor core possesses several other computational features and hence should rarely be employed for rudimentary data transfers.

Eventually, DMA is about transferring data across different hardware modules. Depending on the use case, the DMA source and destination might vary, and depending on the hardware, the specifics of the bus protocol employed for load-store also vary. But just like a processor core, a DMA controller core has to be compatible with the hardware bus protocol features related to burst mode, flow control, etc.; otherwise it simply won’t be able to move data around.

 

Fig 1 : DMA v/s Processor

 

So, a DMA controller can simply be interpreted as a rudimentary logical block meant to implement data transfers. Similar to a processor core, this IP block plugs into the larger data bus matrix involving SRAM and other peripherals. While a processor executes its object code from instruction memory, a DMA controller executes its operations by loading transfer configurations from a RAM segment. Such a configuration is usually termed a descriptor. This recognizable structure describes all the details of the transfer: source/destination, burst size, transfer length etc. Such a configuration table has to be placed in RAM by the DMA driver software running on the main processor, and the memory location of this descriptor table is usually communicated to the DMA controller via a shared register.

Fig 2 : Descriptor
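To make the idea concrete, the sketch below shows what such a descriptor might look like; every field name and width here is an assumption for illustration, since each controller defines its own layout.

[code language="cpp"]
#include <stdint.h>

/* Hypothetical DMA transfer descriptor. The layout is purely illustrative;
 * a real controller documents its own field order, widths and flag bits. */
struct dma_descriptor {
    uint32_t src_addr;    /* physical source address                       */
    uint32_t dst_addr;    /* physical destination address                  */
    uint32_t length;      /* total number of bytes to transfer             */
    uint32_t burst_size;  /* bus burst size used for each beat             */
    uint32_t flags;       /* e.g. interrupt-on-completion, increment modes */
    uint32_t next_desc;   /* physical address of the next descriptor, or 0 */
};
[/code]

The driver populates one or more of these structures in RAM and then programs their base address into a shared register, which is the cue for the controller to fetch and execute the transfer.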

A DMA controller is essentially a co-processor assisting the main core with use cases involving heavy data transfers; populating an LCD frame buffer or transferring contents to an SD card are typical DMA intensive features. Such a controller usually supports multiple DMA channels enabling simultaneous transfers; for example, a controller with two or more channels should be able to drive an LCD and an SD card at the same time.

An interrupt signal is usually employed to synchronize these DMA transfers with the main processor operations: once the requested data transfer is completed, the DMA hardware asserts an interrupt, which is then acknowledged by the processor.

Fig 3 : DMA Interrupt
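The driver side of that hand-shake usually amounts to pending on a completion object which the interrupt handler signals. The sketch below assumes a hypothetical RTOS semaphore API (rtos_sem_give/rtos_sem_take) and hypothetical dma_start/dma_clear_irq helpers; it only illustrates the pattern, not any particular SoC.

[code language="cpp"]
/* All APIs below (rtos_sem_*, dma_*) are hypothetical placeholders used
 * only to illustrate the interrupt based synchronization pattern. */
struct rtos_sem;
void rtos_sem_give(struct rtos_sem *s);    /* signal from ISR context */
void rtos_sem_take(struct rtos_sem *s);    /* block until signalled   */
void dma_start(const struct dma_descriptor *desc);
void dma_clear_irq(void);

static struct rtos_sem *dma_done;

/* Interrupt handler: acknowledge the controller and wake the waiting task */
void dma_irq_handler(void)
{
    dma_clear_irq();
    rtos_sem_give(dma_done);
}

/* Blocking transfer: kick off the descriptor, then sleep until the ISR fires */
void dma_transfer_blocking(const struct dma_descriptor *desc)
{
    dma_start(desc);
    rtos_sem_take(dma_done);
}
[/code]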

 

Usually, the contents of a transfer descriptor mirror the DMA controller register structures; these in-memory descriptor contents are loaded more or less verbatim into the actual hardware registers by the DMA controller itself. For example, a descriptor includes entries for the source and destination addresses, which correspond directly to the controller's source and destination address registers. The DMA controller block also contains a separate core logic which reads these registers and implements the actual memory transfers, as illustrated in Fig 4 below.

Fig 4 : Generic DMA architecture

The transfer descriptor structures written to memory by the processor are loaded into the controller registers by the DMA hardware and then parsed and executed by its core load/store transfer logic. Naturally, any errors within these memory resident descriptors are also propagated to the core transfer logic via these registers. Essentially, the register set communicates all the relevant transfer configuration details (source/destination, burst size, transfer length etc.) to the DMA load/store engine, while the actual circuit implementing the load-store loop is already present within the DMA core.

Every hardware module has a primary interfacing mechanism: simple peripherals use registers while complex processors employ binary object code. As elaborated above, a typical DMA controller's interfacing mechanism involves a descriptor table reflecting its register set and transfer capabilities. PrimeCell is a DMA IP block from ARM, but unlike a typical DMA controller, its primary interfacing mechanism is not a register set or descriptor table; it is an instruction set, very much resembling that of a simple processor.

Fig 5 : PrimeCell

PrimeCell integrates a rudimentary processor core which understands a limited instruction set, mostly meant for implementing load-store operations and synchronization mechanisms related to the data bus and processor interrupts. So here, instead of a descriptor table, we have object code describing the transfer: essentially a binary program implementing memcpy, written in the assembly instructions understood by the PrimeCell DMA processor. Such a program also specifies the source/destination addresses, burst size etc., but it is genuine binary object code rather than a passive configuration table.

Fig 6 : PrimeCell Architecture

This DMA processor core is also multi-threaded, with one transfer channel associated with each thread. These threads possess the same scheduling priority, and the system implements a round-robin scheme. The internal FIFO is utilized as a buffer during DMA transfers and it's shared across all the channels. Each channel is allowed access to the whole FIFO memory, but of course, at any instant, FIFO utilization across all the running channels can never exceed its total size. Such a flexible shared memory design adds to software complexity but it can also lead to a more productive hardware resource allocation pattern.

Finally, the PrimeCell IP block also includes an instruction cache to optimize object code access, and it provides scope for instruction abort handling and watchdog exceptions to avoid lockups. In other words, the module almost resembles a full blown multitasking processor. Note, though, that such transfer related or hardware related exception triggers are present in a typical DMA controller too. So it seems that with the PrimeCell IP a software engineer is exposed to underlying machinery which already exists in other typical DMA hardware but is usually hidden behind the veil of register set abstractions. As always, nothing comes without a trade-off: the more complicated aspect of managing this module is the object code generation, which turns an otherwise relatively simple DMA driver into a crude compiler.

Linux Storage Cache

Linux Content Index

File System Architecture – Part I
File System Architecture– Part II
File System Write
Buffer Cache
Linux File Write 

Every system use case involves a combination of software modules talking to each other. But an optimal system is more than a set of efficient modules. The overall architectural arrangement of these modules and their data/control interfaces also impacts performance. In that sense, several useful design methodologies can be observed by examining the storage subsystem within Linux.

Fig 1 – Application to VFS

Each abstraction layer has its own purpose. For example, VFS exposes a uniform interface to user applications, while the file system actually implements the structural representation of files on the device, and how each storage sector is accessed ultimately depends on the block driver's interactions with the hardware. Productivity depends on how quickly the system can service user requests; in other words, the response time is critical. The logic implemented within each of the involved modules could be efficient, but if they are merely stacked up vertically the result is only as good as a primitive processor with no internal caches or pipelined execution.

*********************************

Inode-Dentry Cache

Bookkeeping information associated with each file is represented by two kernel structures: Inodes and Dentries. In simple terms, these constructs hold the information necessary to identify and locate the storage blocks associated with a file. A multitasking environment might have numerous processes accessing the same volume or even the same file. Caching this information at higher levels can reduce contention on the corresponding lower layers. If repeated file accesses are served from the VFS cache, the file system is left free to quickly service new requests. Consider a use case where multiple processes look up files in different dedicated directories: without a cache, the contention on the file system scales with the number of processes, but with a good cache mechanism only brand new requests end up accessing the file system module. The idea is quite like a fully associative hierarchical hardware cache.

Fig 2 – Inode/Dentry Cache
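As a rough user-space illustration (timings and the chosen path are machine dependent, and this is not how the kernel cache is formally measured), repeatedly resolving the same path is dominated by dentry/inode cache hits after the first lookup:

[code language="cpp"]
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Toy measurement: after the first lookup warms the dentry/inode cache,
 * the remaining stat() calls resolve the path without troubling the
 * file system module or the storage device. */
int main(void)
{
    const char *path = "/etc/hostname";   /* any existing file will do */
    struct stat st;
    struct timespec t0, t1;
    long us;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 100000; i++)
        stat(path, &st);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000L;
    printf("100000 warm lookups took %ld us\n", us);
    return 0;
}
[/code]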

*********************************

Data Cache

Inodes and Dentries are accessed by the kernel only to locate the file contents, so the above cache is just part of the solution; the data cache handles the rest (see Fig 3 below). Here too we want frequently accessed file data to be cached, so that only new writes or new reads end up contending for the file system module's attention. Note that inside the kernel the pages associated with a file are arranged as a radix tree indexed by file offset, while the dentry cache is managed by a hash table, and each such data cache is associated with the corresponding Inode.

The cache mechanism also influences program execution speed, since it is closely related to how efficiently the binary sections of an executable are managed within RAM; for example, the read-ahead heuristics determine how often the system endures an executable page miss. Another useful illustration of the utility of a cache is a program accessing files from the same volume its executable is paging from: without a cache, the contention on the file system module would be tremendous.

Fig 3 – Full Cache

*********************************

Cache Management

Cache management involves its own complexities. How these buffers are filled, flushed, invalidated or marked dirty depends on the file system module. The Linux kernel merely invokes the callbacks registered by the file system and expects the associated cache buffers to be handled accordingly. For example, during a file overwrite the file system should mark the overwritten cached page as dirty, otherwise the flush may never be invoked for that page.

We can also say that the Linux VFS abstracts how data is represented on the storage device from how it is viewed by the kernel on a running system. In this manner the role of the file system is limited to its core functionality: mapping file accesses to actual storage blocks, and translating between its internal representation of files and the generic VFS Inode, Dentry and data cache constructs. How a file is created, deleted or modified, and how the associated data cache is managed, depends solely on the file system module.

For example, creating a new file involves invoking the file system callbacks registered via the kernel structure “inode_operations”: the lookup callback registered there is used to avoid file name duplication, and the create call actually allocates the inode for the new file. An fopen invokes a “file_operations” callback, and similarly a data read/write involves the “address_space_operations”.

How the registered file system callbacks manage the inode, dentry and page cache buffers depends on the file system's own logic. So there is a kind of loop: the VFS calls the file system module's callbacks, and those in turn call kernel helper functions to handle specific cache buffer operations like clean, flush, write-back etc. When the file system callbacks finally return control to the VFS, the associated caches should already be configured and ready for use. A skeletal example of this callback wiring is sketched after Fig 4 below.

Fig 4 – Inode/Page Cache maps
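For orientation, the skeleton below shows how a hypothetical “myfs” module might wire up these callbacks. The member names follow the VFS structures mentioned above, but the myfs_* handlers are placeholders and the exact callback prototypes differ across kernel versions, so treat this as a sketch rather than buildable driver code.

[code language="cpp"]
#include <linux/fs.h>

/* Hypothetical "myfs" handlers; their prototypes must match the kernel
 * version being targeted, which is why only the wiring is shown here. */
static const struct inode_operations myfs_dir_inode_ops = {
    .lookup = myfs_lookup,   /* resolve a name, guards against duplicates */
    .create = myfs_create,   /* allocate and initialize a new inode       */
    .unlink = myfs_unlink,   /* remove a file                             */
};

static const struct address_space_operations myfs_aops = {
    .write_begin = myfs_write_begin, /* reserve the page for the copy     */
    .write_end   = myfs_write_end,   /* mark the copied page dirty        */
    .writepage   = myfs_writepage,   /* flusher driven write-back         */
};

static const struct file_operations myfs_file_ops = {
    .open  = myfs_open,
    .fsync = myfs_fsync,
};
[/code]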

*********************************

Load Store Analogy

A generic Linux file system can be described as a module which directly services the kernel caches, implementing functions for creating, invalidating or flushing various cache entries, and in this manner only indirectly caters to the POSIX file operations. The methodology is analogous to how a RISC load-store architecture differs from an 8051-like CISC design: separating slow memory access from register operations is the core attribute of a load-store machine. Here the Linux kernel ensures that operations on the cache can happen in parallel with an ongoing load or store involving the file system and driver modules; the obvious trade-offs are complexity and memory consumption.

*********************************

Pipelines

Caches are effective when paired with read-ahead mechanisms and background flusher tasks. A pipelined architecture involves as much parallel execution as possible: cache pages are filled in advance by reader threads while dirty entries are constantly flushed in the background by worker threads. These sector I/O requests are asynchronously queued and processed by the block driver task, which eventually DMAs them to the interfaced device. After a write the corresponding cache entries are marked clean, and similarly after a read the entries are marked up-to-date. The pipeline stages can be seen in Fig 5 below.

Fig 5 – Kernel Storage Pipeline

System organization evolves over time as various use cases are analyzed for performance bottlenecks and the arrangement itself is revisited to remove the identified issues. Mature, generic software architectures adapt effectively to all the intended use cases while remaining within various resource constraints. For a general purpose operating system like Linux, the cache infrastructure is a good illustration of how abstract software components can seamlessly coordinate to handle widely varying user requests.

Emergent Order of Computer Science and Engineering

Why software APIs and hardware protocols are two expressions of the same modular system

When people speak of computer science and computer engineering, they often imagine two separate territories: one abstract, algorithmic, and mathematical; the other tangible, electrical, and physical. But step back, and a different picture emerges. Both fields are simply building blocks in a larger, self-organizing order — where modules define clear functionality, interfaces enforce shared rules, and integration yields systems greater than the sum of their parts.

A POSIX system call and a PCI bus handshake may look worlds apart, but at their essence, they are both contracts. They define how one part of a system communicates with another, regardless of the implementation beneath. This lens — seeing software and hardware as parallel expressions of the same principle — reveals not a divide, but an emergent order that underpins all of computer science and engineering.

 

“A System” → The Modular Lens

Computer engineering could be perceived as a process of discovering various productive arrangements of functional modules (or functional blocks). Such an abstract module can be expressed as a self-contained unit with two key qualities:

  • Functionality — What it does.
  • Interface rules — How inputs are furnished and outputs channeled.

This principle is universal, whether it's the components of an IoT device talking to a cloud storage service, an Android phone multicasting audio to Alexa, or a graphics pipeline inside a compute cluster powering AI. The principle comes to life differently in the software and hardware views of the world, but the modular lens lets us see their symmetry: the same qualitative attributes of functionality and interface rules are integral to both paradigms, as illustrated by the diagram below.

Figure 1: Software and hardware both emerge as networks of functional blocks bound by contracts (APIs, protocols).

The distinction between software and hardware is merely one of methodology; both are conceptually an integration of abstract modules communicating via shared rules to achieve a larger purpose.

 

“Software Hardware Architecture” → Functions and Contracts

Software is a network of modules linked by recognizable data structures. The output of one module often becomes the input of another, creating layered stacks of functionality.

At the higher levels, interfaces may be standardized — for example, POSIX APIs in operating systems. But internally, each module may have its own contracts. Even at the assembly level, the object code is an interface: once decoded, the hardware ALU or Load/Store unit is signaled to act. Eventually the behavior of a generic processor unit will depend on the sequence and also the content of the application’s object code.

Hardware design mirrors this approach. High-level IP blocks interconnect through bus protocols like AMBA, PCI, or USB. Each functional block samples recognizable input patterns, processes them, and emits outputs across agreed-upon channels. For example, a DMA unit connected to multiple RAM types must support multiple port interfaces, each with its own protocol. A unified abstract representation of such a functional block is attempted below.

Figure 2: A unified view of functional blocks across software and hardware.

Functional Block → The Atom

At a highly abstract level, computer science is either about designing such functional blocks or about integrating them into coherent wholes. An electronic product can thus be perceived as an organization of abstract functional modules, each communicating via shared interface rules; together they deliver complex use cases valuable to the end user.

Human cognition, when operating at scale, recognizes objects by their abstract high level qualities, not their gritty details. But details matter when developing these functional blocks. So any process for engineering a complex system at scale must harness specialized, detailed knowledge dispersed across many individuals and the functional modules they implement.

For instance, an application engineer cannot be expected to comprehend the file system's representation of data on the disk, and similarly a middleware engineer can afford to be ignorant of device driver read/write protocols as long as the driver module plays by documented interface rules. An integration engineer needs to know only the abstract functionality of the modules and their interface rules to combine them into a product.

Thus, by lowering the knowledge barrier, we reduce cost and time to market. The challenge of implementing functional blocks then lies in balancing abstraction with performance: too much modular generality slows a system; too little makes it rigid and fragile.

 

“Open v/s Closed Source” → Impact of the Extended Order

The open-source framework accentuates the advantages of the modular construction described above. While proprietary systems evolve only among a set of known collaborators, open source leverages a global extended order, enabling contributions from both known and unknown individuals; from the development perspective it harnesses the expertise of a far larger group.

Market economics is about finding the most productive employment of time and resources; in our case it is about discovering all the possible uses of an abstract functional module. The lower barrier to knowledge within the open-source market accelerates this discovery process, and it leverages the modular structure to coordinate dispersed expertise. In other words, depending on their individual expertise, anyone can integrate new functional blocks or improve and tailor existing ones. For instance, a generic Linux kernel driver might end up on a server, a TV or a smartphone depending on how that module is combined with the rest of the system.

Figure 3: Open vs Closed ecosystems — cohesion vs disjointed growth.

The above Venn diagrams illustrate how the nature of an order can influence the development, cohesion and organization of these functional blocks.

“Universal Epistemological Problem” → The Knowledge Challenge

What emerges from these modular interactions is not merely technology, but an order — a living system shaped by countless contracts, shared rules, and dispersed expertise.

This is the emergent order of computer science and engineering: a subset of the larger economic order, subject to the same knowledge problem Friedrich Hayek famously described. No single mind can master it—yet through modularity, openness, and shared rules, it flourishes.

Linux File System Write

Linux Content Index

File System Architecture – Part I
File System Architecture– Part II
File System Write
Buffer Cache
Storage Cache

Linux file write methodology is specific to the particular file system. For example, JFFS2 syncs data within the user process context itself (inside the write_end call), while file systems like UBIFS, LogFS, Ext and FAT tend to create dirty pages and let the background flusher task manage the sync.

Each methodology has its own advantages, but the second method leads the Linux kernel to exhibit certain quirks which can drastically impact write throughput measurements.

Linux File Write

The above diagram gives an overview of the design; there are in fact two main attributes to this architecture:

COPY : User space data is copied into a kernel page associated with the file and marked as dirty via the “write_begin” and “write_end” file system callbacks, which are invoked for every “write” system call made by the user process. So while they are indeed kernel mode functions, they execute on behalf of the user space process, which simply waits on the system call. The sequence is illustrated below:

    write_begin()  /* Informs FS about the number of bytes being written */
    mem_copy()     /* Copies the data into the kernel page cache         */
    write_end()    /* Marks the copied kernel page dirty and updates the */
                   /* FS internal meta information                       */

&

Write-Back : The Linux kernel spawns a flusher task to write these dirty pages to storage. For that, the flusher task typically invokes the file system specific “writepage” or “writepages” callback.

Some Nitty-Gritty Details

The callbacks mentioned above are specific to the file system and are registered during its initialization with the kernel. You might have noticed that the above design almost resembles a classic producer-consumer scenario, but there are certain nuances:

Lower Threshold: The flusher task (consumer) is spawned only when a certain percentage of the page cache becomes dirty, so there is indeed a lower threshold (configured via /proc/sys/vm/dirty_background_ratio) for spawning this task. Note that it can also be invoked after an internal time-out, so even a small number of dirty pages is not kept in volatile memory for long.

Higher Threshold: Once a high percentage of the page cache has been dirtied by a user process (producer), that process is forced to sleep so that the kernel does not run out of memory. The process is not released from this involuntary sleep until the flusher task catches up and brings the dirty page level back within acceptable limits. In this way the kernel pushes back on any process with a massive capacity to write data. This higher threshold is specified by writing to /proc/sys/vm/dirty_ratio.

So, between the lower and upper thresholds the system tends to run two concurrent tasks: one that dirties pages with content and another that cleans them up with a write-back. If too many dirty pages are being created, Linux may also spawn multiple flusher tasks, a design which should scale well on SMP architectures with multiple storage paths.
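Both knobs are plain proc files, so they can be inspected (or tuned, by writing to the same paths as root, or via sysctl vm.dirty_background_ratio / vm.dirty_ratio). A minimal reader is sketched below; error handling is deliberately thin.

[code language="cpp"]
#include <stdio.h>

/* Read one integer percentage from a /proc/sys/vm file; returns -1 on error */
static int read_ratio(const char *path)
{
    FILE *fp = fopen(path, "r");
    int val = -1;

    if (fp) {
        if (fscanf(fp, "%d", &val) != 1)
            val = -1;
        fclose(fp);
    }
    return val;
}

int main(void)
{
    printf("dirty_background_ratio : %d%%\n",
           read_ratio("/proc/sys/vm/dirty_background_ratio"));
    printf("dirty_ratio            : %d%%\n",
           read_ratio("/proc/sys/vm/dirty_ratio"));
    return 0;
}
[/code]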

Some code level details bring more clarity. The critical aspect of the above mechanism is implemented in balance_dirty_pages (page-writeback.c), excerpted below, which is invoked after the “write_end” call inside the kernel.

The function audits the number of dirty pages in the system and then decides whether the user process should be stalled.

static void balance_dirty_pages(struct address_space *mapping,
                                unsigned long write_chunk)
{
    int pause = 1;

    ...

    for (;;) {
        ...
        /* Checks the dirty page status and breaks out of the loop
         * only if the dirty page ratio has gone down */
        ...

        io_schedule_timeout(pause);

        /*
         * Increase the delay for each loop, up to our previous
         * default of taking a 100ms nap.
         */
        pause <<= 1;
        if (pause > HZ / 10)
            pause = HZ / 10;
    }
}

The argument ‘pause’ to io_schedule_timeout specifies the sleep time in jiffies. The big for (;;) loop rechecks the dirty page ratio after each sleep cycle, and if there is no ample reduction the sleep time is doubled until it reaches 100 ms (HZ / 10), the maximum time the ‘producer’ process can be forced to sleep in one iteration. This algorithm is executed during file writes to ensure that no rogue process runs amok consuming the entire page cache.

The Gist

The upper and lower thresholds mentioned above carry considerable significance; if they are not fine tuned for the system's specific use cases, RAM utilization will be far from optimal. The right values depend on the RAM size plus how many processes concurrently initiate file system writes, how big those writes are, the file sizes being written, the typical time intervals between successive writes, and so on.

A file system mounted with syncs off is meant to use the page cache for buffering. Optimal write speeds require the upper threshold to be large enough to buffer the full file while allowing seamless concurrent write-backs; otherwise the kernel response time for user process writes will be erratic at best.

Another critical aspect is that writepage should interfere minimally with write_begin and write_end; only then can the flusher task's write-backs proceed with the least friction against the user process constantly invoking the latter two functions. If they share some resource, the user process and the flusher task will eventually end up waiting on each other, and the expected quick response of an asynchronous, buffered page cache write will never materialize.

SD Card Speed Class

All new SDCARDs conform to one of the speed classes defined by the specification; the class is usually stamped on top of the card, and it is relevant for multiple reasons.

  1. It defines the speed at which you can clock the card: 50 MHz for Class 10, otherwise 25 MHz.
  2. It also alludes to the RU size; higher class cards pack more internal buffering.
  3. Finally, its primary purpose is to convey the minimum guaranteed write speed, Class 10 being the fastest.


SD Bus Clock

High clocking capability implies a better internal circuit and a higher bill of materials; Class 10 cards are said to be capable of the high speed (50 MHz) mode. At least that is what the spec claims, but the scriptures may not always reflect reality. Running an intensive I/O test showed that, in practice, even a 44 MHz clock tends to choke most cards; two out of the four popular makes exhibited inconsistencies. Even though a card's CSD register advertised 50 MHz, the card's general behavior tended to be intolerant of intensive use cases at higher clocks. Sometimes the CSD register claims 50 MHz compliance even before switching to high speed mode (via CMD6), which seems contrary to the spec, and acting on that claim bricks or crashes the card. The root cause of this high speed non-compliance is not clear to me, but the fact that certain cards are more tolerant than others should be kept in mind while testing the system.

Buffering

Internal SDCARD buffers expedite I/O by implementing a producer-consumer like scenario. Class 4 cards tend to have an internal buffer (RU) size of 32K, for Class 6 it goes up to 64K, and for Class 10 a whopping 512K; not surprisingly, card prices follow a similar pattern, running from $5 to $10 and to around $30 for the high speed cards. Optimal use of a card means writing in multiples of the RU size so that the card buffers are utilized to the maximum: writing much more than the RU size per transaction makes the host wait unnecessarily for the card to flush data to internal flash, while transmitting very small packets only keeps the SDCARD firmware unusually idle. Performance depends on the depth of the pipeline and on the maximum utilization of the available computational units within it; with an SDCARD too, we need to keep the host and the card simultaneously engaged, which eventually leads to better throughput.
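As a sketch of the above advice, the loop below pushes a buffer out in RU-sized chunks; RU_SIZE is an assumed value that should be matched to the card class, and the standard C file API stands in for whatever block interface the host actually uses.

[code language="cpp"]
#include <stdio.h>

#define RU_SIZE (64 * 1024)   /* assumed RU size, e.g. a Class 6 card */

/* Write len bytes in RU-sized chunks; returns the number of bytes written */
static size_t write_in_ru_chunks(FILE *fp, const unsigned char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t chunk = len - done;

        if (chunk > RU_SIZE)
            chunk = RU_SIZE;
        if (fwrite(buf + done, 1, chunk, fp) != chunk)
            break;              /* I/O error, stop early */
        done += chunk;
    }
    return done;
}
[/code]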

Write Speed

The time taken to write 128 MB of data was sampled over a dozen cards: the Class 10 PNY card was the fastest at 17 seconds, while the slowest was the Transcend Class 4 card at 88 seconds. The graph below shows how the write time falls as the cost, RU size and speed class of the SDCARDs go up. The embedded host used for this exercise was an Embedded Artists EA 3250 board, which runs a 266 MHz NXP controller.

Speed Class Time


Inference

The crisp class differentiation emphasized by the spec gets a tad obscured by the time it trickles down into actual products, which vary in behavior and performance both across and within the same class. The challenge with SDCARD host productization is primarily one of interoperability testing; here we have considered only a dozen makes popular in the US, and once the China market is accounted for there will undoubtedly be little semblance of sanity left in the above graph.

More on this topic – An account on SDCARD internals.