System Optimization: From Feature Phones to Data Lakes

Introduction

What motivates system optimization? Often, it is the arrival of new customer demands: a higher throughput requirement, a tighter latency target, or a new use case that pushes a system beyond its original specifications. In short, the need for optimization begins when an existing system is unable to adapt to new expectations. The task can then be split into two parts: first, to comprehend the current design constraints; and second, to determine what it would take to adapt.

Step 1 : Comprehension – The Why

Why is the current system unable to satisfy the performance specification? The first step is to analyze the design through the lens of requirements — to understand how various use cases impact resources such as CPU cycles, storage bandwidth, and memory footprint. Such analysis provides insights and hypotheses about possible systemic imbalances. By modeling use cases and capturing performance metrics, these hypotheses can be validated or refined. Depending on the scale and complexity of the system, isolating the bottleneck may require several iterations. Still, measurement is vital to confirming assumptions and deriving solutions.

Step 2 : Solutions – The How

The details of the bottleneck identified through measurements form the input to the next step — identifying a solution. Captured data should highlight which resources are heavily utilized (i.e., expensive) and which remain relatively idle. The goal is to target these imbalances at the architectural, algorithmic, or design level. In other words, every subsystem should be efficiently utilized throughout the execution of a use case. This balancing act typically involves either reducing contention or introducing greater concurrency. Contention can be reduced by optimizing algorithms, adding caches, batching operations, or improving prioritization. Concurrency, on the other hand, often requires reworking the pipeline.

Practicing What We Preach

The following real-world industry examples illustrate how this methodology can be applied in practice. They also highlight the challenges that arise when theory meets practice in projects of different scales and complexities.

Asynchronous Raw NAND Writes

Similar to how branch prediction and multi-issue execution keep CPU pipelines busy, concurrency can be applied at every layer of a computing system to ensure efficient resource usage. One such example came from a NAND flash file system throughput improvement, where the customer demanded a 50-60% improvement in write throughput. The existing design was simple and effective, until Samsung determined it was inadequate for their next generation of feature phones.

Step 1 : Original Design

As shown below (Fig 1), all components executed sequentially.

Fig 1 : Existing Stack

In a resource-constrained RTOS environment, this simplicity had its benefits. But with a new performance specification, this became the limitation.

End-to-end latency was around 400μs, of which 250μs was spent by the CPU waiting on NAND programming. In effect, the CPU — running the application and file system — was idle the majority of the time. Measurements identified the bottleneck: Step 1 completed.

Step 2 : Improved Design – Concurrency

Fig 2 below shows the new architecture. Since the bottleneck was obvious, the solution was to introduce concurrency between the CPU and raw NAND by making the lowest-level programming interface asynchronous. Alternative solutions, such as asynchronous I/O at the file system or application level, would have been more expensive and likely much less effective, because they can all create resource contention further down the stack.

Fig 2 : Improved Stack with concurrency between CPU and NAND

A Cache

Introducing a simple cache allowed NAND programming to be decoupled from the rest of the sequence. The file system now simply updates the cache, while the actual programming of the NAND happens in the background. This allowed both the CPU and the NAND to execute concurrently.

  • Throughput improvement: >70%
  • Trade-offs: additional memory usage and an acceptable increase in system complexity
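The decoupled write path can be sketched as a write-behind cache. This is an illustrative Python sketch, not the shipped RTOS code; `fake_nand_program` is a hypothetical stand-in for the real blocking driver call.

```python
import queue
import threading
import time

class AsyncNandWriter:
    """Write-behind cache: the file system path only appends to an
    in-memory cache; a background worker performs the slow NAND
    program operation, so CPU and NAND run concurrently."""

    def __init__(self, nand_program):
        self._nand_program = nand_program        # slow, blocking driver call
        self._cache = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, page_addr, data):
        # Fast path: just update the cache and return to the caller.
        self._cache.put((page_addr, data))

    def flush(self):
        # Block until every cached page has been programmed.
        self._cache.join()

    def _drain(self):
        while True:
            page_addr, data = self._cache.get()
            self._nand_program(page_addr, data)  # overlaps with CPU work
            self._cache.task_done()

# Usage: the CPU keeps issuing writes while programming proceeds
# in the background; flush() provides a synchronization point.
programmed = []
def fake_nand_program(addr, data):
    time.sleep(0.001)                            # stand-in for NAND busy time
    programmed.append(addr)

w = AsyncNandWriter(fake_nand_program)
for addr in range(8):
    w.write(addr, b"page")
w.flush()
```

The real implementation also had to bound the cache memory and handle bad-block management asynchronously, which this sketch deliberately omits.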

While conceptually simple, the implementation within a memory-constrained RTOS raised interesting challenges. Cache memory management had to be carefully tuned for efficiency, and NAND bad block management had to be performed asynchronously. These intricacies led to an interesting debate on patentability and an eventual grant.

The Verdict

This example illustrates how the two-step methodology delivered results, and even intellectual property. The next example applies the same principles within a more complex environment, with similar results.

Linux File System Optimization

The requirement here was to maximize streaming read/write throughput for a proprietary file system — ideally outperforming native solutions like Ext4 and UBIFS. This project required running a benchmark such as IOZone and iteratively isolating bottlenecks across the storage stack. The simplified architecture diagram (Fig 3) highlights the critical paths uncovered by measurement.

Step 1 : Design Resource Contentions

Fig 3 : Existing Stack with two levels of contention

Two major bottlenecks were identified through design review and measurements:

  1. File System Contention — Multiple applications accessing the file system introduced latencies in the tens of microseconds.

  2. Block Driver Contention — Concurrent access from the file system and page cache to the block device caused latencies up to the order of single-digit seconds.

The page cache is essentially a data cache from the file system user's perspective, while the block driver is the storage device management layer that abstracts disk I/O functions.

Step 2: Caches

File System Contention Resolution

During streaming I/O, the primary role of a file system in Linux is to map file offsets to storage block addresses. Concurrent access from multiple processes to the file system for this mapping operation became a source of contention. Optimizing the file system mapping algorithm itself would have been expensive, so the solution was to prefetch and cache the mapping data at the higher VFS layer.

To minimize overhead, prefetching was triggered only when streaming I/O was detected — essentially repeated reads or writes to contiguous offsets. This cache effectively reduced latency from tens of microseconds to ~5 μs on average.
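A minimal sketch of the streaming-detection idea, assuming a hypothetical `slow_lookup` in place of the real file system mapping call; the threshold and prefetch depth are illustrative values, not the tuned production ones.

```python
class MappingPrefetchCache:
    """Cache file-offset -> block-address mappings at a VFS-like layer;
    prefetch ahead once a streaming (sequential) pattern is detected."""

    STREAM_THRESHOLD = 3    # consecutive sequential accesses => streaming
    PREFETCH_DEPTH = 8      # mappings to prefetch once streaming

    def __init__(self, lookup_mapping):
        self._lookup = lookup_mapping            # expensive FS mapping call
        self._cache = {}
        self._last_offset = None
        self._run = 0

    def map(self, offset):
        # Track how long the current sequential run is.
        if self._last_offset is not None and offset == self._last_offset + 1:
            self._run += 1
        else:
            self._run = 0
        self._last_offset = offset

        if offset not in self._cache:
            self._cache[offset] = self._lookup(offset)

        # Streaming detected: prefetch the mappings expected next.
        if self._run >= self.STREAM_THRESHOLD:
            for o in range(offset + 1, offset + 1 + self.PREFETCH_DEPTH):
                if o not in self._cache:
                    self._cache[o] = self._lookup(o)
        return self._cache[offset]

calls = []
def slow_lookup(offset):
    calls.append(offset)
    return 1000 + offset        # pretend block address

cache = MappingPrefetchCache(slow_lookup)
blocks = [cache.map(o) for o in range(16)]
# Each mapping is looked up at most once; most hits come from prefetch.
```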

Block Driver Contention Resolution

Several strategies could have resolved this: optimizing the file system or application, increasing page cache memory, or adding dedicated storage paths for these relatively independent access streams. But the solution employed was much simpler and more targeted:

  1. Increase the in-memory cache for file system metadata.
  2. Disable automatic file system metadata syncs, and instead rely on a background sync that executes only when the file system is idle.

Fig 4 : Improved Stack

Now, during peak I/O activity, the block driver was practically dedicated to servicing page cache requests, while metadata syncs were deferred to idle time windows. This combination of hierarchical caching and careful prioritization resolved the block driver bottleneck.
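The deferred-sync policy can be sketched as follows; this is an illustrative model with an explicit clock parameter, not the production implementation, and the idle window value is hypothetical.

```python
class DeferredMetadataSync:
    """Keep metadata updates dirty in RAM; sync to storage only after
    the file system has been idle for `idle_window` time units."""

    def __init__(self, sync_to_storage, idle_window=5):
        self._sync = sync_to_storage
        self._idle_window = idle_window
        self._dirty = {}
        self._last_io = 0

    def metadata_update(self, key, value, now):
        self._dirty[key] = value    # cheap in-memory update only
        self._last_io = now

    def idle_tick(self, now):
        # Called periodically from a background/idle task.
        if self._dirty and now - self._last_io >= self._idle_window:
            self._sync(dict(self._dirty))
            self._dirty.clear()

synced = []
fs = DeferredMetadataSync(synced.append, idle_window=5)
fs.metadata_update("inode42", "size=4096", now=0)
fs.idle_tick(now=3)     # still within the busy window: no sync
fs.metadata_update("inode42", "size=8192", now=3)
fs.idle_tick(now=9)     # idle long enough: one batched sync
```

Note how repeated updates to the same metadata collapse into a single batched write, which is exactly why the block driver stays free during peak I/O.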

The trade-offs were cache management overhead, an increased memory footprint, and some risk of data loss due to the deferred background syncs. The file system itself was power-fail safe, so it was guaranteed to work across power failures, but file contents written after the last metadata sync could be lost. All of these were acceptable from the customer's use case perspective.

The Verdict

Identifying the bottlenecks was possible only by layering metrics across the stack, from the application level down to the block driver, then executing customer workloads, and then analyzing correlated spikes across time.

The solutions were then targeted to resolve specific resource contentions with minimal impact on other use cases. Even within a relatively complex stack like the Linux kernel, the two-step methodology of comprehension followed by problem solving was effective.

While the Linux FS project tackled performance inside a single device, the next example scales up to cloud infrastructure, where data pipelines process gigabytes of metrics across distributed systems.

AWS Data Pipeline

The requirement here was to collect gigabytes of metrics from a test network, then process and monitor them for key performance indicators. These metrics were generated by the verification target — a cloud storage transport protocol (SRD) which was still under development.  The project goal was to design a system to extract, transform, and load end-to-end test metrics from this storage stack.

Because the storage stack under test was experimental, the requirements for data transformation were open-ended. Debugging relied on running SQL queries and looking for correlations in collected metrics to isolate bottlenecks. The expectation was to support a wide range of queries with “reasonable” latency, despite little clarity on the specific use cases.

Open-Ended Requirements

The experimental nature of this project imposed three constraints:

  1. Quick project execution due to tight timelines.
  2. A simple, adaptable architecture to handle unknown future use cases.
  3. Low-latency access to stored data for interactive debugging.

Step 1 : Data Lake and S3 Measurements

After evaluating multiple solutions, AWS S3, an unstructured data store, was chosen as the database backend. Compared to traditional databases, it offered lower maintenance overhead and schema flexibility. Although a simple object store, when used to save structured files like JSON or CSV it can act as a database backend for services like AWS Athena and Redshift Spectrum. The drawback was performance: throughput fell below 1 MB/sec when accessing large numbers of small files. Measurements identified this as the primary bottleneck.

Step 2 : Partitions and ETL Chains

Partitioning

Partitioning was the first aspect of the solution. As AWS notes:

By partitioning your data, you can restrict the amount of data scanned by downstream analytics engines, thereby improving performance and reducing the cost for queries. 

Files corresponding to a single day's worth of data are placed under a prefix such as
s3://my_bucket/logs/year=2022/month=06/day=01/. We can use a WHERE clause to query the data as follows: "SELECT * FROM table WHERE year=2022 AND month=06 AND day=01"
The preceding query reads only the data inside the partition folder year=2022/month=06/day=01 instead of scanning the files under all partitions.

With partitioning and by limiting SQL query ranges, access complexity dropped from O(n) to O(log n) (see Fig 5 below).

Fig 5 : O(log n) partitioned access
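The effect of partition pruning can be illustrated with plain prefix filtering over simulated object keys; the bucket layout follows the example above, but the file names and counts here are hypothetical.

```python
# Simulated S3 listing: object keys partitioned by year/month/day.
keys = [
    f"logs/year=2022/month={m:02d}/day={d:02d}/metrics.csv"
    for m in (5, 6) for d in (1, 2, 3)
]

def query_partition(keys, year, month, day):
    """Scan only the objects under the matching partition prefix --
    the effect of a WHERE clause on partition columns."""
    prefix = f"logs/year={year}/month={month:02d}/day={day:02d}/"
    return [k for k in keys if k.startswith(prefix)]

scanned = query_partition(keys, 2022, 6, 1)
# Only 1 of the 6 objects is read, instead of scanning every partition.
```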

Data Lake ETL Chains

Initially, data was stored with a year/month/day partition. It was later extracted and transformed using AWS Athena, and the filtered output was redirected in CSV form into another partitioned S3 bucket for further queries and debugging.

For example :

Output of a query like "SELECT * FROM table WHERE year=2023 AND month=06 AND day=01 AND server_type='storage'" was redirected to the location s3://my_bucket/logs/server_type=storage/

This created chained and tailored datasets for detailed analysis — in the above case, storage server metrics.

These ETL chains were developed as requirements emerged, which allowed a simple, scalable organization of large-scale data.
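One link of such a chain can be sketched in plain Python, simulating the S3 buckets as in-memory dictionaries rather than calling Athena; the column names and metric values are illustrative, not from the real dataset.

```python
import csv
import io

# Source "bucket": a day-partitioned CSV object (contents illustrative).
source = {
    "logs/year=2023/month=06/day=01/metrics.csv":
        "server_type,latency_us\nstorage,120\ncompute,80\nstorage,95\n",
}

def etl_filter(source, dest, predicate, new_partition):
    """Read rows from the source objects, keep those matching
    `predicate`, and write them under a new partition prefix."""
    rows = []
    for body in source.values():
        rows += [r for r in csv.DictReader(io.StringIO(body)) if predicate(r)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    dest[f"logs/{new_partition}/metrics.csv"] = out.getvalue()

dest = {}
etl_filter(source, dest,
           predicate=lambda row: row["server_type"] == "storage",
           new_partition="server_type=storage")
# dest now holds a tailored dataset containing only storage-server rows.
```

In the real pipeline this filter step was an Athena query whose CSV output landed in the destination bucket, but the shape of the chain is the same.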

The Verdict

The two-step methodology applied here was:

  1. Identify — Small file access on S3 was measured and confirmed as the bottleneck.

  2. Resolve — Partitioning and ETL chaining were introduced to limit query scope and improve throughput.

Despite relying on a low-cost S3 backend, the system achieved acceptable query latencies for debugging, while remaining flexible enough to handle evolving requirements. This combination of simplicity and scalability allowed gigabytes of test data to be managed efficiently under aggressive development timelines.

This example proved the two-step methodology works at gigabyte scale; the next illustration dials it down to the kilobyte level and shows how the same principles can work on a wearable embedded device.

Noisy Neighbor

The problem here was data loss on shared diagnostic channels running on a resource-constrained wearable device (in this case, the now discontinued Amazon Halo). Diagnostics in this case involved logs, metrics, and other critical information required to assess the general health of the device. Several components shared these channels, and it was difficult to pinpoint the "noisy neighbor" causing the drops. The usual solution of enforcing a per-component quota would have been too expensive in terms of complexity and overhead.

Following the two-step method, the first step was to instrument metrics to monitor bandwidth usage and gain a clear understanding of the problem.

Step 1 : Instrument Usage Metrics

Instrumented metrics tracked:

  • Minimum/maximum/average bandwidth utilization
  • Dropped byte count

The first three metrics monitored overall trends, while dropped bytes highlighted actual data loss. Plotting dropped bytes across time allowed easy identification of the relevant time windows.

Fig 6 : Dropped bytes

Correlation & Cross-correlation

Focusing on logs or metrics captured during the dropped-bytes intervals, like 12:05 or 12:20, helped narrow down:

  • What was happening (the active use case)

  • Who was involved (the specific component)

For example: correlating a spike at 12:05 with actual logs pointed to eMMC or FAT32 as likely culprits. Investigating the code path related to eMMC error handling helped isolate the reason further.

  • [12:01][1233][INFO][FAT32]File opened
  • [12:02][1234][INFO][FAT32]Reading LBA 20
  • [12:02][1235][INFO][eMMC]Device read initiated
  • [12:03][1236][ERROR][eMMC] Error interrupt
  • [12:03][1237][ERROR][eMMC] CRC error detected
  • [12:03][1238][ERROR][eMMC] Unrecoverable error returned
  • [12:07][1249][ERROR][FAT32] Directory entry read failed

When multiple diagnostic channels exist — for example, logs and metrics — cross-correlation can be even more powerful. By examining related metrics during a log drop window (or vice versa), you can triangulate the root cause with higher confidence.
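The correlation step can be sketched as counting which components were active inside the drop windows; the log lines, timestamps, and window choices here are illustrative rather than real captures.

```python
import re
from collections import Counter

logs = [
    "[12:01][1233][INFO][FAT32]File opened",
    "[12:02][1235][INFO][eMMC]Device read initiated",
    "[12:05][1240][ERROR][eMMC]CRC error detected",
    "[12:05][1241][ERROR][eMMC]Unrecoverable error returned",
    "[12:20][1301][ERROR][FAT32]Directory entry read failed",
]

drop_windows = {"12:05", "12:20"}    # times where dropped_bytes spiked

def suspects(logs, drop_windows):
    """Rank components by how active they were inside drop windows."""
    pattern = re.compile(r"\[(\d\d:\d\d)\]\[\d+\]\[\w+\]\[(\w+)\]")
    hits = Counter()
    for line in logs:
        m = pattern.match(line)
        if m and m.group(1) in drop_windows:
            hits[m.group(2)] += 1
    return hits.most_common()

ranked = suspects(logs, drop_windows)
# eMMC tops the list, pointing the investigation at its error paths.
```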

Step 2 : Resolution through Prioritization

Once the noisy culprits were identified, the responsibility fell on the component or use case designer to prioritize and reduce the usage.

A final question remains: why is the dropped_bytes metric itself never dropped?

  • First, it is a low-frequency, aggregated metric.
  • Second, it is reported through a low-priority idle task, which ensures bandwidth is always available for it.

Unlike timing-sensitive inline metrics such as errors or warnings, these aggregate values can safely be given a lower priority. This design guarantees accurate reporting of overall system health without impacting inline, real-time diagnostics.
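A sketch of this split between a cheap inline counter and idle-time reporting; the channel is simulated with a list, and the class and byte counts are hypothetical.

```python
class DroppedBytesMetric:
    """Inline hot path only increments a counter; the aggregated
    report is emitted later from a low-priority idle task."""

    def __init__(self, channel_send):
        self._send = channel_send
        self._dropped = 0

    def on_drop(self, nbytes):
        # Inline path: O(1) increment, no channel bandwidth consumed.
        self._dropped += nbytes

    def idle_task(self):
        # Runs only when the system is otherwise idle, so the
        # diagnostic channel always has bandwidth for this report.
        if self._dropped:
            self._send(f"dropped_bytes={self._dropped}")
            self._dropped = 0

reports = []
metric = DroppedBytesMetric(reports.append)
metric.on_drop(128)
metric.on_drop(64)
metric.idle_task()      # one aggregated report instead of two inline sends
```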

The Verdict

Once again, the two-step methodology proved effective:

  1. Identify by measuring and correlating metrics to isolate the noisy neighbor.

  2. Resolve by fixing the noisy element and by introducing prioritization within the telemetry.

While this example applied noisy-neighbor analysis within an embedded device, the next illustration explains how the same principles can be applied at a slightly bigger scale, for memory optimization.

Dynamic Memory Allocation

The goal here was to reduce the Linux DRAM footprint on Amazon Alexa devices. As expected, initial measurements indicated the kernel footprint was relatively small compared to user space. With approximately 200 runtime processes, getting the overall measurements itself seemed like a complex task.

Step 1 : The Haystack – Capturing data

With a high number of concurrent processes, the measurement itself demanded a methodical approach. The approach used was a combination of the Linux pmap tool and Python scripting.

PMAP and Python Pandas

Linux provides a useful tool named pmap, which prints the memory map of a process. With 200 processes, there will be 200 such maps, each with detailed information like size, permissions, and segment type (e.g., library, stack, or [anon] for dynamically allocated memory).

Example below:

12345:   /usr/bin/my_application
Address           Kbytes     RSS   Dirty Mode   Mapping
0000555a297b6000      12       8       0 r-x--  /usr/bin/my_application
0000555a297b9000       4       4       4 r----  /usr/bin/my_application
0000555a297ba000       4       4       4 rw---  /usr/bin/my_application
0000555a297bb000      64      64      64 rw---  [ anon ]
00007f9c87d46000    1864    1024       0 r-x--  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f18000     512      64       0 r----  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f98000      16      16      16 rw---  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87f9c000      20      20      20 rw---  [ anon ]
00007f9c87fa1000     144     144       0 r-x--  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc5000       8       8       8 r----  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc7000       4       4       4 rw---  /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007f9c87fc8000       4       4       4 rw---  [ anon ]
00007f9c87fca000    8192    8192    8192 rw---  [ anon ]
00007ffc12345000     132      12       0 rw---  [ stack ]

The output of pmap roughly resembles a space-delimited CSV, so some simple scripting could clean it up, transform it into comma-delimited form, and load it into a Pandas DataFrame.

A union of 200 such tables, one per process, provided a unified view of the whole system. SQL-style queries to slice, filter, and aggregate then provided different insights into the system-wide memory distribution.
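The flow can be sketched with the standard library; the real project loaded the same rows into a Pandas DataFrame and used 200 per-process tables. The rows below are truncated from the sample output above, and the aggregation mirrors a groupby on mapping type.

```python
from collections import defaultdict

# A few rows in pmap style (truncated from the sample output).
pmap_lines = """\
0000555a297b6000      12       8       0 r-x--  /usr/bin/my_application
0000555a297bb000      64      64      64 rw---  [ anon ]
00007f9c87d46000    1864    1024       0 r-x--  /usr/lib/x86_64-linux-gnu/libc-2.31.so
00007f9c87fca000    8192    8192    8192 rw---  [ anon ]
00007ffc12345000     132      12       0 rw---  [ stack ]
""".splitlines()

def parse_pmap(lines, pid):
    """Turn pmap rows into records -- one table per process."""
    rows = []
    for line in lines:
        # The mapping name may contain spaces, so cap the split count.
        addr, kbytes, rss, dirty, mode, mapping = line.split(None, 5)
        rows.append({"pid": pid, "rss_kb": int(rss), "mapping": mapping})
    return rows

# "Union" of per-process tables (here just one process), then
# aggregate RSS by mapping type -- the equivalent of a groupby.
table = parse_pmap(pmap_lines, pid=12345)
rss_by_mapping = defaultdict(int)
for row in table:
    rss_by_mapping[row["mapping"]] += row["rss_kb"]
# [ anon ] dominates, so dynamic allocation is the place to dig.
```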

Step 2 : Finding The Needle

Grouping rows in the unified table by memory map type indicated [anon] (i.e., dynamic allocation) as the bottleneck. Studying the output statistics of the Android native memory allocator showed some internal fragmentation. Diving deeper into the available memory configuration led to the discovery of the MALLOC_SVELTE option, which, when set to true, enables a low-memory configuration.

MALLOC_SVELTE := true

As with the usual CPU/memory trade-offs, enabling this option can impact allocation latency. But it is also expected to improve memory utilization and is meant for memory-constrained systems. Measurements indicated a ~30% reduction in memory footprint, with some latency hit for a subset of products.

The Verdict

Measurements led to identifying the exact bottleneck, and with that, a clear, targeted solution with the highest return on investment was discovered. This one-line configuration change reduced memory by roughly a third, successfully reconfiguring the system for a memory-constrained environment. In this particular case, the measurement itself was the challenge, and it required a cross-disciplinary approach: applying big data concepts to Linux system optimization.

The Finale

Whether it's an extinct feature phone or a modern data lake, the same two-step methodology delivers results. It works because it mirrors the scientific method: observe, hypothesize, test, refine. Applying it to real-world systems demands some creativity and a degree of patience. And even when it doesn't fully deliver results, it leaves us with defensible measurements that explain why, and a clearer picture of where to look next.

Emergent Order of Computer Science and Engineering

Why software APIs and hardware protocols are two expressions of the same modular system

When people speak of computer science and computer engineering, they often imagine two separate territories: one abstract, algorithmic, and mathematical; the other tangible, electrical, and physical. But step back, and a different picture emerges. Both fields are simply building blocks in a larger, self-organizing order — where modules define clear functionality, interfaces enforce shared rules, and integration yields systems greater than the sum of their parts.

A POSIX system call and a PCI bus handshake may look worlds apart, but at their essence, they are both contracts. They define how one part of a system communicates with another, regardless of the implementation beneath. This lens — seeing software and hardware as parallel expressions of the same principle — reveals not a divide, but an emergent order that underpins all of computer science and engineering.

 

“A System” → The Modular Lens

Computer engineering could be perceived as a process of discovering various productive arrangements of functional modules (or functional blocks). Such an abstract module can be expressed as a self-contained unit with two key qualities:

  • Functionality — What it does.
  • Interface rules — How inputs are furnished and outputs channeled.

This principle is universal, whether it's the components in an IoT device talking to a cloud storage service, an Android phone multicasting audio to Alexa, or a graphics pipeline inside a compute cluster powering AI. The same principle comes to life differently in the software and hardware views of the world, but the modular lens lets us see their symmetry. The same qualitative attributes of functionality and interface rules are integral to both the software and hardware paradigms, as the diagram below illustrates.

Figure 1: Software and hardware both emerge as networks of functional blocks bound by contracts (APIs, protocols).

The distinction between software and hardware is merely one of methodology; both are conceptually an integration of abstract modules communicating via shared rules to achieve a larger purpose.

 

“Software Hardware Architecture” → Functions and Contracts

Software is a network of modules linked by recognizable data structures. The output of one module often becomes the input of another, creating layered stacks of functionality.

At the higher levels, interfaces may be standardized — for example, POSIX APIs in operating systems. But internally, each module may have its own contracts. Even at the assembly level, the object code is an interface: once decoded, it signals the hardware ALU or load/store unit to act. Eventually, the behavior of a generic processor depends on both the sequence and the content of the application's object code.

Hardware design mirrors this approach. High-level IP blocks interconnect through bus protocols like AMBA, PCI, or USB. Each functional block samples recognizable input patterns, processes them, and emits outputs across agreed-upon channels. For example, a DMA unit connected to multiple RAM types must support multiple port interfaces, each with its own protocol. A unified abstract representation of such a functional block is attempted below.

Figure 2: A unified view of functional blocks across software and hardware.

Functional Block → The Atom

At a highly abstract level, computer science is about either designing such functional blocks or integrating them into coherent wholes. Eventually, an electronic product can be perceived as an organization of abstract functional modules, each communicating via shared interface rules; together, they deliver complex use cases valuable to the end user.

At scale, human cognition recognizes objects by their abstract, high-level qualities, not their gritty details. But details matter when developing these abstract functional blocks. So a process for engineering a complex system at scale needs to harness specialized, detailed knowledge dispersed across many individuals and the functional modules they implement.

For instance, an application engineer cannot be expected to comprehend the file system's representation of data on the hard disk, and similarly, a middleware engineer can afford to be ignorant of device driver read/write protocols as long as the driver module plays by the documented interface rules. An integration engineer needs to know only the abstract functionality of the modules and their corresponding interface rules to combine them into a product.

Thus, by lowering the knowledge barrier, we reduce cost and time to market. The challenge of implementing functional blocks then lies in balancing abstraction with performance: too much modular generality slows a system; too little makes it rigid and fragile.

 

“Open v/s Closed Source” → Impact of the Extended Order

An open-source framework accentuates the advantages of this modular construction. While proprietary systems evolve only among a set of known collaborators, open source leverages a global extended order, enabling contributions from both known and unknown individuals. From a development perspective, it harnesses the expertise of a much larger group.

Market economics is about finding the most productive employment of time and resources; in our case, it is about discovering all the possible uses of an abstract functional module. The lower barrier to knowledge within the open-source market accelerates this discovery process, and the modular structure coordinates the dispersed expertise. In other words, depending on their individual expertise, anyone can integrate new functional blocks or improve and tailor existing ones. For instance, a generic Linux kernel driver might eventually end up in a server, a TV, or a smartphone, depending on how that module is combined with the rest of the system.

Figure 3: Open vs Closed ecosystems — cohesion vs disjointed growth.

The above Venn diagrams illustrate how the nature of an order can influence the development, cohesion and organization of these functional blocks.

“Universal Epistemological Problem” → The Knowledge Challenge

What emerges from these modular interactions is not merely technology, but an order — a living system shaped by countless contracts, shared rules, and dispersed expertise.

This is the emergent order of computer science and engineering: a subset of the larger economic order, subject to the same knowledge problem Friedrich Hayek famously described. No single mind can master it—yet through modularity, openness, and shared rules, it flourishes.