We ported and benchmarked a flash file system to Linux running on an ARM board. Porting was done via FUSE, a user-space file system mechanism in which the file system module itself runs as a process inside Linux. File I/O calls from other processes are eventually routed to the FUSE process via inter-process communication (IPC). This IPC is enabled by a low-level FUSE driver running in the kernel.
The above diagram provides an overview of the FUSE architecture. The ported file system was proprietary and not meant to be open sourced; from that perspective, a file system packaged as a user-space library made a lot of sense.
The primary bottleneck with FUSE is performance. The control-path timing for a 2 KB file read use-case is elaborated below. Please note that 2 KB corresponds to the NAND page size.
1. User-space app to kernel FUSE driver switch – 15 µs
2. Kernel FUSE driver to user-space FUSE library process context switch – 1 to 15 ms
3. Switch back into kernel mode for flash device driver access – NAND MTD driver overhead, excluding device delay, is on the order of µs
4. Kernel back to the FUSE library with the data read from flash – 350 µs (NAND dependent) + 15 µs + 15 µs (kernel-to-user mode switch and back)
5. FUSE library back to FUSE kernel driver process context switch – 1 to 15 ms
6. Finally, FUSE kernel driver to the application with the data – 15 µs
As you can see, the two process context switches take milliseconds, which kills the whole idea. If performance is crucial, profile the context-switch overhead of the operating system before attempting a FUSE port. A loadable kernel module would seem to be the better alternative.
Real-time embedded software running on ASICs with limited memory and capabilities tends to lack paging and a virtual memory unit as well, which in turn places constraints on dynamic memory allocation and stack usage. Run-time allocation of memory depends heavily on the code execution pattern, and manually analyzing every possible execution path is not always feasible. An easier way is an extensive test suite with memory profiling.
The stack for each task in the system can be optimally sized only by profiling its usage. We will not detail the various profiling methods here, but instead jot down a strange observation where the stack allocation of a particular function appeared abnormally high.
Please take a look at the following example code.
As you can see, the above dummy function executes one of several switch cases depending on the "code" passed to it, and a few stack variables (namely x, y and z) are allocated at the start of the function. Taking the size of the "int" type to be 4 bytes, we can comfortably say that the stack allocation at the start of the function should be 4 * 3 = 12 bytes; instead, I noticed that the stack allocated by the ARM assembly code was 30 bytes. The assembly looked something like what is given below.
It looks like the assembly allocates the stack just once, after taking into account the maximum stack usage of the function: the first three integers (x, y and z) account for 12 bytes, and switch case 'c' accounts for the remaining 18 bytes (3 integers and 3 short integers), so the total depth of the stack needs to be at least 30 bytes. Normally we would have expected a separate stack allocation inside each switch case, which means the assembly might have had the following sequence included in it.
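The per-case sequence we would normally have expected, sketched in the same illustrative style (again, not actual compiler output):

```
dummy_function:
        SUB   sp, sp, #12        @ allocate only x, y, z at entry
        ...
case_c:
        SUB   sp, sp, #18        @ grow the stack on entering case 'c'
        ...                      @ case 'c' body
        ADD   sp, sp, #18        @ shrink it again on leaving the case
        ...
        ADD   sp, sp, #12        @ release x, y, z at function exit
        BX    lr
```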
The above sequence adds more code and more execution overhead, so it is not really the most optimal approach; instead, the compiler decided to allocate the whole stack in one shot. Execution is faster and code size is marginally reduced. Everything looks good, except that stack usage has taken a hit. How? The stack used inside the function "operation_add" would be allocated right after the 30-byte allocation, whereas with the "non-optimized" sequence above it would have been allocated after only 12 bytes. The optimization in speed, in other words, comes with a trade-off in stack usage.
This was indeed an interesting observation: the code must have been compiled with optimization set for speed rather than memory, hence the compiler conveniently decided not to care much about stack memory. Even function inlining adds the stack of the inlined function to that of the caller. Again, a typical case of the speed/memory trade-off, as invariably all kinds of optimization are!
Man versus machine has always been an exhilarating contest, whether it's movie making (2001: A Space Odyssey, Terminator, Matrix), chess (Kasparov vs. Deep Blue), or maybe coding (compiler-generated code vs. programmer-generated code).
The compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst-written C code performed like a Formula One car. If you are writing system software, it seems futile to attempt to increase performance with direct assembly coding. There are of course some exceptions, like:
Certain parts of an operating system which have to be written in assembly because they are platform-dependent code (for example, the trap interrupt in Linux is generated by code written in assembly)
DSP algorithms which need to be optimized using SIMD capabilities
For device driver coding and generic firmware development we hardly need to invest time in writing assembly; well-written modular C code will suffice.
It’s a popular misconception that critical code which needs to perform well has to be written in assembly. For example, I remember some tech leads insisting that interrupt service routines be written in assembly; I am pretty sure that if they had done some empirical analysis on their code they would have got a rude shock.
A programmer should be able to finish the assigned work in the most optimal way. There is no point investing weeks in writing and optimizing assembly to save a few cycles, only to find later that performance deteriorates for different inputs. The following steps are usually an optimal methodology:
Write the code in C
Do profiling of the code and identify bottlenecks
Identify those critical parts of the code which should be improved
Analyze the assembly generated for those critical parts and see whether any optimization can be done with minimal time investment and maximum returns (for example, focusing the optimization on code that executes in a loop gives more returns)
Now, in an ideal scenario and with a reasonable time investment, there is no way a programmer can beat the compiler at performance optimization. So is there a way to beat it at all? One possible advantage a programmer has over the compiler is knowing the nature of the input; if the optimization strategy is focused on exploiting those nuances of the input, he is bound to get amazing results.
Consider an example code which counts the number of odd and even elements in a list.
The C code for the same is given below (the counting loop is reconstructed here; `num_list` and `size` are the assumed input array and its length):
int i, even = 0, odd = 0;
for (i = 0; i < size; i++) {
    if (num_list[i] & 1)   /* test bit 0 */
        odd++;
    else
        even++;
}
If we analyse the code, almost all the cycles are spent in the loop where we check for the odd and the even elements.
Let's look at the assembly code generated for the loop.
/* Set up the zero over head loop */
LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ; /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ; /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ;/* Check the first bit and jump if set*/
NOP ; /* NOP to prevent unintended increment */
R1 = R1 + R2 ; /* If branch not taken then increment */
P0 += 4 ; /* Increment the address to be read */
R0 = R0 + R2 ; /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ; /* Jump back to the starting of the loop */
Now, can we optimize the above loop? The main bottleneck is the conditional jump: each branch misprediction costs us 8 core clocks.
How can we reduce the cost of this jump? We can analyse the input pattern of the array; let's say the input array always consists of a majority of odd elements. In that scenario we get a branch misprediction in the majority of iterations. Let's turn the tables. Below is the optimized loop.
/* Set up the loop */
lsetup(Loop_starts,Loop_ends)LC0 = p0;
Loop_starts: r1 = extract(r0,r2.l)(z)||r0 = [i0++]; /* Check the bit 0 and read the next element at the same time */
cc = r1; /*Assign LSB to CC flag */
if cc jump odd_num(bp); /* If CC is set then jump (note branch prediction) */
r5 += 1; /* If branch not taken then increment even */
r6 += 1; /* If branch taken the increment odd*/
The above code has two critical changes inside the loop.
Parallel instruction execution
Branch Prediction for conditional jump
Parallel instruction execution is an obvious advantage. Branch prediction reduces the penalty for odd numbers to 4 clock cycles but increases the penalty for even numbers to 8 clock cycles. Since odd numbers are in the majority, on the whole we end up reducing cycle consumption by almost 30%. One of the most effective ways to beat the compiler, then, is to exploit nuances present in the input pattern: the compiler is oblivious to such details and cannot generate code that is ideal for all kinds of inputs, but we can tailor the assembly so that it caters to certain specific types of input.
Some of the features which distinguish a DSP processor are:
Zero Overhead loop support
Circular Buffering support
There was code which used the circular buffering capabilities of StarCore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly, and written so that the maximum number of bytes was copied per MOVE operation. In other words, if the source and destination were aligned on 8-byte boundaries, the assembly did a MOVE.64 instruction, which copied 64 bits at a time.
The pseudo code of the assembly is given below.
Start of loop:
    if (src % 8 == 0) && (dest % 8 == 0) && (size >= 8)
        move.64 src,dest (instruction to move 8 bytes)
    else if (src % 4 == 0) && (dest % 4 == 0) && (size >= 4)
        move.32 src,dest (instruction to move 4 bytes)
    else if (src % 2 == 0) && (dest % 2 == 0) && (size >= 2)
        move.16 src,dest (instruction to move 2 bytes)
    else
        move.8 src,dest (byte copy)
The above method adds the overhead of checking the alignment on each iteration, but the clock cycles saved by moving more bytes per MOVE instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8, or at least a multiple of 4.
The circular buffering was implemented using the index, base, length and modifier registers; let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0, I0 is reset back to B0. This way circular buffering is maintained without any additional software overhead of checking bounds.
Let's consider a buffer with the following attributes:
Size = 8 bytes
Base address = 0x02
Now the writes into the circular buffer of 8 bytes happened in the following order
Write 1: 2 bytes
Write 2: 4 bytes
Write 3: 4 bytes
Let's see what happens in this scenario. Below is the Blackfin assembly code which does the writes.
_main: /* Initialize the values of the registers for circular buffering */
i0.l = circular_buff; /*Buffer Address – 0x02*/
i0.h = circular_buff; /*Buffer Address – 0x02*/
l0 = 8; /*Length of the buffer */
b0 = i0; /*Base address */
r0.l = 0xFFFF; /*Dummy Value which will */
r0.h = 0xeeee; /*be written into the buffer */
w[i0++] = r0.l; /*Write 0xFFFF at address I0-0x02 */
[i0++] = r0; /*Write 0xFFFFEEEE at I0-0x04 */
[i0++] = r0; /*Write 0xFFFFEEEE at I0-0x08 */
             /*At this point the overflow has happened: */
             /*0xEEEE has been written to 0x0A, which is */
             /*out of bounds for the array circular_buff */
w[i0] = r0.h; /*I0 has properly looped back to 0x04 */
As the comments show, the first write of two bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used up locations 0x04 to 0x07.
After the first two writes we need to write 4 more bytes. The location 0x08 is 4-byte aligned and 4 bytes are to be written, so ideally 2 bytes should be written into addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.
With the above code this won't happen: we end up with a 2-byte overflow, yet the index register loops back properly, leaving the first 2 bytes of the buffer unwritten and pointing itself to 0x04.
This was seen on the StarCore when the following conditions were satisfied:
Size of the buffer was a multiple of 8 byte
The start address of the buffer was aligned on a 4-byte boundary, such that the end address of the buffer minus 4 bytes gives an 8-byte-aligned address.
When the above conditions were satisfied we ended up with a 4-byte overflow. This is understandable, because a load or store operation consists of the following steps, which happen in sequence:
Calculate the Load/Store address
Send out the Load/Store address on the address bus
Wait the required number of wait cycles before writing the data onto the store bus or reading data from the load bus.
The circular buffering logic is implemented by the data address generator (DAG) module in the core. The bus protocol which loads or stores a value is not aware of the wrap condition, so it goes ahead and writes or reads the value at the overflowed address, and we get a corruption.
Making sure that the index register points to a valid address is done by evaluating a simple formula like
Index-new = Index-old + Modifier – Length;
This happens during one of the later pipeline stages (most probably the write-back stage, or just before it), while the address is generated at a much earlier stage and the data is transferred around the same time. All this makes sense; the hardware is perfectly within its rights when it does a byte overflow.
So this was indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned on an 8-byte boundary.