System Programming

Linux File System Stack – 1

June 7, 2012October 31, 2016Mahesh Sreekandath4 Comments

Linux Content Index

— File System Architecture – Part I
–– File System Architecture – Part II
–– File System Write
–– Buffer Cache
–– Storage Cache

Why File System as a Loadable Kernel Module(LKM) ?

As you can see, user space idea didn’t pan out quite well:http://tekrants.me/2012/05/22/fuse-file-system-port-for-embedded-linux/

VFS Data Structures

Inode is probably the most critical abstraction which defines a VFS file entry — This represents every file/directory/link within a file system. If your file system is like FAT and lacks a clear “inode”, then a translation layer will be needed. Eventually it’s about extracting the file information associated with a Linux inode from the file system specific data structures.

File System Initialization Sequence

1. Register file system mount & unmount call backs with the VFS.
2. Mount call is responsible for creation and registration of a root directory inode .
3. Root directory inode is essentially the point of entry to the volume. It furnishes specific function pointers later invoked by VFS for inode related operations (like create) and file operations (like open,read) & directory operations (like readdir).

The above three steps and your file system module is all set, this means Linux will have enough information to translate an “Open” call from application to the file system specific internal open call, thanks to the function pointers inside the root inode.

Dentries

Another kernel structure which exists for every file & directory is a Dentry. For example, accessing a path “/mnt/ramfs” will lead to creation of two in memory dentry structures. One each for “mnt” and “ramfs”. Note that “ramfs” dentry will have a parent pointer to “mnt” dentry and a pointer to its own VFS inode. A Dentry in fact encompasses the attributes like name & handle to parent directory of a file system entry. One of the rationales behind separation of an Inode from these attributes is the existence of file links, where a single Inode is shared across multiple Dentries.

Opening a file — Easily said than done!

a. Consider opening a file with the path “/mnt/ramfs/dir1/dir2/foo.txt”
b. The dentry elements in the above path are “mnt”, “ramfs”, “dir1”, “dir2” & “foo.txt”
c. “mnt” dentry will be part of Linux root file system, all dentries are part of a hash table, the string “mnt” will map to a hash table entry giving its dentry pointer. VFS gets the inode pointer from this dentry and every directory inode has a callback function for look-up operation listing on its file/directory entries.
d. Look up called on “mnt” inode will return the inode for “ramfs” along with its dentry.
c. This is an iterative process and eventually VFS will figure out the inodes & dentries of all the elements in a file path.
d. Note that the Inode associated with “foo.txt” will give the open function pointer to invoke the open call specific to the file system driver.

VFS

A file system ported to Linux is expected to populate the fields of VFS data structures like Inodes and Dentries so that Linux can understand and convey the file attributes and contents to the user. The obvious differentiating factor across file systems like ext4, UBIFS, JFFS2 etc are their respective algorithms, which also defines the internal data structures and device access patterns.

How Dentries/Inodes are represented and accessed from a storage is specific to file system and this inherently defines their strengths and weaknesses. In crude terms, a file system in Linux comprises of a set of call backs for managing generic VFS data structures, basically the Inodes, Dentries, file handlers etc. So we have inode data structure and corresponding associated inode operations, we have file pointer data structure and file operations, dentry data structure and dentry operations and so on.

The crux of a Linux file system is its ability to talk in Linux kernel language of Inodes and Dentries. Also, unless it’s a read only volume this interpretations needs to be done in reverse too. When user makes changes to a file then a file system needs to comprehend the Linux talk and translate those changes into a representation which it might have on the storage. Undoubtedly, comprehending Linux VFS mandates deep understanding of Kernel data structures which might mean that a file system writer needs to have a kernel specific layer in the file system code, this undesirable complexity can be immensely reduced by the use of kernel library functions.

Functions which usually start with “generic_” can be interpreted as such a helper function which abstracts the kernel specifics from a kernel module, this is widely used for file system operations like “read”, “write” and even unmount. The usage of generic helper functions within a kernel module can be confusing when studying the code, because they tend to blur the kernel and a module specific boundaries, this overlap is convoluted but an extremely effective way to avoid kernel dependencies.

Image Source : http://wiki.osdev.org/images/e/e5/Vfs_diagram.png

Some design philosophy

The design thought behind Linux VFS seems to of “choice”, other than a bare minimum there is always a choice provided to the kernel hacker regarding the implementation of an interface. He or she can either use a generic function, create a custom one or simply set it to NULL. This is an ideal approach for supporting a plethora of widely differing devices and file systems, quite easily visible when we look at the code of an ext4 where there is buffer cache abstraction usage over page cache, compared to page cache sans buffer for UBIFS, versus a direct to the device approach of JFFS2. Accommodating all these widely varying designs requires a flexible rule of law driven framework where everyone can find their niche and yet not impinge on the functionality of another kernel module.

A Linux File System – 2

FUSE File System Performance on Embedded Linux

May 22, 2012July 11, 2016Mahesh Sreekandath1 Comment

We ported and benchmarked a flash file system to Linux running on an ARM board. Porting was done via FUSE, a user space file system mechanism where the file system module itself runs as a process inside Linux. The file I/O calls from other processes are eventually routed to the FUSE process via inter process communication. This IPC is enabled by a low level FUSE driver running in the kernel.

http://en.wikipedia.org/wiki/Filesystem_in_Userspace

The above diagram provides an overview of FUSE architecture. The ported file system was proprietary and was not meant to be open sourced, from this perspective file system as a user space library made a lot of sense.

Primary bottleneck with FUSE is its performance. The control path timing for a 2K byte file read use-case is elaborated below. Please note that the 2K corresponds to NAND page size.

1. User space app to kernel FUSE driver switch. – 15 uS

2. Kernel space FUSE to user space FUSE library process context switch. – 1 to15 mS

3. Switch back into kernel mode for flash device driver access – NAND MTD driver overhead without including device delay is in uS.

4. Kernel to FUSE with the data read from flash – 350uS (NAND dependent) + 15uS + 15uS (Kernel to user mode switch and back)

5. From FUSE library back to FUSE kernel driver process context switch. – 1 to 15mS

6. Finally from FUSE kernel driver to the application with the data – 15 uS

As you can see, the two process context switches takes time in terms of Milliseconds, which kills the whole idea. If performance is a crucial, then profile the context switch overhead of an operating system before attempting a FUSE port. Seems loadable kernel module approach would be the best alternative.

Stack Memory Usage

September 11, 2011June 11, 2016Mahesh SreekandathLeave a comment

Real Time Embedded software running on ASICs with limited memory and capabilities tend to lack paging and virtual memory units also. Which in turns places constraints on the dynamic memory allocation and stack memory usage. Run time allocation of memory heavily depends on the code execution pattern, manually analyzing all possible code execution paths may not be always feasible. So an easier way would be to have an extensive test suit with memory profiling.

Stack for each task in the system can be optimally allocated only by profiling its usage. Here we are not going to detail the various profiling methods, but jot down a strange observation where the stack allocation of a particular function was seen as abnormally high.

Please take a look at the following example code.

As you can see, the above dummy function executes a bunch of switch cases depending on the “code” passed to it, there are some stack variables allocated at the staring of the function (namely x,y & z) . Consider the size of “int” type to be 4 we can comfortably say that the stack allocation at the start of the function will be 4 * 3 = 12 bytes, well instead of this i noticed that the stack allocated by the ARM assembly code is 30. The assembly code looked something like what is given below.

Looks like the assembly code is allocating the stack just once after taking into account the maximum stack usage of the function, in other words the first 3 integers (x,y & z) account for 12 and the switch case ‘c’ accounts for the 18 bytes (3 integers and 3 short integers). So the total depth of the stack needs to be at least 30 bytes. Normally we would have expected stack allocation inside each switch case, this means that the assembly code might have had the following sequence included in it.

The above sequence adds more code and more execution overhead, so not really the most optimal way so instead compiler decided that it will allocate the whole stack in one shot. So the execution speed is more and code size also marginally reduced, everything looks good except the fact that now the stack usage seems to have got a hit, how? The stack used inside function “operation_add” would be allocated right after the 30 byte allocation but in the above “non-optimized” sequence we would have had it allocated only after 12 bytes, this means that the optimization in speed comes with a trade-off in the form of stack usage.

This was indeed an interesting observation, the code must have been compiled with optimization setting set for speed instead of memory, hence the compiler conveniently decides not to care much about stack memory. Actually even function inlining will add the stack of the inlined function to that of the caller function, again this is a typical case of speed/memory trade-off, invariably all kinds of optimization are!

Man Versus Compiler

June 25, 2009August 29, 2025Mahesh SreekandathLeave a comment

Man versus machine has always been an exhilarating contest, whether its movie making (2001:A Space Odyssey, Terminator, Matrix) or chess (Kasparov v/s Deep Blue) or may be coding (Compiler Generated Code v/s Programmer generated code).

Compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst written C code performed like a formula one car. If you are writing system software then it seems futile attempting to increase the performance with direct assembly coding. There are of course some exceptions like.

Certain parts of Operating System has to be written in Assembly because they are platform dependent code(For example: Trap interrupt in linux is generated by a code written in assembly)
DSP algorithms which needs to be optimized by using SIMD capabilities.

For device driver coding and generic firmware development we hardly need to invest time in writing assembly, a well written modular C code will suffice.

It’s a popular misconception that critical code which needs to perform well has to be written in assembly, for example I do remember some of the tech leads insisting on Interrupt Service Routines to be written in Assembly, pretty sure if they had done some empirical analysis on their code they would have got some rude shocks.

A programmer should be able to finish the work assigned in the most optimal way, there is no point investing weeks in optimizing and writing code in assembly for saving a few cycles, you never know later you might see that the code performance deteriorated for different inputs. Usually the following steps should be considered as an optimal methodology.

Write the code in C
Do profiling of the code and identify bottlenecks
Identify those critical parts of the code which should be improved
Analyze the assembly code generated of those critical parts and see if any optimizations can be done with minimal time investment and maximum results (For example, if we focus our optimization to the code which execute in a loop then we get more returns)

Now in an ideal scenario with a reasonable time investment there is no way a programmer can beat the compiler in terms of performance optimization, so is there a way we can beat the compiler? One possible advantage a programmer might have over the compiler is that he knows what will be the nature of the input. Optimization strategy focused on exploiting this could be effective.

Consider an example code which counts the number of odd and even elements in a list.

The C Code for the same is given below:

void count_list()
{
int i = 0;
bool temp = 0;
int even = 0;
int odd = 0;

for(i = 0; i < 40 ; i++)
{
temp = (array[i] & 0x00000001);

if(temp == 0)
{
even = even + 1;
}
else
{
odd = odd + 1;
}
}
final_num = even;
final_odd_num = odd;
}

If we analyse the code, almost all the cycles are spend in the loop where it checks for the odd and the even elements.Assembly code generated below.

/* Set up the zero over head loop */

LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ; /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ;                     /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ;/* Check the first bit and jump if set*/
NOP ;                                                              /* NOP to prevent unintented increment */
R1 = R1 + R2 ;                                             /* If branch not taken then increment */
P0 += 4 ;                                                       /* Increment the address to be read */

….

R0 = R0 + R2 ; /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ; /* Jump back to the starting of the loop */

Now can we optimize the above loop? The main bottleneck in the above code is the jump, each conditional jump misprediction costs 8 core clocks.

How can we reduce the cost of this jump? Here we can analyse the input pattern of the array and lets say that the input array always consists of a majority of odd elements, in such a scenario we have a branch misprediction in the majority of the cases. So let’s reverse branch prediction decision. Optimized loop given below.

/* Set up the loop */

lsetup(Loop_starts,Loop_ends)LC0 = p0;
Loop_starts:
r1 = extract(r0,r2.l)(z)||r0 = [i0++]; /* Check the bit 0 and read the next element at the same time */
cc = r1;                                                             /*Assign LSB to CC flag */
if cc jump odd_num(bp); /* If CC is set then jump (note branch prediction) */
r5 += 1;                                                            /* If branch not taken then increment even */
odd_num:
NOP;
Loop_ends:
r6 += 1;                                                           /* If branch taken the increment odd*/

The above code has two critical changes inside the loop.

Parallel instruction execution
Branch Prediction for conditional jump

Parallel instruction execution is an obvious advantage. Branch prediction reduced the penalty in the cases of odd numbers to 4 clock cycles but increased the penalty in case of even numbers to 8 clock cycles. As we know that the odd numbers are in majority so on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to leverage the use case specific details.

Catch in Circular Buffering

May 21, 2009June 6, 2016Mahesh SreekandathLeave a comment

Some of the features which distinguishes a DSP processor are.

Multipliers
Video ALUs
Zero Overhead loop support
Circular Buffering support

There is a code which used the the circular buffering capabilities of Starcore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly. It was also written to make sure that at a time maximum number of bytes were copied using a MOVE operation. In other words if the source and destination was aligned at 8 byte boundaries then the assembly did a MOVE.64 instruction which copied 64 bits at a time.

The pseudo code of the assembly is given below.

memcopy(src,dest,size)

Start of Loop:
if (src%8 == 0) &&(dest%8 == 0)&&(size >=8)
move.64 src,dest (instruction to move 8 bytes)
else if (src%4 == 0) &&(dest%4 == 0)&&(size >=4)
move.32 src,dest (instruction to move 4 bytes)
else if (src%2 == 0) &&(dest%2 == 0)&&(size >=2)
move.16 src,dest (instruction to move 2 bytes)
else move.8 src,dest (byte copy)

The above method adds the overhead of checking the alignment each time but the clock cycles saved by moving more bytes at a time using the move instruction is much more because size of most of the data written into this circular buffer was a multiple of 8 or at least a multiple of 4.

The circular buffering was implemented using the Index, base, length and the modifier register. Lets name them I0,B0,L0 and Mo, the hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0 it makes sure that I0 is reset to B0. This was circular buffering is maintained without any additional software overhead of checking bounds.

Lets consider a buffer which has the following attributes.

Size = 8 bytes
Base address = 0x02

Now the writes into the circular buffer of 8 bytes happened in the following order

Write 1 : 2 bytes
Write 2: 4 bytes
Write 3: 4 bytes

Lets see what happens in such a scenario. Below I have given the BlackFin Assembly Code which does the write.

_main:
/* Initialize the values of the registers for circular buffering */
i0.l = circular_buff; /*Buffer Address – 0x02*/
i0.h = circular_buff; /*Buffer Address – 0x02*/
l0 = 8; /*Length of the buffer */
b0 = i0; /*Base address */

r0.l = 0xFFFF; /*Dummy Value which will */
r0.h = 0xeeee; /*be written into the buffer */

w[i0++] = r0.l;        /*Write 0xFFFF at address I0-0x02               */
[i0++] = r0;              /*Write 0xFFFFEEEE at I0-0x04                     */
[i0++] = r0;              /*Write 0xFFFFEEEE at I0-0x08                   */
/*At this point the overflow has happened */
/*0xEEEE has been written to 0x0A, */ /*which is out of bounds for the array circular_buff */

w[i0] = r0.h; /*I0 has properly looped back to 0x04 */

As you can see the comments, the first write of two bytes took up the locations 0x02 and 0x03, second write of 4 bytes used up locations 0x04 to 0x07.

Now after the first two writes we need to write 4 bytes more, the location 0x08 is 4 byte aligned and 4 bytes have to be written also, so ideally 2 bytes should be written into the address 0x08 and 0x09 and the third and the fourth byte should be written to the starting of the buffer afer a loop back.

With the above code this won’t happen and we will end up having a 2 byte overflow but the Index register will loop back properly and leave the first 2 bytes of the buffer unwritten and point itself to 0x04.

This was seen in the starcore when the following conditions were satisfied.

Size of the buffer was a multiple of 8 byte
Start address of the buffer was aligned at 4 byte boundaries so that the end address of the buffer minus 4 bytes will give you a 8 byte aligned address.

When the above conditions were satified we ended up having a 4 byte overflow. This is understandable because a load or a store operation has the following operations and they all happen in sequence.

Calculate the Load/Store address
Send out the Load/Store address on the address bus
Wait for the required amount of wait cycles before Writing the data onto the Store bus or reading data from the Load bus.

The circular buffering logic is implemented by the Data Address generator module in the Core and the Bus protocol which loads or stores a value is not aware of this condition because of which it goes ahead and writes or reads the value from the memory and we will have a corruption.

Making sure that the Index register is pointing to a valid address is done by executing a simple formula like

Index-new = Index-old + Modifier – Length;

This happens during one of the later stages in the pipeline (most probably the Write back Stage or just before that) while the address is generated at a much earlier stage and data gets read during the same time. All this makes sense, hardware is perfectly right when he did a byte overflow.

So this is indeed a bug in the software which we solved by making sure that the base address of circular buffer is always aligned at 8 byte boundaries.

Embedded Sense

System Software

Linux File System Stack – 1

FUSE File System Performance on Embedded Linux

Stack Memory Usage

Man Versus Compiler

Catch in Circular Buffering