Man versus machine has always been an exhilarating contest, whether it is movie-making (2001: A Space Odyssey, Terminator, The Matrix), chess (Kasparov vs. Deep Blue), or maybe even coding (compiler-generated code vs. programmer-generated code).

The compilers for the Analog Devices and Freescale DSPs I have been working on exhibit such maturity that even the worst-written C code performs like a Formula One car. If you are writing system software, it seems futile to attempt to increase performance with direct assembly coding. There are, of course, some exceptions:
- Certain parts of an operating system have to be written in assembly because they are platform-dependent code (for example, the trap interrupt in Linux is generated by code written in assembly)
- DSP algorithms which need to be optimized using SIMD capabilities
For device driver coding and generic firmware development we hardly need to invest time in writing assembly; well-written, modular C code will suffice.
It’s a popular misconception that critical code which needs to perform well has to be written in assembly. For example, I remember some tech leads insisting that Interrupt Service Routines be written in assembly; I am fairly sure that if they had done some empirical analysis of their code they would have got a rude shock.
A programmer should be able to finish the assigned work in the most optimal way; there is no point investing weeks in optimizing and writing code in assembly to save a few cycles, only to find later that the performance deteriorates for different inputs. The following steps are usually an optimal methodology:
- Write the code in C
- Profile the code and identify bottlenecks (a rough timing sketch is given after this list)
- Identify those critical parts of the code which should be improved
- Analyze the assembly code generated for those critical parts and see if any optimizations can be done with minimal time investment and maximum returns (for example, if we focus our optimization on code which executes in a loop, we get more returns)
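For a quick first pass on a host machine, something as simple as the standard clock() call gives a rough idea of where the time goes. The do_work() routine below is just a stand-in for whatever function you suspect is hot; on the actual DSP you would rely on the toolchain profiler or the core cycle counter instead.

#include <stdio.h>
#include <time.h>

/* do_work() is only a placeholder for the routine under suspicion */
static volatile int sink;

static void do_work(void)
{
    int i;
    for (i = 0; i < 1000; i++)
        sink += i;
}

int main(void)
{
    clock_t start, end;
    int i;

    start = clock();
    for (i = 0; i < 100000; i++)
        do_work();
    end = clock();

    printf("do_work: %.3f seconds for 100000 calls\n",
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}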
Now, in an ideal scenario with a reasonable time investment, there is no way a programmer can beat the compiler in terms of performance optimization. So is there a way we can beat the compiler? One possible advantage a programmer has over the compiler is knowing the nature of the input; if the optimization strategy is focused on exploiting these nuances of the input, the programmer is bound to get amazing results.
Consider an example which counts the number of odd and even elements in a list.
The C code for it is given below:
#include <stdbool.h>

/* Globals assumed to be defined elsewhere in the original program */
int array[40];
int final_num;
int final_odd_num;

void count_list()
{
    int i = 0;
    bool temp = 0;
    int even = 0;
    int odd = 0;

    for (i = 0; i < 40; i++)
    {
        temp = (array[i] & 0x00000001);   /* Test the least significant bit */
        if (temp == 0)
        {
            even = even + 1;
        }
        else
        {
            odd = odd + 1;
        }
    }

    final_num = even;
    final_odd_num = odd;
}
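Just to make the example self-contained, here is a small driver one might use. The three-odd-to-one-even fill pattern is an assumption chosen to match the input distribution discussed later, not something implied by the original code.

#include <stdio.h>

int main(void)
{
    int i;

    /* Fill the list so that roughly three out of four elements are odd */
    for (i = 0; i < 40; i++)
        array[i] = (i % 4 == 0) ? 2 * i : 2 * i + 1;

    count_list();
    printf("even = %d, odd = %d\n", final_num, final_odd_num);
    return 0;
}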
If we analyse the code, almost all the cycles are spent in the loop where we check for the odd and the even elements.
Let's look at the assembly code generated for the loop.
/* Set up the zero over head loop */
LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ; /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ; /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ;/* Check the first bit and jump if set*/
NOP ; /* NOP to prevent unintended increment */
R1 = R1 + R2 ; /* If branch not taken then increment */
P0 += 4 ; /* Increment the address to be read */
….
..
R0 = R0 + R2 ; /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ; /* Jump back to the starting of the loop */
Now, can we optimize the above loop? The main bottleneck in the above code is the jump; each conditional jump misprediction will cost us 8 core clocks.
How can we reduce the cost of this jump? Here we can analyse the input pattern of the array. Let's say the input array always consists of a majority of odd elements; in such a scenario we get a branch misprediction in the majority of cases. Let's turn the tables here. The optimized loop is given below.
/* Set up the loop */
lsetup(Loop_starts,Loop_ends)LC0 = p0;
Loop_starts:
r1 = extract(r0,r2.l)(z)||r0 = [i0++]; /* Check the bit 0 and read the next element at the same time */
cc = r1; /*Assign LSB to CC flag */
if cc jump odd_num(bp); /* If CC is set then jump (note branch prediction) */
r5 += 1; /* If branch not taken then increment even */
odd_num:
NOP;
Loop_ends:
r6 += 1; /* If branch taken then increment odd */
The above code has two critical changes inside the loop.
- Parallel instruction execution
- Branch Prediction for conditional jump
Parallel instruction execution is an obvious advantage. Branch prediction reduces the penalty for odd numbers to 4 clock cycles but increases the penalty for even numbers to 8 clock cycles. Since the odd numbers are in the majority, on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to exploit the nuances present in the input pattern; the compiler is oblivious to such details and cannot generate code that is ideal for all kinds of inputs, but we, on the other hand, can tailor the assembly so that it caters to certain specific types of inputs.
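As a side note, some of this input knowledge can also be handed to the compiler instead of the assembler. GCC-style compilers accept __builtin_expect as a hint about the likely branch direction; whether a given Blackfin toolchain turns the hint into a (bp) bit is not guaranteed, so the sketch below is only an assumption to be checked against the generated assembly, not a sure win.

#define LIKELY(x)  __builtin_expect(!!(x), 1)

void count_list_hinted(void)
{
    int i;
    int even = 0;
    int odd = 0;

    for (i = 0; i < 40; i++)
    {
        /* Tell the compiler that odd elements are the common case */
        if (LIKELY(array[i] & 1))
            odd = odd + 1;
        else
            even = even + 1;
    }

    final_num = even;
    final_odd_num = odd;
}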