Man Versus Compiler

Man versus machine has always been an exhilarating contest, whether its movie making (2001:A Space Odyssey, Terminator, Matrix) or chess (Kasparov v/s Deep Blue) or may be coding (Compiler Generated Code v/s Programmer generated code).

A Space Odyssey
A Space Odyssey

Compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst written C code performed like a formula one car. If you are writing system software then it seems futile attempting to increase the performance with direct assembly coding. There are of course some exceptions like.

  • Certain parts of Operating System has to be written in Assembly because they are platform dependent code(For example: Trap interrupt in linux is generated by a code written in assembly)
  • DSP algorithms which needs to be optimized by using SIMD capabilities.

For device driver coding and generic firmware development we hardly need to invest time in writing assembly, a well written modular C code will suffice.

It’s a popular misconception that critical code which needs to perform well has to be written in assembly, for example I do remember some of the tech leads insisting on Interrupt Service Routines to be written in Assembly, pretty sure if they had done some empirical analysis on their code they would have got some rude shocks.

A programmer should be able to finish the work assigned in the most optimal way, there is no point investing weeks in optimizing and writing code in assembly for saving a few cycles, you never know later you might see that the code performance deteriorated for different inputs. Usually the following steps should be considered as an optimal methodology.

  • Write the code in C
  • Do profiling of the code and identify bottlenecks
  • Identify those critical parts of the code which should be improved
  • Analyze the assembly code generated of those critical parts and see if any optimizations can be done with minimal time investment and maximum results (For example, if we focus our optimization to the code which execute in a loop then we get more returns)

Now in an ideal scenario with a reasonable time investment  there is no way a programmer can beat the compiler in terms of performance optimization, so is there a way we can beat the compiler? One possible advantage a programmer might have over the compiler is that he knows what will be the nature of the input, if his optimization strategy is focused on exploiting these nuances of the input then he is bound to get amazing results.

Consider an example code which counts the number of odd and even elements in a list.

The C Code for the same is given below:

void count_list()
int  i    = 0;
bool temp = 0;
int  even = 0;
int  odd  = 0;

for(i = 0; i < 40 ; i++)
temp = (array[i] & 0x00000001);

if(temp == 0)
even = even + 1;
odd = odd + 1;
final_num = even;
final_odd_num = odd;

If we analyse the code, almost all the cycles are spend in the loop where we check for the odd and the even elements.

Lets look at the assembly code generated for the loop.

/* Set up the zero over head loop */

LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ;                                   /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ;                     /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ;/*  Check the first bit and jump if set*/
NOP ;                                                              /* NOP to prevent unintented increment */
R1 = R1 + R2 ;                                             /* If branch not taken then increment */
P0 += 4 ;                                                       /* Increment the address to be read */



R0 = R0 + R2 ;                                            /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ;        /* Jump back to the starting of the loop */

Now can we optimize the above loop? The main bottleneck in the above code is the jump, each conditional jump misprediction will cost us 8 core clocks.

How can we reduce the cost of this jump? Here we can analyse the input pattern of the array and lets say that the input array always consists of a majority of odd elements, in such a scenario we have a branch misprediction in the majority of the cases. Lets turn the tables here. Below I have given the optimized loop.

/* Set up the loop */

lsetup(Loop_starts,Loop_ends)LC0 = p0;
r1 = extract(r0,r2.l)(z)||r0 = [i0++]; /* Check the bit 0 and read the next element at the same time */
cc = r1;                                                             /*Assign LSB to CC flag */
if cc jump odd_num(bp); /* If CC is set then jump (note branch prediction) */
r5 += 1;                                                            /* If branch not taken then increment even */
r6 += 1;                                                           /* If branch taken the increment odd*/

The above code has two critical changes inside the loop.

  • Parallel instruction execution
  • Branch Prediction for conditional jump

Parallel instruction execution is an obvious advantage. Branch prediction reduced the penalty in the cases of odd numbers to 4 clock cycles but increased the penalty in case of even numbers to 8 clock cycles. As we know that the odd numbers are in majority so on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to exploit the nuances present in the input pattern, compiler is oblivious to such details and cannot generate a code ideal for all kinds of inputs but we on the other hand can tailor the assembly so that it caters to certain specific types of inputs.

Catch in Circular Buffering

Some of the features which distinguishes a DSP processor are.

  • Multipliers
  • Video ALUs
  • Zero Overhead loop support
  • Circular Buffering support

There is a code which used the the circular buffering capabilities of Starcore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly. It was also written to make sure that at a time maximum number of bytes were copied using a MOVE operation. In other words if the source and destination was aligned at 8 byte boundaries then the assembly did a MOVE.64 instruction which copied 64 bits at a time.

The pseudo code of the assembly is given below.


  1. Start of Loop:
  2. if (src%8 == 0) &&(dest%8 == 0)&&(size >=8)
  3. move.64 src,dest (instruction to move 8 bytes)
  4. else if (src%4 == 0) &&(dest%4 == 0)&&(size >=4)
  5. move.32 src,dest (instruction to move 4 bytes)
  6. else if (src%2 == 0) &&(dest%2 == 0)&&(size >=2)
  7. move.16 src,dest (instruction to move 2 bytes)
  8. else move.8 src,dest  (byte copy)

The above method adds the overhead of checking the alignment each time but the clock cycles saved by moving more bytes at a time using the move instruction is much more because size of most of the data written into this circular buffer was a multiple of 8 or at least a multiple of 4.

The circular buffering was implemented using the Index, base, length and the modifier register. Lets name them I0,B0,L0 and Mo, the hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0 it makes sure that I0 is reset to B0. This was circular buffering is maintained without any additional software overhead of checking bounds.

Lets consider a buffer which has the following attributes.

  • Size = 8 bytes
  • Base address = 0x02

Now the writes into the circular buffer of 8 bytes happened in the following order

  • Write 1 : 2 bytes
  • Write 2: 4 bytes
  • Write 3: 4 bytes

Lets see what happens in such a scenario. Below I have given the BlackFin Assembly Code which does the write.

/* Initialize the values of the registers  for circular buffering */
i0.l = circular_buff;  /*Buffer Address – 0x02*/
i0.h = circular_buff; /*Buffer Address – 0x02*/
l0 = 8;                             /*Length of the buffer */
b0 = i0;                          /*Base address */

r0.l = 0xFFFF;        /*Dummy Value which will */
r0.h = 0xeeee;        /*be written into the buffer */

w[i0++] = r0.l;        /*Write 0xFFFF at address I0-0x02               */
[i0++] = r0;              /*Write 0xFFFFEEEE at I0-0x04                     */
[i0++] = r0;              /*Write 0xFFFFEEEE at I0-0x08                     */
/*At this point the overflow has happened  */
/*0xEEEE has been written to 0x0A,
*/ /*which is out of bounds for the array circular_buff */

w[i0] = r0.h;            /*I0 has properly looped back to 0x04 */

As you can see the comments, the first write of two bytes took up the locations 0x02 and 0x03, second write of 4 bytes used up locations 0x04 to 0x07.

Now after the first two writes we need to write 4 bytes more, the location 0x08 is 4 byte aligned and 4 bytes have to be written also, so ideally 2 bytes should be written into the address 0x08 and 0x09 and the third and the fourth byte should be written to the starting of the buffer afer a loop back.

With the above code this won’t happen and we will end up having a 2 byte overflow but the Index register will loop back properly and leave the first 2 bytes of the buffer unwritten and point itself to 0x04.

This was seen in the starcore when the following conditions were satisfied.

  • Size of the buffer was a multiple of 8 byte
  • Start address of the buffer was aligned at 4 byte boundaries so that the end address of the buffer minus 4 bytes will give you a 8 byte aligned address.

When the above conditions were satified we ended up having a 4 byte overflow. This is understandable because a load or a store operation has the following operations and they all happen in sequence.

  1. Calculate the Load/Store address
  2. Send out the Load/Store address on the address bus
  3. Wait for the required amount of wait cycles before Writing the data onto the Store bus or reading data from the Load bus.

The circular buffering logic is implemented by the Data Address generator module in the Core and the Bus protocol which loads or stores a value is not aware of this condition because of which it goes ahead and writes or reads the value from the memory and we will have a corruption.

Making sure that the Index register is pointing to a valid address is done by executing a simple formula like

Index-new = Index-old + Modifier – Length;

This happens during one of the later stages in the pipeline (most probably the Write back Stage or just before that) while the address is generated at a much earlier stage and data gets read during the same time. All this makes sense, hardware is perfectly right when he did a byte overflow.

So this is indeed a bug in the software which we solved by making sure that the base address of circular buffer is always aligned at 8 byte boundaries.

How does Onchip Breakpoints work?

Internal specifics of Onchip breakpoints on Freescale StarCore processors is easily understood if we are familiar with its emulator. So it’s good idea to read the following two posts before going through the details below.

Introduction to Debug modules

On Chip Emulator


On Chip Breakpoints:

A descriptive diagram of the whole debug infrastructure is given below.

EOnCE Controller
EOnCE Controller
  • EDU: Event Detection unit
  • EC: Event Counter
  • TB unit: Trace Buffer unit
  • EEx & EED : General purpose control signals which can be configured for input or output. Usage of this is SOC specific (Derivative specific).

Before we configure the break points the processor core should be placed in DEBUG mode, this DEBUG mode can invoked using two methods:

  • Send a Command to the controller
  • Assert the EEx signal as soon as the core comes out of reset

Using the second method will need configuring of registers accordingly. Once the Core is in the DEBUG mode, breakpoints can be configured via JTAG. Here we are not going to discuss the details of register settings but a design level overview of how this OnChip Emulator works.

The breakpoints/watchpoints should be enabled only after programming the EDU (Event Detection Unit) registers with the configuration which defines that breakpoint. Namely the address in the memory, type of the breakpoint etc. These registers can be written via the JTAG.

So the two methods employed for enabling an already configured breakpoint/watchpoint are:

  • Configure the breakpoint/watchpoint enabling register by writing into it using the JTAG port.
  • Use the EEx or the EED control lines to signal the EOnCE controller to enable or disable the debug functionalities.

Once we program the debug configuration, the host needs to be informed when an event happened or when a breakpoint got hit. EOnCE will ensure that the core enters DEBUG mode by sending it a signal but for informing the HOST we need a mechanism like an interrupt.

  • We can configure one out of the EEx signals to send an interrupt to the host once the core enters Debug mode.

From the above discussion it’s clear that breakpoints are supported in hardware via setting of few registers and they can be configured via EOnCE which in turn responds to JTAG interface. If there are multiple break points configured then there are mechanisms to identify which one was hit.

  • Read the EOnCE controller status or monitor registers which will tell you which event has got hit
  • Configure EEx or EED signal to assert an interrupt to the HOST, depending on which signal has asserted the interrupt, HOST can know which break point has got hit.
  • Read the PC Breakpoint detection register from EOnCE, this gives the PC of the address which caused the event.

PC Breakpoint detection register should me used in combination with reading of EOnCE status register because debug mode can be caused by other reasons also.One important point to be noted here is that EOnCE gives many options for the usage of EEx and EED signals but the way in which it will be used is SOC specific and each platform may have different uses of these.

Similarly, if the Event counter has to count “off core” events like Cache hits or misses, the the external signals EC0 and EC1 needs to be used, this is also SOC dependent.

Hardware Design


Logic for configuring breakpoints
Logic for configuring breakpoints


It’s obvious that onchip breakpoints need support in hardware, but we could still use some more clarity on how this support is provided.

  • EOnCE is closely integrated with the processor core because it needs to probe address and data buses for certain reference values configured in breakpoint/watchpoint registers.
  • Whenever processor core accesses an address, EOnCE comes to know about this address value and it compares it with the reference value. If they are same then, depending on the Event selector configuration an event will be raised.
  • The event can be a DEBUG exception, DEBUG signal to the processor, triggering a trace, disabling a trace etc.

Similar logic can be employed for probing data values also. The above diagram we can see two comparators, A & B, these are two address buses for data access, if we do not know in which bus the data we want will be accessed then we need to configure both Comparator A & B with the reference value and then set the condition for an OR-ing of the comparisons, so that even if one comparison returns a success we will have an event raised.

MASK register: The way it works is pretty self explanatory, the address bus values sampled is masked with the MASK register value and then compared with the reference values.

Trace Buffer Unit can detect and record program flow related details and send it to an off-core module like Nexus, from Nexus it can be sent to the host using a Nexus port on the board or it can be saved in a circular buffer inside any On-Chip memory.

EOnCE: The onchip emulator

EOnCE comprises of 6 core components.

• EOnCE Controller
• Event Counter
• Event Detection Unit (EDU)
• Synchronizer
• Event Selector
• Trace Unit

EOnCE Controller:

Without this contoller JTAG wont be able to access or program EOnCE. Using this interface we can put the core into debug mode & also write into EOnCE registers, basically this is the JTAG gateway into EOnCE.

Event Counter

Counters can count various events! These events can be configured by setting its registers. For example, it can be the number of times a watchpoint was hit or the number of instructions executed, or any external events like L1 cache hits, cache miss etc.

Event Detection Unit (EDU)

EDU forms the core logic of EOnCE, this module can be configured to set breakpoints, watchpoints, or to monitor the address on the data bus etc. EDU can also send a signal to Even Counter so that it can count the number of times an address was referenced or a data was accessed etc.


This module is required to synchronize external signals with the internal clock. There are a set of general purpose signals which can be programmed to do various operations. For example, we can configure the General purpose signal EE0 to configure the EDU whenever it is asserted through JTAG.

Event Selector

This is the last unit in the debugging chain, in other words this decides what action needs to be taken when an event is generated by EDU or Event Counter. The action can be something like putting the Core to the debug mode, or it can be starting a program trace, or raising of a debug exception etc.

Trace Unit

As the name suggests, this is the hardware which monitors the program counter and generates the call trace.

Below Pic depicts the EOnCE controller and its component modules.

EOnCE framework
EOnCE framework

Debug Modules : Introduction

Working with on-chip debug IP modules was a definite learning experience. It included writing low level software for setting breakpoints, watchpoints,  configuring of call trace and also maintenance of a hardware dependent high performance data logging module. Such debug modules rarely get the appreciation which they deserve, considerable amount of chip space is dedicated for hardware IPs which enable us to develop such complex embedded software.

Each processor comes with its own unique ways of implementing debug modules, my understanding is based on the experience with StarCore. Lets look at the the features of the On chip Emulator which gives the infrastructure for doing all the debug activities. Below we can see the diagram of a typical JTAG  based debugging setup.

JTAG based Debugging setup
JTAG based Debugging setup

EOnCE(Enhanced Onchip Emulator)

EOnCE is the debug module which add all the above mentioned capabilities to the StarCore based platforms. There are various points which we need to note about EONCE.

  • Eonce is accessible using JTAG as well as from core by addressing its memory mapped registers.
  • Simultaneous access of Eonce from JTAG as well as Core will result in JTAG access getting prioritized over the StarCore access.
  • JTAG writes into EOnCE registers using JTAG port and configures it for debugging related features.
  • When we set a break point or configure a trace using the debugger on the Host PC at that time a message is send to the EOnCE module on the Target, this message makes sure that our debug configurations are applied on the chip.
EOnCE framework
EOnCE framework


Downloading binary using JTAG!

This is one of the most obvious reasons to use a JTAG debugger, pretty much sure that we can never deliver anything on time if we have to start flashing the binary each time we compile and test it. Every module on the chip has a controller(TAP Controller)associated with it which makes it JTAG compliant , in other words this very TAP controller connects the JTAG port to the hardware module.

TAP interfaced with the System JTAG Controller
TAP interfaced with the System JTAG Controller

In the above pic we can see the DMA module having a TAP controller and it being interfaced with the System JTAG controller(SJC), which is the main JTAG controller, as illustrated in the above pic every peripheral and core logic inside a chip will have a TAP controller and this is accessed and configured using SJC. Now lets see what all steps are involved in downloading code into the target RAM.

  • First we need to place the processor (StarCore) is in DEBUG mode, for that we need to send the DEBUG Command to SJC.
  • Next step is for accessing the EOnCE, execute the command to select EONCE TAP controller through SJC. From this point on we will be able to directly control EOnCE using JTAG.
  • EOnCE module has the capability to make the star core execute commands like MOVE and and some of the arithmetic operations, this feature of EOnCE will be used to transfer the program data into the memory
  • Now send the program data to the EOnCE recieve register, now we need to move this program data to the StarCore accessible memory. For that we need to execute a MOVE instruction.
  • Every command which needs to be issued to the StarCore through JTAG has to be written into a specific Econce Command register.
  • Now write a MOVE instruction into the Core comand register to transfer the Program data from the receive register to any CORE register (Because we can write data into memory only from a CORE register not from any EOnCE register)
  • Now send a Core command to move the data from the Core Register to the memory location. This completes one transfer
  • The above steps are repeated till the whole program code is written into the memory.

Now that makes up the rather elaborate sequence for downloading contents into the RAM.

Blackfin Core

This picture is right out of the BlackFin HRM.


Without repeating what is already covered in the HRM let us try to analyse the capabilities of the Core from a different perspective.

Load Store Architecture means

  • All the operations are done on the registers
  • That would in turn mean that all the computational units in the core needs their inputs from the registers
  • This mandates that there has to be internal buses which connect the computational units to the registers.
  • A set of buses as the input and another set which transmits the output back to the registers.

The Diagram Below depicts exactly what we need.

Core Modules connected by internal Buses
Core Modules connected by internal Buses

This is quite fascinating, we can exactly know the reason why some instructions work while others do not.

Consider the Operation :

R0 = R1+ R2;

How does this work? There are two 32 bit buses running from the Data register file to the ALU0, values inside R1 & R2 will be transferred through them into ALU0 and once the operation is done the data come out back and gets written into R0. Perfect!

Now why does the following operation does not work?

P0 = R0 + R1;

The answer  lies again in the bus architecture. Eventhough R0 and R1 can be transferred to ALU0 we do not have a Bus running from ALU0 to Pointer Register File, hence there is no way that this operation can succeed.

*Please note that by default ALU0 will be used and ALU 1 will come into picture only in case of parallel operations, which can be discussed in another post.

Idea here is to prod us to start thinking from this perspective. When we understand the bus architecture, we understand how to write the best possible assembly code while remaining within the constraints of the system. All this data and instruction buses connecting memory, registers & computational units form the backbone of an architecture. This inherently determines the strengths as well as the weakness of the processor.

Load Store Continued

The LS (Load Store) story has some more twists which i did not mention in the previous post simply because it had already become way too long.

Stack Memory?

RISC machines always come with sufficient amount of registers which in turn results in storing of all the local variables inside registers itself and hardly any stack memory usage.Example code illustrating this phenomenon is given below.

Consider the below given C Code:

int add_on(int a,int b)
int i =0;
for(i =0;i < 10; i++)
a = a+b;

The corresponding BlackFin Assembly code does not use a single stack variable. The register P0 and Loop counter register( LC0 ) takes care of the purpose of stack variable “i”.

P0 = 10;
/* Initialize the Zero Overhead Loop registers */
Lsetup(Loop_Starts,Loop_Ends) LC0 = P0;
/* Start of the loop */
Loop_Starts: Loop_Ends:
R0 = R0 + r1;

Zero What! Overhead Loop?

Compilers also reduce the overhead of context switching between function calls, how? Ideally in between calls, all the registers need not be pushed on to stack, but only those registers which are used inside that particular function is saved and restored. Because the rest  anyway remain untouched and hence the  costly affair of stack memory usage and slow memory access is minimized. Result is a lean stack usage, have seen this feature on BlackFin compilers but yet to see on Starcore though!

Bogus Store

Within the load-store structure, most of the time store follows the load. But sometimes even though in the code we might see a store written before another memory read we really cannot guarantee the sequence in which it might execute .  Let me quote the lucid BlackFin HRM.

“The relaxation of synchronization between memory access instructions
and their surrounding instructions is referred to as weak ordering of loads
and stores. Weak ordering implies that the timing of the actual completion
of the memory operations—even the order in which these events
occur—may not align with how they appear in the sequence of the program
source code.”

“Because of weak ordering, the memory system is allowed to prioritize
reads over writes. In this case, a write that is queued anywhere in the pipeline, but not completed, may be deferred by a subsequent read operation,and the read is allowed to be completed before the write. Reads are prioritized over writes because the read operation has a dependent operation
waiting on its completion, whereas the processor considers the write operation complete, and the write does not stall the pipeline if it takes more
cycles to propagate the value out to memory. This behavior could cause a
read that occurs in the program source code after a write in the program
flow to actually return its value before the write has been completed.
This ordering provides significant performance advantages in the operation
of most memory instructions. However, it can cause side effects that
the programmer must be aware of to avoid improper system operation.”

When we get optimizations, we also get trade-offs. But let us keep the price limited to what we pay for in the die space, and not write buggy code and learn the hard way.

The side effect which is mentioned above will come into picture when we are configuring any kind of registers or any such memory, on which a read after write might give different results compared to a read before write. Consider the following sequence

1. Write Register 1

2. Read Register 2

What if Register 2 value depend on the Register1 write? In such situations we need to ensure strict ordering of reads and writes by using the SYNC instructions (CSYNC & SSYNC).


Black Fin Jargons

This post is not of much use on its own, but this is meant to clarify some of the technical jargon mentioned in the other posts and this will be updated as and when.

Multicycle Operations-

Quote from the BlackFin HRM:

“Multi-cycle instructions behave as multiple single-cycle instructions being
issued from the decoder over several clock cycles. For example, the Push
Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS
and/or PREGS, and the instruction remains in the decode stage for a number
of clock cycles equal to the number of registers being accessed.”

In other words, one single instruction will be decoded and executed multiple times.

Zero-Overhead loop

Very much self explanatory, whenever we have a loop in a code we always have two types of overheads

  • > Adding an additional overhead of incrementing and checking a counter which decides the iterations in a loop
  • > Pipeline stalls because of branch mis-prediction at the end of an iteration.

Both the above bottlenecks are eliminated by making hardware aware of the existence of the loop by specifying the starting point and the ending point and the number of iterations which needs to be executed.

Below i have given a memcpy assembly code which uses zero overhead looping.

/* Set  up the loop which runs from the label Loop_starts till the label Loop_ends for P0 times */

Lsetup(Loop_starts,Loop_ends)LC0 = p0;
r4 = b[p1++](z);
/* Read from the source */
b[p2++] = r4;
/* Write to the destination */

Zero Overhead looping is implemented by a hardware module called sequencer.


Why have a Load Store architecture?

Typically a RISC architecture is built on the fundamentals of Load Store methodology. There are a plethora of internet sources for understanding Load-Store, this link should definitely be helpful.

Considerable study has gone into the RISC architecture, these instructions are relatively simple and does simple operations, in other words, a single CISC operation may be equivalent to a a bunch of RISC operations. Let me quote a para from the Hardware Reference manual of BlackFin.

“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.

Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”

This theoretical possibility can be illustrated by the below example. Here we saved 4 processor clock cycles by using the above mentioned logic of interleaving the load operations with an unrelated stack operation .

Initial Code BEFORE inserting unrelated operations between Loads:

(r7:2) = [ sp ++ ];/* STACK POP (Unrelated operation)*/
(p5:4) = [ sp ++ ];/* STACK POP (Unrelated operation)*/
/* MOVE the address of the variables  /*
/* which needs to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
R1 = W[p1](Z); /* LOAD Operation 2 */

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

The above code will behave “almost” similar to a code which runs on a CISC machine, now lets look at the code after interleaving.

Code After Interleaving STACK operation with LOAD operation:

/* MOVE the address of the variables  /*
/* which needs to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R1 = W[p1](Z); /* LOAD Operation 2 */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

Now we know ONE of the reasons for the out of order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how did we end up saving 4 clock cycles.

For the above code snippet the following conditions are satisfied.

  • > Code and Data on the off chip SDRAM
  • > LOAD operation consumes 8 clock cycles within the execution stage of pipeline.
  • > The interleaved STACK POP operation is a “Multicycle” operation and it will spend multiple cycles at the decoder stage of the pipeline.
  • >BlackFin does not execute unless it decodes the instruction and it will definitely execute as soon as it decodes. Hence, the stalling of “load” operation at the execution stage and stalling of the stack operation at the decoding stage happens in parallel because of which overall stalls seems to reduce but in effect the number of stalls for each instruction remain same.

Initial Pipeline Execution BEFORE interleaving:





Write Back


Stack POP





p0.l =_oneByte;

[ sp ++ ]




p0.l =_oneByte;

[ sp ++ ]




p0.l =_oneByte;

[ sp ++ ]




p0.h =_oneByte


[ sp ++ ]





[ sp ++ ]

The blocks marked in RED are pipeline stalls, here the operation p0.l =_oneByte; stalled because the decoder is still decoding the previous stack pop operation unlike the normal scenario in which this should have completed in a single cycle

Pipeline Execution After Interleaving:





Write Back


Stack POP

R0 = B[p0](Z);




R1 = W[p1](Z);

[ sp ++ ]

R0 = B[p0](Z);



R1 = W[p1](Z);

[ sp ++ ]

R0 = B[p0](Z);



R1 = W[p1](Z);

[ sp ++ ]

R0 = B[p0](Z);



p0.h =_oneByte

[ sp ++ ]

R0 = B[p0](Z);




[ sp ++ ]

R0 = B[p0](Z);


*Please note that a 4 stage pipeline is described here just for the sake of simplicity, BlackFin actually has a 10 stages.

I would state that the above methodology made optimal usage of the hardware resources. Both decoder unit and execution unit are utilized at the same time. While in the previous case, the stalls within these units where happening separately and hence added up. Reordering simply made them happen at the same time.