Load Store – Embedded Sense

Multi-Issue instructions and Multi-cycle instructions

May 28, 2009 | June 6, 2016 | Mahesh Sreekandath

Execution of a machine instruction is divided into steps such as fetching, decoding and execution. Each step plays a different role and therefore needs its own hardware unit, and dedicating separate units also lets the steps run simultaneously. Essentially, an instruction pipeline! BlackFin is not a superscalar processor, but a certain amount of parallel execution can still happen within an execution unit.

Core Modules connected by internal Buses

If we observe the above architecture, we can broadly classify BlackFin core into two units

ALU + Multipliers + Shifter + Video ALUs
Data Address Generators

Herein lies the key to identifying the instructions which can execute in parallel: broadly, we should be able to execute computational unit operations and load/store operations in parallel.

r1 = extract(r0,r2.l)(z)||r0 = [i0++]||w[i1]=r5.l;

The above is a multi-issue instruction: it combines a 32 bit instruction (extract) with two 16 bit instructions (a load and a store).

The Extract instruction is executed by the Barrel Shifter hardware and the load store instructions are executed by Data address generators. So we have both the modules of the core working in parallel.

Looking at this from a “Load Store” architecture point of view, we can add one more observation. Such parallel execution of a computational operation and load/store operations is possible because the former does not access any memory: all its operands and destination registers are within the core, so there are no data bus accesses. The absence of this bus access is what makes it possible for the core to execute load and store operations in parallel with the ALU/multiplier operations.

r3.h=(a1+=r6.h*r4.l);

The above instruction is neither a multi-issue nor a multi-cycle instruction. We can probably say that if multi-issue instructions use the breadth of the processor, then this instruction uses the depth. I leave it to the reader to guess how this one might be processed.

Multi-cycle instructions are those which take more than one cycle to execute. This is more like a CISC concept: one instruction gets decoded into multiple simple instructions.

r3 *= r0;
[--sp] = (r7:5, p5:0);

The first operation above is a 32 bit multiplication, which cannot be done in a single step because the available multipliers are 16 bits wide. The hardware therefore uses the same 16 bit multipliers and performs more than one multiplication operation.

The second operation is a stack push, this instruction specifies that all the r5 to r7 registers and all the p5 to p0 registers should be pushed to the stack. The hardware will sequentially push all the specified registers one by one and this leads to a multi-cycle operation.

All the multi-cycle operations are decoded many times over and over again!
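To make the 32 bit multiplication case concrete, here is a minimal C sketch of how the low 32 bits of a product can be assembled from 16 bit partial products. The function name `mul32_from_16` is made up for illustration, and this is only the textbook decomposition, not a description of how the BlackFin datapath actually sequences the operation.

```c
#include <stdint.h>

/* Low 32 bits of a 32x32 multiply, built only from 16x16 products.
 * Illustrative decomposition; the real hardware sequencing is not
 * documented here and may differ. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;   /* split a into halves */
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;   /* split b into halves */

    uint32_t low   = al * bl;                  /* 16x16 product 1        */
    uint32_t cross = ah * bl + al * bh;        /* 16x16 products 2 and 3 */

    /* ah * bh would land entirely above bit 31, so it is not needed
     * when only a 32 bit result is kept. */
    return low + (cross << 16);
}
```

Three 16 bit multiplications are enough here, which is consistent with the operation needing more than one pass through a 16 bit multiplier.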

Catch in Circular Buffering

May 21, 2009 | June 6, 2016 | Mahesh Sreekandath

Some of the features which distinguish a DSP processor are:

Multipliers
Video ALUs
Zero Overhead loop support
Circular Buffering support

There was code which used the circular buffering capabilities of StarCore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly. It was also written so that the maximum number of bytes was copied at a time using a MOVE operation. In other words, if the source and destination were aligned at 8 byte boundaries, the assembly used a MOVE.64 instruction, which copies 64 bits at a time.

The pseudo code of the assembly is given below.

memcopy(src,dest,size)

Start of Loop:
if (src%8 == 0) &&(dest%8 == 0)&&(size >=8)
move.64 src,dest (instruction to move 8 bytes)
else if (src%4 == 0) &&(dest%4 == 0)&&(size >=4)
move.32 src,dest (instruction to move 4 bytes)
else if (src%2 == 0) &&(dest%2 == 0)&&(size >=2)
move.16 src,dest (instruction to move 2 bytes)
else move.8 src,dest (byte copy)

The above method adds the overhead of checking the alignment each time, but the clock cycles saved by moving more bytes per move instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8 or at least a multiple of 4.
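The pseudo code can be sketched in C as follows. This is only an illustration of the alignment checks: the name `copy_widest` and the returned move count are my own additions, and `memcpy` of a fixed width stands in for the single MOVE instruction of that width.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies size bytes, picking at each step the widest move (8, 4, 2 or 1
 * bytes) that both pointers are aligned for. Returns the number of moves
 * issued. Name and return value are illustrative only. */
size_t copy_widest(void *dest, const void *src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t moves = 0;

    while (size > 0) {
        /* OR-ing the addresses: the low bits are zero only if both are aligned */
        uintptr_t bits = (uintptr_t)d | (uintptr_t)s;
        size_t step;

        if ((bits % 8 == 0) && size >= 8)      step = 8;  /* move.64 */
        else if ((bits % 4 == 0) && size >= 4) step = 4;  /* move.32 */
        else if ((bits % 2 == 0) && size >= 2) step = 2;  /* move.16 */
        else                                   step = 1;  /* move.8  */

        memcpy(d, s, step);  /* stands in for one move of that width */
        d += step;
        s += step;
        size -= step;
        moves++;
    }
    return moves;
}
```

With both pointers 8 byte aligned, a 16 byte copy takes two moves instead of sixteen, which is where the saving comes from.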

The circular buffering was implemented using the Index, Base, Length and Modifier registers. Let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0, I0 is reset to B0. This way circular buffering is maintained without any additional software overhead of checking bounds.

Let's consider a buffer which has the following attributes.

Size = 8 bytes
Base address = 0x02

Now the writes into the circular buffer of 8 bytes happened in the following order

Write 1 : 2 bytes
Write 2: 4 bytes
Write 3: 4 bytes

Let's see what happens in such a scenario. Below I have given the BlackFin assembly code which does the write.

_main:
/* Initialize the values of the registers for circular buffering */
i0.l = circular_buff; /*Buffer Address – 0x02*/
i0.h = circular_buff; /*Buffer Address – 0x02*/
l0 = 8; /*Length of the buffer */
b0 = i0; /*Base address */

r0.l = 0xFFFF; /*Dummy Value which will */
r0.h = 0xeeee; /*be written into the buffer */

w[i0++] = r0.l;        /*Write 0xFFFF at I0 = 0x02                       */
[i0++] = r0;           /*Write 0xEEEEFFFF at I0 = 0x04                   */
[i0++] = r0;           /*Write 0xEEEEFFFF at I0 = 0x08                   */
/*At this point the overflow has happened */
/*0xEEEE has been written to 0x0A, which is out of bounds for the array circular_buff */

w[i0] = r0.h; /*I0 has properly looped back to 0x04 */

As the comments show, the first write of two bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used locations 0x04 to 0x07.

Now, after the first two writes, we need to write 4 more bytes. The location 0x08 is 4 byte aligned and 4 bytes have to be written, so a single 4 byte move is used. Ideally, 2 bytes should be written to addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.

With the above code this won’t happen: we end up with a 2 byte overflow, while the Index register loops back properly, leaves the first 2 bytes of the buffer unwritten and points to 0x04.

This was seen on the StarCore when the following conditions were satisfied.

Size of the buffer was a multiple of 8 bytes
Start address of the buffer was aligned at 4 byte boundaries, so that the end address of the buffer minus 4 bytes gives an 8 byte aligned address.

When the above conditions were satisfied we ended up with a 4 byte overflow. This is understandable because a load or a store operation consists of the following steps, which all happen in sequence.

Calculate the Load/Store address
Send out the Load/Store address on the address bus
Wait for the required amount of wait cycles before Writing the data onto the Store bus or reading data from the Load bus.

The circular buffering logic is implemented by the Data Address Generator module in the core. The bus protocol which loads or stores a value is not aware of the buffer boundary, so it goes ahead and writes or reads the value from memory, and we get a corruption.

Making sure that the Index register is pointing to a valid address is done by executing a simple formula like

Index-new = Index-old + Modifier – Length;

This happens during one of the later stages in the pipeline (most probably the write back stage or just before it), while the address is generated at a much earlier stage and the data access happens around the same time. All this makes sense: the hardware is behaving correctly when it overflows.

So this is indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned at 8 byte boundaries.
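The overflow can be reproduced with a small software model. The sketch below is a toy, not cycle accurate: `circ_reg` and `circ_store` are hypothetical names, the store is modelled as a plain linear write (as the bus sees it), and only the index update applies the wrap-around formula.

```c
#include <stdint.h>

/* Toy model of an index register with circular post-modify.
 * The store goes out to memory linearly; only the index update wraps. */
typedef struct {
    uint32_t index;   /* I0 */
    uint32_t base;    /* B0 */
    uint32_t length;  /* L0 */
} circ_reg;

void circ_store(uint8_t *mem, circ_reg *r, uint32_t value, uint32_t nbytes)
{
    /* The bus writes nbytes starting at the index, unaware of the buffer end. */
    for (uint32_t i = 0; i < nbytes; i++)
        mem[r->index + i] = (uint8_t)(value >> (8 * i));  /* little-endian */

    /* Post-modify: add the access size, subtract Length on overrun. */
    r->index += nbytes;
    if (r->index >= r->base + r->length)
        r->index -= r->length;
}
```

Running the three writes (2, 4 and 4 bytes) on a buffer with base 0x02 and length 8 leaves the index at 0x04 while bytes 0x0A and 0x0B, which lie outside the buffer, hold 0xEE.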

How do Onchip Breakpoints work?

May 7, 2009 | July 29, 2016 | Mahesh Sreekandath

The internal specifics of on-chip breakpoints on Freescale StarCore processors are easily understood if we are familiar with the emulator. So it’s a good idea to read the following two posts before going through the details below.

Introduction to Debug modules

On Chip Emulator

On Chip Breakpoints:

A descriptive diagram of the whole debug infrastructure is given below.

EDU: Event Detection unit
EC: Event Counter
TB unit: Trace Buffer unit
EEx & EED : General purpose control signals which can be configured for input or output. Usage of this is SOC specific (Derivative specific).

Before we configure the breakpoints, the processor core should be placed in DEBUG mode. DEBUG mode can be invoked using two methods:

Send a Command to the controller
Assert the EEx signal as soon as the core comes out of reset

Using the second method requires configuring registers accordingly. Once the core is in DEBUG mode, breakpoints can be configured via JTAG. Here we are not going to discuss the details of register settings, only a design level overview of how this on-chip emulator works.

The breakpoints/watchpoints should be enabled only after programming the EDU (Event Detection Unit) registers with the configuration which defines that breakpoint, namely the address in memory, the type of the breakpoint etc. These registers can be written via JTAG.

So the two methods employed for enabling an already configured breakpoint/watchpoint are:

Configure the breakpoint/watchpoint enabling register by writing into it using the JTAG port.
Use the EEx or the EED control lines to signal the EOnCE controller to enable or disable the debug functionalities.

Once we program the debug configuration, the host needs to be informed when an event happened or when a breakpoint got hit. EOnCE will ensure that the core enters DEBUG mode by sending it a signal but for informing the HOST we need a mechanism like an interrupt.

We can configure one out of the EEx signals to send an interrupt to the host once the core enters Debug mode.

From the above discussion it’s clear that breakpoints are supported in hardware via setting of few registers and they can be configured via EOnCE which in turn responds to JTAG interface. If there are multiple break points configured then there are mechanisms to identify which one was hit.

Read the EOnCE controller status or monitor registers which will tell you which event has got hit
Configure EEx or EED signal to assert an interrupt to the HOST, depending on which signal has asserted the interrupt, HOST can know which break point has got hit.
Read the PC Breakpoint detection register from EOnCE, this gives the PC of the address which caused the event.

The PC Breakpoint detection register should be used in combination with reading the EOnCE status register, because debug mode can be caused by other reasons also. One important point to note here is that EOnCE gives many options for the usage of the EEx and EED signals, but the way in which they are used is SOC specific and each platform may use them differently.

Similarly, if the Event Counter has to count “off core” events like cache hits or misses, then the external signals EC0 and EC1 need to be used; this is also SOC dependent.

Hardware Design

Conclusion

It’s obvious that onchip breakpoints need support in hardware, but we could still use some more clarity on how this support is provided.

EOnCE is closely integrated with the processor core because it needs to probe address and data buses for certain reference values configured in breakpoint/watchpoint registers.
Whenever processor core accesses an address, EOnCE comes to know about this address value and it compares it with the reference value. If they are same then, depending on the Event selector configuration an event will be raised.
The event can be a DEBUG exception, DEBUG signal to the processor, triggering a trace, disabling a trace etc.

Similar logic can be employed for probing data values also. In the above diagram we can see two comparators, A & B, one for each of the two address buses used for data access. If we do not know on which bus the data we want will be accessed, we need to configure both Comparator A & B with the reference value and then set the condition to an OR of the comparisons, so that an event is raised even if only one comparison succeeds.

MASK register: the way it works is pretty self-explanatory. The sampled address bus value is masked with the MASK register value and then compared with the reference value.
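The masked comparison and the OR-ing of comparators A and B can be sketched in C as below. The struct and function names are hypothetical and do not correspond to actual EOnCE register names; the sketch assumes the reference value has zeros wherever the mask has zeros.

```c
#include <stdbool.h>
#include <stdint.h>

/* One address comparator: the sampled bus address is ANDed with the mask
 * and then compared with the reference value. Names are illustrative. */
typedef struct {
    uint32_t reference;
    uint32_t mask;
} addr_comparator;

bool comparator_hit(const addr_comparator *c, uint32_t sampled)
{
    return (sampled & c->mask) == c->reference;
}

/* Event condition configured as "A OR B": a hit on either data address
 * bus raises the event. */
bool edu_event(const addr_comparator *a, uint32_t bus_a_addr,
               const addr_comparator *b, uint32_t bus_b_addr)
{
    return comparator_hit(a, bus_a_addr) || comparator_hit(b, bus_b_addr);
}
```

For example, a reference of 0x1000 with a mask of 0xFFFFFFF0 matches any access in the 16 byte range 0x1000 to 0x100F, whichever of the two buses carries it.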

Trace Buffer Unit can detect and record program flow related details and send it to an off-core module like Nexus, from Nexus it can be sent to the host using a Nexus port on the board or it can be saved in a circular buffer inside any On-Chip memory.

EOnCE: The onchip emulator

May 7, 2009 | July 29, 2016 | Mahesh Sreekandath

EOnCE comprises 6 core components.

• EOnCE Controller
• Event Counter
• Event Detection Unit (EDU)
• Synchronizer
• Event Selector
• Trace Unit

EOnCE Controller:

Without this controller, JTAG won’t be able to access or program EOnCE. Using this interface we can put the core into debug mode and also write into EOnCE registers; basically, this is the JTAG gateway into EOnCE.

Event Counter

Counters can count various events! These events can be configured by setting its registers. For example, it can be the number of times a watchpoint was hit or the number of instructions executed, or any external events like L1 cache hits, cache miss etc.

Event Detection Unit (EDU)

The EDU forms the core logic of EOnCE. This module can be configured to set breakpoints and watchpoints, or to monitor the address on the data bus etc. The EDU can also send a signal to the Event Counter so that it can count the number of times an address was referenced or data was accessed etc.

Synchronizer

This module is required to synchronize external signals with the internal clock. There are a set of general purpose signals which can be programmed to do various operations. For example, we can configure the General purpose signal EE0 to configure the EDU whenever it is asserted through JTAG.

Event Selector

This is the last unit in the debugging chain, in other words this decides what action needs to be taken when an event is generated by EDU or Event Counter. The action can be something like putting the Core to the debug mode, or it can be starting a program trace, or raising of a debug exception etc.

Trace Unit

As the name suggests, this is the hardware which monitors the program counter and generates the call trace.

The picture below depicts the EOnCE controller and its component modules.

Debug Modules : Introduction

May 4, 2009 | July 29, 2016 | Mahesh Sreekandath

Working with on-chip debug IP modules was a definite learning experience. It included writing low level software for setting breakpoints and watchpoints, configuring call trace, and maintaining a hardware dependent, high performance data logging module. Such debug modules rarely get the appreciation they deserve, even though a considerable amount of chip space is dedicated to the hardware IPs which enable us to develop such complex embedded software.

Each processor comes with its own unique way of implementing debug modules; my understanding is based on experience with StarCore. Let's look at the features of the on-chip emulator, which provides the infrastructure for all the debug activities. Below we can see the diagram of a typical JTAG based debugging setup.

EOnCE(Enhanced Onchip Emulator)

EOnCE is the debug module which adds all the above mentioned capabilities to StarCore based platforms. There are various points which we need to note about EOnCE.

EOnCE is accessible using JTAG as well as from the core by addressing its memory mapped registers.
Simultaneous access of EOnCE from JTAG and the core results in the JTAG access being prioritized over the StarCore access.
JTAG writes into EOnCE registers using the JTAG port and configures it for debugging related features.
When we set a breakpoint or configure a trace using the debugger on the host PC, a message is sent to the EOnCE module on the target; this message makes sure that our debug configurations are applied on the chip.

Downloading binary using JTAG!

This is one of the most obvious reasons to use a JTAG debugger: we would never deliver anything on time if we had to flash the binary each time we compile and test it. Every module on the chip has a controller (TAP controller) associated with it which makes it JTAG compliant. In other words, this TAP controller connects the JTAG port to the hardware module.

TAP interfaced with the System JTAG Controller

In the above picture we can see the DMA module with its TAP controller interfaced with the System JTAG Controller (SJC), which is the main JTAG controller. As illustrated, every peripheral and core logic block inside a chip has a TAP controller, and this is accessed and configured through the SJC. Now let's see what steps are involved in downloading code into the target RAM.

First, we need to place the processor (StarCore) in DEBUG mode; for that we send the DEBUG command to the SJC.
The next step is to access the EOnCE: execute the command that selects the EOnCE TAP controller through the SJC. From this point on we can directly control EOnCE using JTAG.
The EOnCE module can make the StarCore execute commands like MOVE and some of the arithmetic operations. This feature of EOnCE is used to transfer the program data into memory.
Now send the program data to the EOnCE receive register. This program data then has to be moved to StarCore accessible memory, for which we need to execute a MOVE instruction.
Every command which needs to be issued to the StarCore through JTAG has to be written into a specific EOnCE command register.
Now write a MOVE instruction into the core command register to transfer the program data from the receive register to any core register (because we can write data into memory only from a core register, not from any EOnCE register).
Now send a core command to move the data from the core register to the memory location. This completes one transfer.
The above steps are repeated till the whole program code is written into the memory.
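The sequence above can be walked through with a purely software stand-in for the target. Everything in this sketch is hypothetical: `mock_target`, `core_command` and `download` are names invented for illustration, and nothing here is a real StarCore, EOnCE or JTAG API. It only mirrors the order of the steps: enter DEBUG mode, select the EOnCE TAP, then for each word go receive register, core register, memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for the target; all names are hypothetical. */
enum { MEM_WORDS = 64 };

typedef struct {
    int      debug_mode;      /* set by the DEBUG command sent to the SJC */
    int      eonce_selected;  /* set once the EOnCE TAP is selected       */
    uint32_t receive_reg;     /* EOnCE receive register                   */
    uint32_t core_reg;        /* a core register used for staging         */
    uint32_t memory[MEM_WORDS];
} mock_target;

typedef enum { CMD_MOVE_RX_TO_REG, CMD_MOVE_REG_TO_MEM } core_cmd;

/* Models writing one instruction into the core command register. */
static int core_command(mock_target *t, core_cmd cmd, size_t addr)
{
    if (!t->debug_mode || !t->eonce_selected)
        return -1;                        /* core not under JTAG control */
    if (cmd == CMD_MOVE_RX_TO_REG) {
        t->core_reg = t->receive_reg;     /* receive register -> core register */
    } else {
        if (addr >= MEM_WORDS)
            return -1;                    /* outside the modelled memory */
        t->memory[addr] = t->core_reg;    /* core register -> memory */
    }
    return 0;
}

/* Downloads count words to word address addr; returns 0 on success. */
int download(mock_target *t, size_t addr, const uint32_t *image, size_t count)
{
    t->debug_mode = 1;       /* DEBUG command to the SJC      */
    t->eonce_selected = 1;   /* select EOnCE TAP through SJC  */

    for (size_t i = 0; i < count; i++) {
        t->receive_reg = image[i];        /* shift data in over JTAG */
        if (core_command(t, CMD_MOVE_RX_TO_REG, 0) != 0)
            return -1;
        if (core_command(t, CMD_MOVE_REG_TO_MEM, addr + i) != 0)
            return -1;
    }
    return 0;
}
```

Each word costs one data transfer plus two core commands, which is why the real sequence is slow compared to a direct memory write.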

Now that makes up the rather elaborate sequence for downloading contents into the RAM.

Useful Links

May 3, 2009 | September 6, 2011 | Mahesh Sreekandath

Sequential Circuits

A lucid article on how Edge triggered and level triggered devices are designed.

JTAG on Wiki JTAG Link 1 JTAG Link 2 JTAG Link 3 JTAG Link 4 JTAG Link5

JTAG is explained very clearly in many places. I had to go through all the links to get a complete picture.

Blackfin Core

April 27, 2009 | February 20, 2016 | Mahesh Sreekandath

This picture is right out of the BlackFin HRM.

Without repeating what is already covered in the HRM let us try to analyse the capabilities of the Core from a different perspective.

Load Store Architecture means

All the operations are done on the registers
That would in turn mean that all the computational units in the core need their inputs from the registers
This mandates that there has to be internal buses which connect the computational units to the registers.
A set of buses as the input and another set which transmits the output back to the registers.

The Diagram Below depicts exactly what we need.

This is quite fascinating, we can exactly know the reason why some instructions work while others do not.

Consider the Operation :

R0 = R1+ R2;

How does this work? There are two 32 bit buses running from the Data register file to ALU0. The values inside R1 & R2 are transferred through them into ALU0, and once the operation is done the result comes back out and gets written into R0. Perfect!

Now, why does the following operation not work?

P0 = R0 + R1;

The answer lies again in the bus architecture. Even though R0 and R1 can be transferred to ALU0, we do not have a bus running from ALU0 to the Pointer Register File, hence there is no way that this operation can succeed.

*Please note that by default ALU0 will be used and ALU 1 will come into picture only in case of parallel operations, which can be discussed in another post.

The idea here is to prod us to start thinking from this perspective. When we understand the bus architecture, we understand how to write the best possible assembly code while remaining within the constraints of the system. All these data and instruction buses connecting memory, registers & computational units form the backbone of an architecture. This inherently determines the strengths as well as the weaknesses of the processor.

Load Store Continued

April 26, 2009 | June 11, 2016 | Mahesh Sreekandath

The LS (Load Store) story has some more twists which I did not mention in the previous post, simply because it had already become way too long.

Stack Memory?

RISC machines always come with a sufficient number of registers, which results in all the local variables being stored in registers, with hardly any stack memory usage. Example code illustrating this phenomenon is given below.

Consider the below given C Code:

int add_on(int a, int b)
{
    int i = 0;
    for (i = 0; i < 10; i++)
    {
        a = a + b;
    }
    return a;
}

The corresponding BlackFin assembly code does not use a single stack variable. The register P0 and the loop counter register (LC0) serve the purpose of the stack variable “i”.

_add_on:
P0 = 10;
/* Initialize the Zero Overhead Loop registers */
Lsetup(Loop_Starts,Loop_Ends) LC0 = P0;
/* Start of the loop */
Loop_Starts: Loop_Ends:
R0 = R0 + r1;
RTS;
_add_on.end:

Zero What! Overhead Loop?

Compilers also reduce the overhead of context switching between function calls. How? Between calls, not all the registers need to be pushed onto the stack; only those registers which are used inside that particular function are saved and restored. The rest remain untouched anyway, so the costly affair of stack memory usage and slow memory access is minimized. The result is lean stack usage. I have seen this feature on BlackFin compilers but am yet to see it on StarCore!

Bogus Store

Within the load-store structure, most of the time a store follows the load. But even though in the code we might see a store written before another memory read, we cannot guarantee the sequence in which they execute. Let me quote the lucid BlackFin HRM.

“The relaxation of synchronization between memory access instructions
and their surrounding instructions is referred to as weak ordering of loads
and stores. Weak ordering implies that the timing of the actual completion
of the memory operations—even the order in which these events
occur—may not align with how they appear in the sequence of the program
source code.”

“Because of weak ordering, the memory system is allowed to prioritize
reads over writes. In this case, a write that is queued anywhere in the pipeline, but not completed, may be deferred by a subsequent read operation,and the read is allowed to be completed before the write. Reads are prioritized over writes because the read operation has a dependent operation
waiting on its completion, whereas the processor considers the write operation complete, and the write does not stall the pipeline if it takes more
cycles to propagate the value out to memory. This behavior could cause a
read that occurs in the program source code after a write in the program
flow to actually return its value before the write has been completed.
This ordering provides significant performance advantages in the operation
of most memory instructions. However, it can cause side effects that
the programmer must be aware of to avoid improper system operation.”

When we get optimizations, we also get trade-offs. But let us keep the price limited to what we pay for in the die space, and not write buggy code and learn the hard way.

The side effect mentioned above comes into the picture when we are configuring registers or any similar memory on which a read after a write might give different results compared to a read before the write. Consider the following sequence:

1. Write Register 1

2. Read Register 2

What if the value of Register 2 depends on the Register 1 write? In such situations we need to ensure strict ordering of reads and writes by using the SYNC instructions (CSYNC & SSYNC).

Black Fin Jargons

April 26, 2009 | February 20, 2016 | Mahesh Sreekandath

This post is not of much use on its own, but this is meant to clarify some of the technical jargon mentioned in the other posts and this will be updated as and when.

Multicycle Operations-

Quote from the BlackFin HRM:

“Multi-cycle instructions behave as multiple single-cycle instructions being
issued from the decoder over several clock cycles. For example, the Push
Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS
and/or PREGS, and the instruction remains in the decode stage for a number
of clock cycles equal to the number of registers being accessed.”

In other words, one single instruction will be decoded and executed multiple times.
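Taking the HRM statement at face value, the time a Push Multiple or Pop Multiple spends in decode can be estimated by counting the registers it names. The helper below is a hypothetical illustration (the name and the -1 convention are mine); it relies on the register list always running down from R7 and from P5, as in (r7:5, p5:0).

```c
/* Number of registers named by a push/pop multiple such as (r7:5, p5:0).
 * Pass -1 for a register file that does not appear in the instruction.
 * Per the HRM quote, this equals the cycles spent in the decode stage. */
int push_multiple_decode_cycles(int lowest_dreg, int lowest_preg)
{
    int dregs = (lowest_dreg < 0) ? 0 : 7 - lowest_dreg + 1;  /* R7 down to Rn */
    int pregs = (lowest_preg < 0) ? 0 : 5 - lowest_preg + 1;  /* P5 down to Pn */
    return dregs + pregs;
}
```

For (r7:5, p5:0) this gives 3 + 6 = 9, and the largest possible list, (r7:0, p5:0), gives the 14 registers mentioned in the quote.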

Zero-Overhead loop

Very much self-explanatory. Whenever we have a loop in code, we always have two types of overhead:

> The overhead of incrementing and checking a counter which decides the number of iterations of the loop
> Pipeline stalls because of branch mis-prediction at the end of an iteration.

Both the above bottlenecks are eliminated by making hardware aware of the existence of the loop by specifying the starting point and the ending point and the number of iterations which needs to be executed.

Below I have given memcpy assembly code which uses zero overhead looping.

/* Set up the loop which runs from the label Loop_starts till the label Loop_ends for P0 times */

Lsetup(Loop_starts,Loop_ends)LC0 = p0;
Loop_starts:
r4 = b[p1++](z); /* Read from the source */
Loop_ends:
b[p2++] = r4; /* Write to the destination */

Zero Overhead looping is implemented by a hardware module called sequencer.

Why have a Load Store architecture?

April 26, 2009 | August 10, 2017 | Mahesh Sreekandath

Typically a RISC architecture is built on the fundamentals of Load Store methodology. There are a plethora of internet sources for understanding Load-Store, this link should definitely be helpful.

Considerable study has gone into the RISC architecture. Its instructions are relatively simple and do simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the Hardware Reference Manual of BlackFin.

“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.

Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”

This theoretical possibility can be illustrated by the example below. Here we saved 4 processor clock cycles by using the above mentioned logic of interleaving the load operations with an unrelated stack operation.

Initial Code BEFORE inserting unrelated operations between Loads:

(r7:2) = [ sp ++ ];/* STACK POP (Unrelated operation)*/
(p5:4) = [ sp ++ ];/* STACK POP (Unrelated operation)*/
/* MOVE the address of the variables */
/* which needs to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
R1 = W[p1](Z); /* LOAD Operation 2 */

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

The above code will behave “almost” like code which runs on a CISC machine. Now let's look at the code after interleaving.

Code After Interleaving STACK operation with LOAD operation:

/* MOVE the address of the variables */
/* which needs to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R1 = W[p1](Z); /* LOAD Operation 2 */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

Now we know ONE of the reasons for the out of order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how we ended up saving 4 clock cycles.

For the above code snippet the following conditions are satisfied.

> Code and Data on the off chip SDRAM
> LOAD operation consumes 8 clock cycles within the execution stage of pipeline.
> The interleaved STACK POP operation is a “Multicycle” operation and it will spend multiple cycles at the decoder stage of the pipeline.
> BlackFin does not execute an instruction until it has decoded it, and it executes as soon as it decodes. Hence the stalling of the “load” operation at the execution stage and the stalling of the stack operation at the decoding stage happen in parallel. The overall stall count therefore reduces, even though the number of stalls for each instruction remains the same.

Initial Pipeline Execution BEFORE interleaving:

STAGES	Fetch	Decode	Execute	Write Back
CYCLE1	Stack POP	-	-	-
CYCLE2	p0.l =_oneByte;	[ sp ++ ]	-	-
CYCLE3	p0.l =_oneByte;	[ sp ++ ]	-	-
CYCLE4	p0.l =_oneByte;	[ sp ++ ]	-	-
CYCLE5	p0.h =_oneByte;	p0.l=_oneByte;	[ sp ++ ]	-
CYCLE6	p1.l=_twoBytes;	p0.h=_oneByte;	p0.l=_oneByte;	[ sp ++ ]

The blocks marked in RED are pipeline stalls. Here the operation p0.l =_oneByte; stalled because the decoder was still decoding the previous stack pop operation; in the normal scenario it would have completed in a single cycle.

Pipeline Execution After Interleaving:

STAGES	Fetch	Decode	Execute	Write Back
CYCLE1	Stack POP	*R0 = B[p0](Z);*	*p0.h=_oneByte;*	*p0.l=_oneByte;*
CYCLE2	*R1 = W[p1](Z);*	[ sp ++ ]	*R0 = B[p0](Z);*	*p0.h=_oneByte;*
CYCLE3	*R1 = W[p1](Z);*	[ sp ++ ]	*R0 = B[p0](Z);*	BUBBLE
CYCLE4	*R1 = W[p1](Z);*	[ sp ++ ]	*R0 = B[p0](Z);*	BUBBLE
CYCLE5	p0.h =_oneByte	[ sp ++ ]	*R0 = B[p0](Z);*	BUBBLE
CYCLE6	p0.h=_oneByte;	[ sp ++ ]	*R0 = B[p0](Z);*	BUBBLE

*Please note that a 4 stage pipeline is described here just for the sake of simplicity; BlackFin actually has 10 stages.

I would state that the above methodology made optimal usage of the hardware resources. Both the decoder unit and the execution unit are utilized at the same time, while in the previous case the stalls within these units were happening separately and hence added up. Reordering simply made them happen at the same time.
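The "added up" versus "at the same time" argument can be written down as a deliberately simplified model for one decode-stalled instruction paired with one execute-stalled instruction. The function names are hypothetical, and the model ignores everything else in the pipeline, so it should not be read as a reproduction of the measured figures.

```c
/* Stall cycles seen by the program when the decode stall and the execute
 * stall happen one after the other (no interleaving). */
unsigned stalls_back_to_back(unsigned decode_stall, unsigned execute_stall)
{
    return decode_stall + execute_stall;
}

/* Stall cycles when both units stall during the same cycles: the longer
 * of the two stalls hides the shorter one. */
unsigned stalls_overlapped(unsigned decode_stall, unsigned execute_stall)
{
    return decode_stall > execute_stall ? decode_stall : execute_stall;
}
```

In this model the saving is always the shorter of the two stalls, so with illustrative values of a 3 cycle decode stall and an 8 cycle load stall, 11 stall cycles shrink to 8.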