Debug Modules: Introduction

Working with on-chip debug IP modules was a definite learning experience. It involved writing low-level software for setting breakpoints and watchpoints, configuring call trace, and maintaining a hardware-dependent, high-performance data-logging module. Such debug modules rarely get the appreciation they deserve; a considerable amount of chip area is dedicated to the hardware IPs which enable us to develop such complex embedded software.

Each processor comes with its own unique way of implementing debug modules; my understanding is based on experience with StarCore. Let's look at the features of the on-chip emulator which provides the infrastructure for all the debug activities. Below we can see the diagram of a typical JTAG-based debugging setup.

[Figure: JTAG-based debugging setup]

EOnCE (Enhanced On-Chip Emulator)

EOnCE is the debug module which adds all the above-mentioned capabilities to StarCore-based platforms. There are various points which we need to note about EOnCE.

  • EOnCE is accessible over JTAG as well as from the core, by addressing its memory-mapped registers.
  • On simultaneous access of EOnCE from JTAG and the core, the JTAG access gets prioritized over the StarCore access.
  • The debugger writes into EOnCE registers through the JTAG port and thereby configures the debugging-related features.
  • When we set a breakpoint or configure a trace using the debugger on the host PC, a message is sent to the EOnCE module on the target; this message ensures that our debug configuration is applied on the chip.
[Figure: EOnCE framework]
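Since EOnCE is memory mapped, the core itself can program the debug registers from software. Below is a minimal C sketch of that idea; the base address, offsets and register names are hypothetical placeholders, not the actual StarCore memory map.

#include <stdint.h>

/* Hypothetical EOnCE addresses, purely for illustration */
#define EONCE_BASE    0xFFFE0000u
#define EONCE_STATUS  (*(volatile uint32_t *)(EONCE_BASE + 0x00))
#define EONCE_CTRL    (*(volatile uint32_t *)(EONCE_BASE + 0x04))

/* Enable a debug feature from the core side and read back the status.
 * A concurrent JTAG access to the same registers would win, as noted above. */
static uint32_t eonce_enable(uint32_t feature_bits)
{
    EONCE_CTRL |= feature_bits;   /* core-side write to a debug register */
    return EONCE_STATUS;          /* confirm the configuration took effect */
}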

 

Downloading a binary using JTAG!

This is one of the most obvious reasons to use a JTAG debugger; we could pretty much never deliver anything on time if we had to flash the binary each time we compile and test it. Every module on the chip has a controller (TAP controller) associated with it which makes it JTAG compliant; in other words, this very TAP controller connects the JTAG port to the hardware module.

[Figure: TAP interfaced with the System JTAG Controller]

In the above picture we can see the DMA module with a TAP controller, interfaced with the System JTAG Controller (SJC), which is the main JTAG controller. As illustrated, every peripheral and every piece of core logic inside the chip has a TAP controller, and each one is accessed and configured through the SJC. Now let's see what steps are involved in downloading code into the target RAM.

  • First we need to place the processor (StarCore) in DEBUG mode; for that we send the DEBUG command to the SJC.
  • Next, to access the EOnCE, execute the command to select the EOnCE TAP controller through the SJC. From this point on we can directly control the EOnCE using JTAG.
  • The EOnCE module has the capability to make the StarCore execute instructions like MOVE and some of the arithmetic operations; this feature is used to transfer the program data into memory.
  • Send the program data to the EOnCE receive register; we then need to move this program data into StarCore-accessible memory, which requires executing a MOVE instruction.
  • Every command which needs to be issued to the StarCore through JTAG has to be written into a specific EOnCE command register.
  • Write a MOVE instruction into the core command register to transfer the program data from the receive register to a core register (because we can write data into memory only from a core register, not from an EOnCE register).
  • Send another core command to move the data from the core register to the memory location. This completes one transfer.
  • The above steps are repeated till the whole program code is written into memory.

Now that makes up the rather elaborate sequence for downloading contents into the RAM; a rough sketch of the whole loop is given below.
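The C sketch below just restates the sequence above in code. All of the helper functions, register identifiers and opcodes (jtag_send_command, EONCE_RX and friends) are hypothetical placeholders standing in for the actual SJC/EOnCE encodings.

#include <stdint.h>

/* Hypothetical JTAG primitives; in reality the debugger/probe supplies these */
extern void jtag_send_command(uint32_t sjc_command);
extern void jtag_select_tap(uint32_t tap_id);
extern void jtag_write_reg(uint32_t eonce_reg, uint32_t value);
extern uint32_t op_move_reg_to_mem(uint32_t address);  /* builds a MOVE opcode */

enum {  /* made-up identifiers, for illustration only */
    SJC_CMD_DEBUG = 0x01, TAP_EONCE = 0x02,
    EONCE_RX = 0x10, EONCE_CORE_CMD = 0x11,
    OP_MOVE_RX_TO_CORE_REG = 0x20
};

void download_to_ram(const uint32_t *image, uint32_t words, uint32_t dest)
{
    jtag_send_command(SJC_CMD_DEBUG);    /* place StarCore in DEBUG mode */
    jtag_select_tap(TAP_EONCE);          /* select the EOnCE TAP through the SJC */

    for (uint32_t i = 0; i < words; i++) {
        /* one word of program data into the EOnCE receive register */
        jtag_write_reg(EONCE_RX, image[i]);
        /* core command: MOVE receive register -> core register */
        jtag_write_reg(EONCE_CORE_CMD, OP_MOVE_RX_TO_CORE_REG);
        /* core command: MOVE core register -> destination memory word */
        jtag_write_reg(EONCE_CORE_CMD, op_move_reg_to_mem(dest + 4 * i));
    }
}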


Blackfin Core

This picture is right out of the BlackFin HRM.

[Figure: BlackFin core block diagram]

Without repeating what is already covered in the HRM, let us try to analyse the capabilities of the core from a different perspective.

Load Store Architecture means:

  • All the operations are done on registers.
  • That in turn means that all the computational units in the core need their inputs from the registers.
  • This mandates that there be internal buses which connect the computational units to the registers.
  • One set of buses acts as the input and another set carries the output back to the registers.

The diagram below depicts exactly what we need.

[Figure: Core modules connected by internal buses]

This is quite fascinating: we can now see exactly why some instructions work while others do not.

Consider the operation:

R0 = R1+ R2;

How does this work? There are two 32-bit buses running from the data register file to ALU0; the values inside R1 and R2 are transferred through them into ALU0, and once the operation is done the result comes back and gets written into R0. Perfect!

Now why does the following operation not work?

P0 = R0 + R1;

The answer lies again in the bus architecture. Even though R0 and R1 can be transferred to ALU0, we do not have a bus running from ALU0 to the pointer register file, hence there is no way this operation can succeed.

*Please note that by default ALU0 will be used; ALU1 comes into the picture only in the case of parallel operations, which can be discussed in another post.
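If the result genuinely has to end up in a pointer register, the bus constraint simply forces a two-step sequence. A sketch of the workaround: the add stays within the data register file, and a separate register-to-register move follows.

R2 = R0 + R1;    /* add happens in ALU0; the result lands back in the data register file */
P0 = R2;         /* separate move from a data register into the pointer register file */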

The idea here is to prod us into thinking from this perspective. When we understand the bus architecture, we understand how to write the best possible assembly code while remaining within the constraints of the system. All these data and instruction buses connecting memory, registers and computational units form the backbone of an architecture, and they inherently determine the strengths as well as the weaknesses of the processor.

Load Store Continued

The LS (Load Store) story has some more twists which I did not mention in the previous post, simply because it had already become way too long.

Stack Memory?

RISC machines always come with a sufficient number of registers, which in turn means that all the local variables get stored in registers themselves, with hardly any stack memory usage. Example code illustrating this phenomenon is given below.

Consider the following C code:


int add_on(int a, int b)
{
    int i = 0;

    for (i = 0; i < 10; i++) {
        a = a + b;
    }
    return a;
}

The corresponding BlackFin assembly code does not use a single stack variable. The register P0 and the loop counter register (LC0) take over the role of the stack variable “i”.

_add_on:
    P0 = 10;
    /* Initialize the zero-overhead loop registers */
    LSETUP(Loop_Starts, Loop_Ends) LC0 = P0;
    /* Single-instruction loop body */
Loop_Starts:
Loop_Ends:
    R0 = R0 + R1;
    RTS;
_add_on.end:

Zero What! Overhead Loop?

Compilers also reduce the overhead of context switching between function calls. How? Ideally, between calls, not all the registers need to be pushed onto the stack; only those registers which are used inside that particular function are saved and restored, because the rest anyway remain untouched, and the costly affair of stack memory usage and slow memory access is minimized. The result is lean stack usage; I have seen this feature in BlackFin compilers, but am yet to see it on StarCore!
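As a hypothetical illustration, a leaf function that clobbers only R7 would get a prologue and epilogue like the one sketched below, instead of a full push multiple of every callee-saved register:

_leaf_fn:
    [--sp] = r7;    /* save only the single callee-saved register we touch */
    /* ... function body using r7 ... */
    r7 = [sp++];    /* restore it on the way out */
    RTS;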

Bogus Store

Within the load-store structure, most of the time the store follows the load. But sometimes, even though in the code we might see a store written before another memory read, we really cannot guarantee the sequence in which they will execute. Let me quote the lucid BlackFin HRM.

“The relaxation of synchronization between memory access instructions and their surrounding instructions is referred to as weak ordering of loads and stores. Weak ordering implies that the timing of the actual completion of the memory operations—even the order in which these events occur—may not align with how they appear in the sequence of the program source code.”

“Because of weak ordering, the memory system is allowed to prioritize reads over writes. In this case, a write that is queued anywhere in the pipeline, but not completed, may be deferred by a subsequent read operation, and the read is allowed to be completed before the write. Reads are prioritized over writes because the read operation has a dependent operation waiting on its completion, whereas the processor considers the write operation complete, and the write does not stall the pipeline if it takes more cycles to propagate the value out to memory. This behavior could cause a read that occurs in the program source code after a write in the program flow to actually return its value before the write has been completed. This ordering provides significant performance advantages in the operation of most memory instructions. However, it can cause side effects that the programmer must be aware of to avoid improper system operation.”

When we get optimizations, we also get trade-offs. But let us keep the price limited to what we pay in die space, and not write buggy code and learn the hard way.

The side effect mentioned above comes into the picture when we are configuring registers, or any such memory, where a read after a write might give different results compared to a read before the write. Consider the following sequence:

1. Write Register 1

2. Read Register 2

What if the Register 2 value depends on the Register 1 write? In such situations we need to ensure strict ordering of reads and writes by using the sync instructions (CSYNC & SSYNC).
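A minimal sketch in BlackFin assembly, following the same _symbol convention as the earlier snippets; _reg1 and _reg2 are hypothetical memory-mapped registers where the second read depends on the first write:

p0.l = _reg1;    /* address of Register 1 (hypothetical MMR) */
p0.h = _reg1;
p1.l = _reg2;    /* address of Register 2, whose value depends on Register 1 */
p1.h = _reg2;

[p0] = R0;       /* 1. write Register 1 */
SSYNC;           /* force the write all the way out before continuing */
R1 = [p1];       /* 2. read Register 2, now guaranteed to observe the write */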


BlackFin Jargon

This post is not of much use on its own; it is meant to clarify some of the technical jargon mentioned in the other posts, and it will be updated as and when needed.

Multicycle Operations

Quote from the BlackFin HRM:

“Multi-cycle instructions behave as multiple single-cycle instructions being issued from the decoder over several clock cycles. For example, the Push Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS and/or PREGS, and the instruction remains in the decode stage for a number of clock cycles equal to the number of registers being accessed.”

In other words, one single instruction behaves like a burst of simple instructions, occupying the decode stage for several cycles while being executed piece by piece.
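For instance, the single Pop Multiple below (syntax as in the HRM examples) restores four data registers and three pointer registers, and sits in the decode stage for roughly one cycle per register restored:

(r7:4, p5:3) = [sp++];    /* one instruction, seven register restores */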

Zero-Overhead Loop

Very much self-explanatory: whenever we have a loop in code, we always have two types of overhead,

  • the overhead of incrementing and checking a counter which decides the number of iterations of the loop, and
  • pipeline stalls because of branch mis-prediction at the end of an iteration.

Both of the above bottlenecks are eliminated by making the hardware aware of the loop: we specify the starting point, the ending point and the number of iterations to be executed.

Below I have given memcpy assembly code which uses zero-overhead looping.

/* Set up the loop which runs from the label Loop_starts
   to the label Loop_ends for P0 iterations */
LSETUP(Loop_starts, Loop_ends) LC0 = P0;
Loop_starts:
    r4 = b[p1++](z);    /* read from the source */
Loop_ends:
    b[p2++] = r4;       /* write to the destination */

Zero-overhead looping is implemented by a hardware module called the sequencer.


Why have a Load Store architecture?

Typically a RISC architecture is built on the fundamentals of the load-store methodology. There is a plethora of internet sources for understanding load-store; this link should definitely be helpful.

Considerable study has gone into the RISC architecture. Its instructions are relatively simple and do simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the BlackFin Hardware Reference Manual.

“The Blackfin processor architecture supports the RISC concept of a Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.

Separating load operations from their associated arithmetic functions allows compilers or assembly language programmers to place unrelated instructions between the load and its dependent instructions. If the value is returned before the dependent operation reaches the execution stage of the pipeline, the operation completes in one cycle.”

This theoretical possibility is illustrated by the example below, where we save 4 processor clock cycles by interleaving the load operations with unrelated stack operations.

Initial code BEFORE inserting unrelated operations between the loads:

(r7:2) = [sp++];     /* STACK POP (unrelated operation) */
(p5:4) = [sp++];     /* STACK POP (unrelated operation) */

/* Move in the addresses of the variables to be operated on */
p0.l = _oneByte;     /* Variable 1 */
p0.h = _oneByte;     /* Variable 1 */

p1.l = _twoBytes;    /* Variable 2 */
p1.h = _twoBytes;    /* Variable 2 */

R0 = B[p0](Z);       /* LOAD operation 1 */
R1 = W[p1](Z);       /* LOAD operation 2 */

R0 = R0 + R1;        /* actual add operation */

B[P0] = R0;          /* STORE operation */

The above code will behave “almost” like code running on a CISC machine; now let's look at the code after interleaving.

Code AFTER interleaving the STACK operations with the LOAD operations:

/* Move in the addresses of the variables to be operated on */
p0.l = _oneByte;     /* Variable 1 */
p0.h = _oneByte;     /* Variable 1 */

p1.l = _twoBytes;    /* Variable 2 */
p1.h = _twoBytes;    /* Variable 2 */

R0 = B[p0](Z);       /* LOAD operation 1 */
(r7:2) = [sp++];     /* STACK POP (unrelated operation) */

R1 = W[p1](Z);       /* LOAD operation 2 */
(p5:4) = [sp++];     /* STACK POP (unrelated operation) */

R0 = R0 + R1;        /* actual add operation */

B[P0] = R0;          /* STORE operation */

Now we know ONE of the reasons for the out-of-order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how we ended up saving 4 clock cycles.

For the above code snippet, the following conditions are satisfied:

  • Code and data reside in the off-chip SDRAM.
  • A LOAD operation consumes 8 clock cycles within the execution stage of the pipeline.
  • The interleaved STACK POP operation is a “multicycle” operation and spends multiple cycles at the decode stage of the pipeline.
  • BlackFin does not execute an instruction until it has decoded it, and it executes as soon as it decodes. Hence the stalling of the load at the execution stage and the stalling of the stack operation at the decode stage happen in parallel, because of which the overall stall count appears to shrink, even though the number of stalls for each individual instruction remains the same.

Initial Pipeline Execution BEFORE interleaving:

STAGES   Fetch                  Decode              Execute             Write Back
CYCLE1   Stack POP              ------              ------              ------
CYCLE2   p0.l = _oneByte;       [sp++]              ------              ------
CYCLE3   p0.l = _oneByte; (*)   [sp++]              ------              ------
CYCLE4   p0.l = _oneByte; (*)   [sp++]              ------              ------
CYCLE5   p0.h = _oneByte;       p0.l = _oneByte;    [sp++]              ------
CYCLE6   p1.l = _twoBytes;      p0.h = _oneByte;    p0.l = _oneByte;    [sp++]

The entries marked with (*) are pipeline stalls; here the operation p0.l = _oneByte; is stalled because the decoder is still decoding the previous stack pop operation, unlike the normal scenario in which this fetch would have completed in a single cycle.

Pipeline Execution After Interleaving:

STAGES   Fetch                Decode            Execute            Write Back
CYCLE1   Stack POP            R0 = B[p0](Z);    p0.h = _oneByte;   p0.l = _oneByte;
CYCLE2   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     p0.h = _oneByte;
CYCLE3   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE4   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE5   p0.h = _oneByte;     [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE6   p0.h = _oneByte;     [sp++]            R0 = B[p0](Z);     BUBBLE

*Please note that a 4-stage pipeline is described here just for the sake of simplicity; BlackFin actually has 10 stages.

I would say that the above methodology makes optimal usage of the hardware resources: both the decode unit and the execution unit are utilized at the same time, while in the previous case the stalls within these units were happening separately and hence added up. Reordering simply made them happen at the same time.


BlackFin: Introduction

Let’s set the ball rolling on a rather simple note by mentioning the basic attributes of the BlackFin core and then slowly building up the complexity.

> Dual MAC signal processing engine.

Having a hardware multiplier is one of the most important features of any DSP. Two multipliers along with two accumulator registers assist in performing a range of fixed-point 16-bit multiply operations very efficiently.
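For example, BlackFin can issue both MACs in a single instruction. The line below uses the standard dual-MAC syntax to accumulate two independent 16-bit products into A1 and A0 in the same cycle:

A1 += R0.H * R1.H, A0 += R0.L * R1.L;    /* two 16-bit MACs per cycle */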

> RISC instruction set

The load-store architecture emphasizes the RISC nature of BlackFin, and the same helps in better instruction ordering, resulting in quick execution of critical DSP operations.

> SIMD Capabilities

It is common for DSP algorithms to do similar operations on a set of values, and to optimize such operations we have Single Instruction Multiple Data (SIMD) capabilities. The SIMD capabilities provided by the video ALUs are necessary for critical video-processing algorithms.
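As a small illustration, the dual 16-bit add below (standard BlackFin vector syntax) operates on the high halves and the low halves of R0 and R1 independently, in a single instruction:

R2 = R0 +|+ R1;    /* R2.H = R0.H + R1.H and R2.L = R0.L + R1.L, in parallel */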

The above three attributes form the fundamental DSP characteristics of the BlackFin core.

The existence of a minimal MMU and peripherals like PPI and SPI also makes BlackFin an efficient general-purpose microcontroller.