Why have a Load Store architecture?

Typically, a RISC architecture is built on the fundamentals of the Load/Store methodology. There is a plethora of internet sources for understanding Load/Store; this link should definitely be helpful.

Considerable study has gone into the RISC architecture. Its instructions are relatively simple and perform simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the Hardware Reference Manual of BlackFin.

“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.


Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”
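
To make the first point of the quote concrete, here is a minimal sketch in Blackfin assembly (pointer setup omitted; the register and pointer choices are mine, purely for illustration): the arithmetic instruction works only on registers, so every memory access is a separate instruction.

R0 = W[p0](Z); /* LOAD operand 1 from memory into a register */
R1 = W[p1](Z); /* LOAD operand 2 */
R0 = R0 + R1;  /* the add sees only registers */
W[p0] = R0;    /* STORE the result back to memory */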

This theoretical possibility, hiding load latency behind unrelated work, can be illustrated by the example below. Here we save 4 processor clock cycles by using the above-mentioned logic of interleaving the load operations with unrelated stack operations.

Initial Code BEFORE inserting unrelated operations between Loads:

(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation) */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation) */
/* MOVE the addresses of the variables */
/* which need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
R1 = W[p1](Z); /* LOAD Operation 2 */

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

The above code will behave “almost” similarly to code that runs on a CISC machine; now let's look at the code after interleaving.

Code After Interleaving STACK operation with LOAD operation:

/* MOVE the addresses of the variables */
/* which need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R1 = W[p1](Z); /* LOAD Operation 2 */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

Now we know ONE of the reasons for the out-of-order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how we ended up saving 4 clock cycles.

For the above code snippet, the following conditions are satisfied.

> Code and data reside in the off-chip SDRAM.
> A LOAD operation consumes 8 clock cycles within the execution stage of the pipeline.
> The interleaved STACK POP is a “multicycle” operation and spends multiple cycles at the decode stage of the pipeline.
> BlackFin does not execute an instruction until it has decoded it, and it executes as soon as the decode completes. Hence the stalling of the LOAD operation at the execution stage and the stalling of the stack operation at the decode stage happen in parallel; the overall number of stalls appears to reduce, while the number of stalls for each individual instruction remains the same.

Initial Pipeline Execution BEFORE interleaving:

| Cycle  | Fetch             | Decode           | Execute          | Write Back |
|--------|-------------------|------------------|------------------|------------|
| CYCLE1 | Stack POP         | ------           | ------           | ------     |
| CYCLE2 | p0.l = _oneByte;  | [ sp ++ ]        | ------           | ------     |
| CYCLE3 | p0.l = _oneByte;  | [ sp ++ ]        | ------           | ------     |
| CYCLE4 | p0.l = _oneByte;  | [ sp ++ ]        | ------           | ------     |
| CYCLE5 | p0.h = _oneByte;  | p0.l = _oneByte; | [ sp ++ ]        | ------     |
| CYCLE6 | p1.l = _twoBytes; | p0.h = _oneByte; | p0.l = _oneByte; | [ sp ++ ]  |

The repeated cells indicate pipeline stalls: the operation p0.l = _oneByte; is held at the fetch stage during cycles 2–4 because the decoder is still decoding the previous stack pop, unlike the normal scenario in which the fetch would have advanced in a single cycle.

Pipeline Execution After Interleaving:

| Cycle  | Fetch            | Decode         | Execute          | Write Back       |
|--------|------------------|----------------|------------------|------------------|
| CYCLE1 | Stack POP        | R0 = B[p0](Z); | p0.h = _oneByte; | p0.l = _oneByte; |
| CYCLE2 | R1 = W[p1](Z);   | [ sp ++ ]      | R0 = B[p0](Z);   | p0.h = _oneByte; |
| CYCLE3 | R1 = W[p1](Z);   | [ sp ++ ]      | R0 = B[p0](Z);   | BUBBLE           |
| CYCLE4 | R1 = W[p1](Z);   | [ sp ++ ]      | R0 = B[p0](Z);   | BUBBLE           |
| CYCLE5 | p0.h = _oneByte; | [ sp ++ ]      | R0 = B[p0](Z);   | BUBBLE           |
| CYCLE6 | p0.h = _oneByte; | [ sp ++ ]      | R0 = B[p0](Z);   | BUBBLE           |

*Please note that a 4-stage pipeline is described here just for the sake of simplicity; BlackFin actually has 10 stages.

I would say that the above methodology makes optimal use of the hardware resources: both the decode unit and the execution unit are utilized at the same time. In the previous case, the stalls within these units were happening separately and hence added up; reordering simply made them happen at the same time.

BlackFin

BlackFin: Introduction

Let’s set the ball rolling on a rather simple note by mentioning the basic attributes of BlackFin Core and then slowly build up the complexity.

> Dual MAC signal processing engine.

Having a hardware multiplier is one of the important features of any DSP. Two multipliers along with two accumulator registers assist in performing a range of fixed-point 16-bit multiply operations very efficiently.
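
As a quick illustration (a sketch with arbitrary register choices, not code from any particular application), a single Blackfin instruction can drive both multipliers and both accumulators on packed 16-bit operands, which is what lets inner loops such as an FIR kernel retire two multiply-accumulates per cycle:

/* Both MACs in one instruction: A1 works on the high halves, A0 on the low halves */
A1 += R1.H * R2.H, A0 += R1.L * R2.L;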

> RISC instruction set

The Load/Store architecture emphasizes the RISC nature of BlackFin, and it helps with better instruction ordering, resulting in quick execution of critical DSP operations.
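
For example (an illustrative sketch with arbitrary registers, assuming the loaded data may live off-chip), the compiler or programmer can slide an unrelated instruction into the load-to-use gap so the dependent add is less likely to stall:

R0 = W[p1](Z); /* load may take several cycles if the data is off-chip */
R3 = R4 + R5;  /* unrelated work placed here helps hide the load latency */
R2 = R0 + R1;  /* dependent add issues once the loaded value is available */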

> SIMD Capabilities

It's common for DSP algorithms to do similar operations on a set of values, and to optimize such operations we have Single Instruction Multiple Data capabilities. The SIMD capabilities provided by the video ALUs are necessary for critical video processing algorithms.
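
For instance (again just an illustrative snippet with arbitrary registers), the vector instructions operate on both 16-bit halves of the 32-bit data registers in one shot:

/* Dual 16-bit add and dual 16-bit subtract on packed halfwords */
R5 = R2 +|+ R3; /* R5.H = R2.H + R3.H, R5.L = R2.L + R3.L */
R6 = R2 -|- R3; /* R6.H = R2.H - R3.H, R6.L = R2.L - R3.L */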

The above three attributes form the fundamental DSP characteristics of the Blackfin core.

The existence of a minimal MMU and peripherals like PPI and SPI also makes the BlackFin an efficient general-purpose microcontroller.

BlackFin?

Nope, not talking about the tuna fish. BlackFin is a microprocessor manufactured by Analog Devices, and its architecture was co-developed with Intel Corporation. We could call the BlackFin core a hybrid between the SHARC and XScale processor families; in other words, it has both microcontroller features and DSP features.

More about DSP vs. microcontroller can be found at edaBoard.

Boot up to BlackFin here

I hope to dissect the BlackFin core and clear some neural fog, and maybe cloud the readers with some fancy jargon, just kidding!

BlackFinning