Why have a Load Store architecture?

Typically a RISC architecture is built on the fundamentals of the load-store methodology, and there are plenty of internet sources for understanding it.

Considerable study has gone into the RISC architecture: its instructions are relatively simple and perform simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the Blackfin Hardware Reference manual.

“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.


Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”
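To make the quoted idea concrete, here is a small hypothetical C sketch (the function names and data are invented for illustration). On an in-order pipeline, the second form issues both loads before any arithmetic, giving each load time to complete before its value is needed:

```c
/* Without interleaving: each load is immediately followed by the
 * arithmetic that depends on it, so the pipeline stalls on memory. */
int dot2_naive(const int *a, const int *b)
{
    int x = a[0] * b[0];   /* multiply right after the loads: stall */
    int y = a[1] * b[1];   /* same pattern again                    */
    return x + y;
}

/* With interleaving: all independent loads are issued first, then the
 * arithmetic runs once the operands have (hopefully) arrived. */
int dot2_scheduled(const int *a, const int *b)
{
    int a0 = a[0], b0 = b[0];   /* start the loads early            */
    int a1 = a[1], b1 = b[1];   /* independent loads fill the gap   */
    return a0 * b0 + a1 * b1;   /* dependent arithmetic comes last  */
}
```

Both functions compute the same value; only the ordering differs, which is exactly the freedom the manual says a load/store machine gives the compiler.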

This theoretical possibility is illustrated by the example below, where we save 4 processor clock cycles by interleaving the load operations with unrelated stack operations.

Initial Code BEFORE inserting unrelated operations between Loads:

(r7:2) = [ sp ++ ]; /* STACK POP (unrelated operation) */
(p5:4) = [ sp ++ ]; /* STACK POP (unrelated operation) */
/* Move the addresses of the variables */
/* that need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
R1 = W[p1](Z); /* LOAD Operation 2 */

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */
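For readers who prefer C, the assembly above computes roughly the following, assuming `_oneByte` is an 8-bit variable and `_twoBytes` a 16-bit one (the C names are invented to mirror the labels):

```c
#include <stdint.h>

/* Hypothetical globals standing in for the _oneByte and _twoBytes labels. */
uint8_t  oneByte;   /* read with B[p0](Z): zero-extended byte load */
uint16_t twoBytes;  /* read with W[p1](Z): zero-extended word load */

void add_vars(void)
{
    uint32_t r0 = oneByte;   /* R0 = B[p0](Z)  */
    uint32_t r1 = twoBytes;  /* R1 = W[p1](Z)  */
    r0 = r0 + r1;            /* R0 = R0 + R1   */
    oneByte = (uint8_t)r0;   /* B[P0] = R0: the store keeps only the low byte */
}
```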

The above code will behave “almost” like code running on a CISC machine; now let’s look at the code after interleaving.

Code After Interleaving STACK operation with LOAD operation:

/* Move the addresses of the variables */
/* that need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R1 = W[p1](Z); /* LOAD Operation 2 */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

Now we know ONE of the reasons compilers reorder code when optimization is enabled. Having said all this, we still need to investigate exactly how we ended up saving 4 clock cycles.

For the above code snippet, the following conditions hold:

  • Code and data reside in off-chip SDRAM.
  • The LOAD operation consumes 8 clock cycles in the execute stage of the pipeline.
  • The interleaved STACK POP is a “multicycle” operation that spends multiple cycles in the decode stage of the pipeline.
  • Blackfin does not execute an instruction until it has been decoded, and it executes it as soon as decoding completes. Hence the load’s stall in the execute stage and the stack pop’s stall in the decode stage happen in parallel; the overall stall count appears to shrink, even though the number of stall cycles for each individual instruction stays the same.

Initial Pipeline Execution BEFORE interleaving:

| Cycle | Fetch                    | Decode             | Execute          | Write Back |
|-------|--------------------------|--------------------|------------------|------------|
| 1     | Stack POP                | —                  | —                | —          |
| 2     | p0.l = _oneByte;         | [ sp ++ ]          | —                | —          |
| 3     | p0.l = _oneByte; (stall) | [ sp ++ ] (stall)  | —                | —          |
| 4     | p0.l = _oneByte; (stall) | [ sp ++ ] (stall)  | —                | —          |
| 5     | p0.h = _oneByte;         | p0.l = _oneByte;   | [ sp ++ ]        | —          |
| 6     | p1.l = _twoBytes;        | p0.h = _oneByte;   | p0.l = _oneByte; | [ sp ++ ]  |

The repeated entries in cycles 3 and 4 are pipeline stalls: the operation p0.l = _oneByte; is held up because the decoder is still working through the previous multicycle stack pop, whereas in the normal scenario it would have completed in a single cycle.

Pipeline Execution After Interleaving:

| Cycle | Fetch                  | Decode            | Execute                | Write Back       |
|-------|------------------------|-------------------|------------------------|------------------|
| 1     | Stack POP              | R0 = B[p0](Z);    | p0.h = _oneByte;       | p0.l = _oneByte; |
| 2     | R1 = W[p1](Z);         | [ sp ++ ]         | R0 = B[p0](Z);         | p0.h = _oneByte; |
| 3     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 4     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 5     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 6     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |

*Please note that a 4-stage pipeline is shown here purely for simplicity; the Blackfin pipeline actually has 10 stages.

I would say that this reordering makes optimal use of the hardware resources: the decode unit and the execute unit are busy at the same time. In the earlier version, the stalls within these units happened one after the other and therefore added up; reordering simply makes them overlap.
