Why have a Load Store architecture?

Typically a RISC architecture is built on the fundamentals of the load-store methodology, and there are plenty of internet sources for understanding it.

Considerable study has gone into the RISC architecture: its instructions are relatively simple and perform simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the Blackfin Hardware Reference manual.

“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.


Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”
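To make the quoted idea concrete, here is a small hypothetical C sketch (the function names and data are invented for illustration). On an in-order pipeline, the second form issues both loads before any arithmetic, giving each load time to complete before its value is needed:

```c
/* Without interleaving: each load is immediately followed by the
 * arithmetic that depends on it, so the pipeline stalls on memory. */
int dot2_naive(const int *a, const int *b)
{
    int x = a[0] * b[0];   /* multiply right after the loads: stall */
    int y = a[1] * b[1];   /* same pattern again                    */
    return x + y;
}

/* With interleaving: all independent loads are issued first, then the
 * arithmetic runs once the operands have (hopefully) arrived. */
int dot2_scheduled(const int *a, const int *b)
{
    int a0 = a[0], b0 = b[0];   /* start the loads early            */
    int a1 = a[1], b1 = b[1];   /* independent loads fill the gap   */
    return a0 * b0 + a1 * b1;   /* dependent arithmetic comes last  */
}
```

Both functions compute the same value; only the ordering differs, which is exactly the freedom the manual says a load/store machine gives the compiler.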

This theoretical possibility is illustrated by the example below, where we save 4 processor clock cycles by interleaving the load operations with unrelated stack operations.

Initial Code BEFORE inserting unrelated operations between Loads:

(r7:2) = [ sp ++ ]; /* STACK POP (unrelated operation) */
(p5:4) = [ sp ++ ]; /* STACK POP (unrelated operation) */
/* Move the addresses of the variables */
/* that need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
R1 = W[p1](Z); /* LOAD Operation 2 */

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */
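For readers who prefer C, the assembly above computes roughly the following, assuming `_oneByte` is an 8-bit variable and `_twoBytes` a 16-bit one (the C names are invented to mirror the labels):

```c
#include <stdint.h>

/* Hypothetical globals standing in for the _oneByte and _twoBytes labels. */
uint8_t  oneByte;   /* read with B[p0](Z): zero-extended byte load */
uint16_t twoBytes;  /* read with W[p1](Z): zero-extended word load */

void add_vars(void)
{
    uint32_t r0 = oneByte;   /* R0 = B[p0](Z)  */
    uint32_t r1 = twoBytes;  /* R1 = W[p1](Z)  */
    r0 = r0 + r1;            /* R0 = R0 + R1   */
    oneByte = (uint8_t)r0;   /* B[P0] = R0: the store keeps only the low byte */
}
```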

The above code will behave “almost” like code running on a CISC machine; now let’s look at the code after interleaving.

Code After Interleaving STACK operation with LOAD operation:

/* Move the addresses of the variables */
/* that need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */

p1.l = _twoBytes; /* Variable 2 */
p1.h = _twoBytes; /* Variable 2 */

R0 = B[p0](Z); /* LOAD Operation 1 */
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R1 = W[p1](Z); /* LOAD Operation 2 */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation)*/

R0 = R0 + R1; /* Actual Add Operation */

B[P0] = R0; /* Store Operation */

Now we know ONE of the reasons compilers reorder code when optimization is enabled. Having said all this, we still need to investigate exactly how we ended up saving 4 clock cycles.

For the above code snippet, the following conditions hold:

  • Code and data reside in off-chip SDRAM.
  • The LOAD operation consumes 8 clock cycles in the execute stage of the pipeline.
  • The interleaved STACK POP is a “multicycle” operation that spends multiple cycles in the decode stage of the pipeline.
  • Blackfin does not execute an instruction until it has been decoded, and it executes it as soon as decoding completes. Hence the load’s stall in the execute stage and the stack pop’s stall in the decode stage happen in parallel; the overall stall count appears to shrink, even though the number of stall cycles for each individual instruction stays the same.

Initial Pipeline Execution BEFORE interleaving:

| Cycle | Fetch                    | Decode             | Execute          | Write Back |
|-------|--------------------------|--------------------|------------------|------------|
| 1     | Stack POP                | —                  | —                | —          |
| 2     | p0.l = _oneByte;         | [ sp ++ ]          | —                | —          |
| 3     | p0.l = _oneByte; (stall) | [ sp ++ ] (stall)  | —                | —          |
| 4     | p0.l = _oneByte; (stall) | [ sp ++ ] (stall)  | —                | —          |
| 5     | p0.h = _oneByte;         | p0.l = _oneByte;   | [ sp ++ ]        | —          |
| 6     | p1.l = _twoBytes;        | p0.h = _oneByte;   | p0.l = _oneByte; | [ sp ++ ]  |

The repeated entries in cycles 3 and 4 are pipeline stalls: the operation p0.l = _oneByte; is held up because the decoder is still working through the previous multicycle stack pop, whereas in the normal scenario it would have completed in a single cycle.

Pipeline Execution After Interleaving:

| Cycle | Fetch                  | Decode            | Execute                | Write Back       |
|-------|------------------------|-------------------|------------------------|------------------|
| 1     | Stack POP              | R0 = B[p0](Z);    | p0.h = _oneByte;       | p0.l = _oneByte; |
| 2     | R1 = W[p1](Z);         | [ sp ++ ]         | R0 = B[p0](Z);         | p0.h = _oneByte; |
| 3     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 4     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 5     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |
| 6     | R1 = W[p1](Z); (stall) | [ sp ++ ] (stall) | R0 = B[p0](Z); (stall) | BUBBLE           |

*Please note that a 4-stage pipeline is shown here purely for simplicity; the Blackfin pipeline actually has 10 stages.

I would say that this reordering makes optimal use of the hardware resources: the decode unit and the execute unit are busy at the same time. In the earlier version, the stalls within these units happened one after the other and therefore added up; reordering simply makes them overlap.
