This post is not of much use on its own; it is meant to clarify some of the technical jargon mentioned in the other posts, and it will be updated as and when needed.
Multicycle Operations
Quote from the BlackFin HRM:
“Multi-cycle instructions behave as multiple single-cycle instructions being
issued from the decoder over several clock cycles. For example, the Push
Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS
and/or PREGS, and the instruction remains in the decode stage for a number
of clock cycles equal to the number of registers being accessed.”
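In other words, one single instruction stays in the decode stage for several cycles and is issued from there as if it were several single-cycle instructions, one per register accessed.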
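To make the register counting concrete, here is a small C sketch of the rule in the quote above. It assumes the usual register-range syntax used later in this post, where (r7:2) means R7 down to R2 and (p5:4) means P5 down to P4. The function names are my own and purely illustrative, and the model only counts decode cycles as the quote describes them.

```c
#include <assert.h>

/* Number of data registers named by a range such as (r7:2).
   Pass -1 when the instruction lists no data registers. */
static int dreg_count(int low)
{
    assert(low >= -1 && low <= 7);
    return (low < 0) ? 0 : 8 - low;   /* R7 down to R<low>, inclusive */
}

/* Number of pointer registers named by a range such as (p5:4).
   Pass -1 when the instruction lists no pointer registers. */
static int preg_count(int low)
{
    assert(low >= -1 && low <= 5);
    return (low < 0) ? 0 : 6 - low;   /* P5 down to P<low>, inclusive */
}

/* Per the quoted rule: cycles spent in decode = registers accessed. */
int decode_cycles(int dreg_low, int preg_low)
{
    return dreg_count(dreg_low) + preg_count(preg_low);
}
```

With this model, decode_cycles(2, -1) describes (r7:2) = [ sp ++ ]; and decode_cycles(0, 0) describes the largest possible pop of 14 registers.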
Zero-Overhead loop
The name is very much self-explanatory. Whenever we have a loop in code, we always have two types of overhead:
> Incrementing and checking a counter which decides the number of iterations of the loop.
> Pipeline stalls because of branch misprediction at the end of an iteration.
Both of the above bottlenecks are eliminated by making the hardware aware of the existence of the loop: we specify the starting point, the ending point and the number of iterations which need to be executed.
Below I have given memcpy assembly code which uses zero-overhead looping.
/* Set up the loop which runs from the label Loop_starts till the label Loop_ends for P0 times */
Lsetup(Loop_starts, Loop_ends) LC0 = p0;
Loop_starts:
r4 = b[p1++](z); /* Read from the source */
Loop_ends:
b[p2++] = r4; /* Write to the destination */
Zero-overhead looping is implemented by a hardware module called the sequencer.
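For comparison, this is roughly what the same byte copy looks like as an ordinary software loop in C. It is only a sketch to show where the two overheads come from; the function name byte_copy is my own, and a compiler targeting BlackFin may well turn such a loop into a hardware loop by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain software loop: the counter handling is explicit in the code. */
void byte_copy(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(count == 0 || (dst != NULL && src != NULL));
    while (count != 0) {          /* compare and branch on every iteration */
        uint8_t value = *src++;   /* corresponds to r4 = b[p1++](z); */
        *dst++ = value;           /* corresponds to b[p2++] = r4; */
        count--;                  /* counter update on every iteration */
    }
}
```

The two lines that move data are the only ones left inside the assembly loop body; the counter update and the compare-and-branch are what the loop setup hands over to the hardware.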
Typically, a RISC architecture is built on the fundamentals of the Load-Store methodology. There is a plethora of internet sources for understanding Load-Store; this link should definitely be helpful.
Considerable study has gone into the RISC architecture. Its instructions are relatively simple and do simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the Hardware Reference Manual of BlackFin.
“The Blackfin processor architecture supports the RISC concept of a
Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.
Separating load operations from their associated arithmetic functions
allows compilers or assembly language programmers to place unrelated
instructions between the load and its dependent instructions. If the value
is returned before the dependent operation reaches the execution stage of
the pipeline, the operation completes in one cycle.”
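The last sentence of the quote can be turned into a tiny cycle model. This is a deliberately simplified sketch of my own, not something taken from the manual: it assumes every unrelated instruction placed after the load hides exactly one cycle of the load latency.

```c
#include <assert.h>

/* Cycles the dependent instruction still has to wait for its operand.
   Assumption: each independent instruction hides one cycle of latency. */
int dependent_stall_cycles(int load_latency, int independent_instructions)
{
    assert(load_latency >= 0 && independent_instructions >= 0);
    int remaining = load_latency - independent_instructions;
    return remaining > 0 ? remaining : 0;
}
```

Under this assumption, a load that takes 8 cycles costs the full 8 cycles when its result is used immediately, and costs nothing once 8 or more unrelated instructions sit between the load and its use.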
The possibility described in the quote above can be illustrated by the example below. Here we saved 4 processor clock cycles by interleaving the load operations with an unrelated stack operation.
Initial Code BEFORE inserting unrelated operations between Loads:
(r7:2) = [ sp ++ ]; /* STACK POP (Unrelated operation) */
(p5:4) = [ sp ++ ]; /* STACK POP (Unrelated operation) */
/* MOVE the address of the variables */
/* which need to be operated on */
p0.l = _oneByte; /* Variable 1 */
p0.h = _oneByte; /* Variable 1 */
Now we know ONE of the reasons for the out-of-order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how we ended up saving 4 clock cycles.
For the above code snippet, the following conditions are satisfied.
> Code and data are on the off-chip SDRAM.
> The LOAD operation consumes 8 clock cycles within the execution stage of the pipeline.
> The interleaved STACK POP operation is a “Multicycle” operation, and it will spend multiple cycles at the decoder stage of the pipeline.
> BlackFin does not execute an instruction until it has decoded it, and it executes the instruction as soon as it has decoded it. Hence, the stalling of the load operation at the execution stage and the stalling of the stack operation at the decoding stage happen in parallel. Because of this the overall stall count seems to go down, but in effect the number of stalls for each instruction remains the same.
Initial Pipeline Execution BEFORE interleaving:
STAGES | Fetch            | Decode           | Execute          | Write Back
CYCLE1 | Stack POP        | -                | -                | -
CYCLE2 | p0.l =_oneByte;  | [ sp ++ ]        | -                | -
CYCLE3 | p0.l =_oneByte;  | [ sp ++ ]        | -                | -
CYCLE4 | p0.l =_oneByte;  | [ sp ++ ]        | -                | -
CYCLE5 | p0.h =_oneByte   | p0.l=_oneByte;   | [ sp ++ ]        | -
CYCLE6 | p1.l=_twoBytes;  | p0.h=_oneByte    | p0.l=_oneByte;   | [ sp ++ ]
The entries repeated across CYCLE2 to CYCLE4 are pipeline stalls. Here the operation p0.l =_oneByte; stalled because the decoder was still decoding the previous stack pop operation, unlike the normal scenario in which it would have completed in a single cycle.
Pipeline Execution After Interleaving:
STAGES | Fetch            | Decode           | Execute          | Write Back
CYCLE1 | Stack POP        | R0 = B[p0](Z);   | p0.h=_oneByte;   | p0.l=_oneByte;
CYCLE2 | R1 = W[p1](Z);   | [ sp ++ ]        | R0 = B[p0](Z);   | p0.h=_oneByte;
CYCLE3 | R1 = W[p1](Z);   | [ sp ++ ]        | R0 = B[p0](Z);   | BUBBLE
CYCLE4 | R1 = W[p1](Z);   | [ sp ++ ]        | R0 = B[p0](Z);   | BUBBLE
CYCLE5 | p0.h =_oneByte   | [ sp ++ ]        | R0 = B[p0](Z);   | BUBBLE
CYCLE6 | p0.h=_oneByte;   | [ sp ++ ]        | R0 = B[p0](Z);   | BUBBLE
*Please note that a 4-stage pipeline is described here just for the sake of simplicity; BlackFin actually has a 10-stage pipeline.
I would state that the above methodology made optimal use of the hardware resources: both the decoder unit and the execution unit are utilized at the same time. In the previous case, the stalls within these units were happening separately and hence added up. Reordering simply made them happen at the same time.
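The "added up" versus "at the same time" argument can be written down as a very small C model. This is only a sketch; the function names are mine, and the stall lengths you feed it are whatever you measure or assume, not figures from the manual.

```c
#include <assert.h>

/* Stalls that happen one after the other simply add up. */
int stalls_back_to_back(int decode_stall, int execute_stall)
{
    assert(decode_stall >= 0 && execute_stall >= 0);
    return decode_stall + execute_stall;
}

/* Stalls that happen in parallel cost only as much as the longer one. */
int stalls_overlapped(int decode_stall, int execute_stall)
{
    assert(decode_stall >= 0 && execute_stall >= 0);
    return decode_stall > execute_stall ? decode_stall : execute_stall;
}

/* Saving from reordering = length of the shorter stall. */
int cycles_saved(int decode_stall, int execute_stall)
{
    return stalls_back_to_back(decode_stall, execute_stall)
         - stalls_overlapped(decode_stall, execute_stall);
}
```

As an illustration with assumed numbers, a decode stall of 4 cycles overlapped with an execute stall of 8 cycles costs 8 cycles instead of 12. Each instruction still stalls exactly as long as before; only the total changes.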
Let’s set the ball rolling on a rather simple note by mentioning the basic attributes of BlackFin Core and then slowly build up the complexity.
> Dual MAC signal processing engine.
Having a hardware multiplier is one of the important features of any DSP. Two multipliers along with two accumulator registers assist in performing a range of fixed-point 16-bit multiply operations in a very efficient way.
> RISC instruction set
The Load-Store architecture emphasizes the RISC nature of BlackFin, and it helps in better instruction ordering, resulting in quicker execution of critical DSP operations.
> SIMD Capabilities
It's common for DSP algorithms to do similar operations on a set of values, and to optimize such operations we have Single Instruction Multiple Data capabilities. The SIMD capabilities provided by the video ALUs are necessary for critical video processing algorithms.
The above three attributes form the fundamental DSP characteristics of the BlackFin core.
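To give a feel for the first and third attributes, here is a plain C sketch of the kind of work they speed up. It is scalar code that only models the idea: the function names are my own, the wide int64_t accumulators are a stand-in for the hardware accumulator registers, and the sum of absolute differences is just one example of a per-pixel operation that video code often needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit fixed-point samples with two accumulators,
   mimicking two multipliers working on neighbouring samples. */
int64_t dot_product_dual_acc(const int16_t *x, const int16_t *y, size_t n)
{
    assert(n % 2 == 0);                         /* samples are taken in pairs */
    int64_t acc0 = 0;
    int64_t acc1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc0 += (int32_t)x[i] * y[i];           /* work for the first MAC  */
        acc1 += (int32_t)x[i + 1] * y[i + 1];   /* work for the second MAC */
    }
    return acc0 + acc1;
}

/* Sum of absolute differences over four 8-bit values packed in one word:
   the same operation applied to every lane, which is what SIMD exploits. */
uint32_t sad_quad_u8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int pa = (int)((a >> (8 * lane)) & 0xFFu);
        int pb = (int)((b >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(pa > pb ? pa - pb : pb - pa);
    }
    return sum;
}
```

In both functions the loop body repeats one simple operation over independent data, which is exactly the pattern that dual MAC and SIMD hardware can carry out in fewer instructions.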
The existence of a minimal MMU and peripherals like PPI and SPI also makes the BlackFin an efficient general-purpose microcontroller.
Nope, not talking about the tuna fish. BlackFin is a microprocessor manufactured by Analog Devices, and its architecture was co-developed with Intel Corporation. We can possibly call the BlackFin core a hybrid between the SHARC and XScale processor families; in other words, it has both microcontroller features and DSP features.
More about DSP vs. microcontroller can be found at edaBoard.