
Execution of a machine instruction is divided into various steps, like fetching, decoding and execution. The role these steps play are different and hence we also need different hardware units, now dedicating different units also enables simultaneous executions. Essentially an instruction pipeline! BlackFin is not a super scalar processor but there can be a certain amount of parallel execution which can happen within a execution unit.

If we observe the above architecture, we can broadly classify BlackFin core into two units
- ALU + Multipliers + Shifter + Video ALUs
- Data Address Generators
Here in lies the key for identifying the instructions which can execute in parallel, we can broadly say that we should be able to execute computational unit operations and load/store operations in parallel.
r1 = extract(r0,r2.l)(z)||r0 = [i0++]||w[i1]=r5.l;
above given is a multi-issue instruction, this combines a 32 bit instruction (extract) with two 16 bit instructions (load and store).
The Extract instruction is executed by the Barrel Shifter hardware and the load store instructions are executed by Data address generators. So we have both the modules of the core working in parallel.
Looking at this from a “Load Store” architecture point of view we can add one more observation. Such a parallel execution of Computational operation and Load/Store operation is possible because the former is not accessing any memory. And all the operands and the destination registers are within the core because of which there are no data bus accesses. Absence of this bus access is what makes it possible for the core to execute Load and store operation in parallel to the ALU/multiplier operations.
r3.h=(a1+=r6.h*r4.l);
The above instruction is not a multi-issue or a multi cycle instruction, we can probably say that if multi-issue instructions use the breadth of the processor then this instruction used the depth. I would leave it to the reader to guess how this one might be processed.
Multi-Cycle instructions are those which takes more than one cycle to execute, this is more like a CISC concept, one instruction gets decoded into multiple simple instructions.
- r3 *= r0;
- [ — sp ] = (r7:5, p5:0) ;
The first operation given above is a 32 bit multiplication which is not possible considering the fact that the available multipliers are 16 bits, hence this operation is achieved in the hardware by using the same 16 bit multipliers but by doing more than one multiplication operations.
The second operation is a stack push, this instruction specifies that all the r5 to r7 registers and all the p5 to p0 registers should be pushed to the stack. The hardware will sequentially push all the specified registers one by one and this leads to a multi-cycle operation.
All the multi-cycle operations are decoded many times over and over again!