This post is not of much use on its own. It is meant to clarify some of the technical jargon used in the other posts, and it will be updated as needed.
On multi-cycle instructions, the Blackfin Hardware Reference Manual (HRM) says:
“Multi-cycle instructions behave as multiple single-cycle instructions being
issued from the decoder over several clock cycles. For example, the Push
Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS
and/or PREGS, and the instruction remains in the decode stage for a number
of clock cycles equal to the number of registers being accessed.”
In other words, a single instruction stays in the decode stage for several cycles and is issued as a series of single-cycle operations, one per register accessed.
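The cycle count described in the quote can be modelled in a few lines of C. This is only a sketch: the helper name `multi_reg_decode_cycles` is made up for illustration, and it assumes the usual Push/Pop Multiple register-list form `(R7:x, P5:y)`, where the data registers run from R7 down to Rx and the pointer registers from P5 down to Py.

```c
#include <assert.h>

/* Hypothetical helper, not part of any Blackfin toolchain.
 * Models the number of cycles a Push/Pop Multiple spends in the decode
 * stage, assuming a register list of the form (R7:dreg_lo, P5:preg_lo).
 * Pass -1 for a group that is not part of the list. */
static int multi_reg_decode_cycles(int dreg_lo, int preg_lo)
{
    assert(dreg_lo >= -1 && dreg_lo <= 7);
    assert(preg_lo >= -1 && preg_lo <= 5);

    int dregs = (dreg_lo >= 0) ? (7 - dreg_lo + 1) : 0; /* R7 down to Rx */
    int pregs = (preg_lo >= 0) ? (5 - preg_lo + 1) : 0; /* P5 down to Py */

    /* Per the HRM quote: one decode cycle per register accessed. */
    return dregs + pregs;
}
```

With the full list `(R7:0, P5:0)` the model gives 8 + 6 = 14 cycles, which matches the upper bound of 14 registers in the quote. A list such as `(R7:4)` alone gives 4.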
Zero overhead looping is largely self-explanatory. Whenever code contains a loop, there are two kinds of overhead:
- The cost of incrementing and checking the counter that decides the number of iterations.
- Pipeline stalls caused by branch misprediction at the end of an iteration.
Both of these bottlenecks are eliminated by making the hardware aware of the loop: the program specifies the loop's starting point, its ending point, and the number of iterations to execute.
Below is a memcpy routine in assembly that uses zero overhead looping.
/* Set up a loop that runs from the label Loop_starts to the label Loop_ends, P0 times */
LSETUP (Loop_starts, Loop_ends) LC0 = P0;
Loop_starts: r4 = b[p1++](z); /* Read a byte from the source, zero-extended */
Loop_ends: b[p2++] = r4; /* Write the byte to the destination */
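For comparison, here is roughly what the same byte copy looks like in C. It is a sketch rather than the exact equivalent of the assembly: the function name `byte_copy` is made up, `count` stands in for P0, `src` for P1 and `dst` for P2, and a nonzero count is assumed. The comments mark the bookkeeping that the loop hardware takes over in the assembly version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative byte-wise copy mirroring the assembly above.
 * count ~ P0, src ~ P1, dst ~ P2 (names chosen for this sketch). */
static void byte_copy(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    /* In a plain software loop, the increment of i, the comparison
     * against count and the branch back are extra work per iteration.
     * With a hardware loop none of these appear in the loop body. */
    for (size_t i = 0; i < count; i++) {
        uint8_t value = *src++; /* r4 = b[p1++](z) */
        *dst++ = value;         /* b[p2++] = r4   */
    }
}
```

A compiler targeting a processor with hardware loops may well turn this `for` loop into a hardware loop on its own, so the C form does not necessarily cost more in practice.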
Zero overhead looping is implemented by a hardware module called the program sequencer.
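To make the idea concrete, here is a very simplified C model of what the loop hardware does with the three values it is given. The struct fields are named after the Blackfin loop registers (loop top, loop bottom, loop count), but everything else is an assumption for illustration: every instruction is treated as one address unit wide, the count is assumed nonzero, and no pipeline behaviour is modelled.

```c
#include <assert.h>

/* Simplified stand-in for one set of loop registers (top, bottom, count). */
struct loop_regs {
    unsigned top;    /* address of the first instruction in the loop */
    unsigned bottom; /* address of the last instruction in the loop  */
    unsigned count;  /* iterations still to run                      */
};

/* Decide the next program counter after the instruction at pc.
 * The check against bottom happens in "hardware", so the loop body
 * itself contains no counter update, compare, or branch instruction. */
static unsigned next_pc(struct loop_regs *lr, unsigned pc)
{
    if (pc == lr->bottom && lr->count > 0) {
        lr->count--;
        if (lr->count > 0)
            return lr->top; /* wrap back to the loop start */
    }
    return pc + 1;          /* fall through to the next instruction */
}

/* Run the model from the loop top until execution leaves the loop.
 * Returns how many loop-body instructions were executed. */
static unsigned run_loop(unsigned top, unsigned bottom, unsigned iterations)
{
    struct loop_regs lr = { top, bottom, iterations };
    unsigned pc = top;
    unsigned executed = 0;

    assert(top <= bottom && iterations > 0);

    while (pc >= top && pc <= bottom) {
        executed++;
        pc = next_pc(&lr, pc);
    }
    return executed;
}
```

For a two-instruction body such as the memcpy loop, `run_loop(10, 11, 3)` executes 6 body instructions: two per iteration and nothing else.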