Some of the features that distinguish a DSP processor are:
- Video ALUs
- Zero Overhead loop support
- Circular Buffering support
One piece of code used the circular buffering capabilities of StarCore to write into a common buffer. The routine that copied data into the buffer was optimized by writing it in assembly, and it was written so that each MOVE operation copied as many bytes as possible. In other words, if the source and destination were both aligned at 8-byte boundaries, the assembly used a MOVE.64 instruction, which copies 64 bits at a time.
The pseudo code of the assembly is given below.
- Start of Loop:
- if (src%8 == 0) && (dest%8 == 0) && (size >= 8)
- move.64 src,dest (instruction to move 8 bytes)
- else if (src%4 == 0) && (dest%4 == 0) && (size >= 4)
- move.32 src,dest (instruction to move 4 bytes)
- else if (src%2 == 0) && (dest%2 == 0) && (size >= 2)
- move.16 src,dest (instruction to move 2 bytes)
- else move.8 src,dest (byte copy)
- advance src and dest by the number of bytes moved, reduce size by the same amount, and repeat from Start of Loop while size > 0
The above method adds the overhead of checking the alignment on every iteration, but the clock cycles saved by moving more bytes per move instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8, or at least a multiple of 4.
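The copy loop above can be sketched in C as follows. This is only an illustration of the pseudocode, not the original StarCore assembly: the names `widest_move` and `aligned_copy` are hypothetical, and `memcpy` stands in for the move.64, move.32, move.16 and move.8 instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: width in bytes (8, 4, 2 or 1) that the pseudocode
   would pick for the current source, destination and remaining size. */
static size_t widest_move(uintptr_t src, uintptr_t dest, size_t size)
{
    if (src % 8 == 0 && dest % 8 == 0 && size >= 8) return 8;
    if (src % 4 == 0 && dest % 4 == 0 && size >= 4) return 4;
    if (src % 2 == 0 && dest % 2 == 0 && size >= 2) return 2;
    return 1;
}

/* Hypothetical copy routine: re-checks alignment on every iteration,
   as the pseudocode does, and returns the number of moves issued. */
static size_t aligned_copy(void *dest, const void *src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t moves = 0;

    assert(dest != NULL && src != NULL);
    while (size > 0) {
        size_t n = widest_move((uintptr_t)s, (uintptr_t)d, size);
        memcpy(d, s, n);   /* stands in for one move.N instruction */
        d += n;
        s += n;
        size -= n;
        moves++;
    }
    return moves;
}
```

Note that this sketch, like the pseudocode, only looks at the alignment of the two pointers and the remaining size. It knows nothing about where the destination buffer ends.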
The circular buffering was implemented using the index, base, length and modifier registers. Let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches or passes B0 + L0, I0 is wrapped back into the buffer, which starts at B0. This way circular buffering is maintained without any additional software overhead of checking bounds.
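A minimal C model of this index update is given below. It is a sketch under assumptions, not a description of any particular core: the struct `dag_regs` and the function `dag_post_modify` are made-up names, and the model only covers a positive modifier that is no larger than the buffer length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one set of circular buffering registers. */
typedef struct {
    uint32_t i;   /* index: current address          */
    uint32_t b;   /* base: first address of buffer   */
    uint32_t l;   /* length: buffer size in bytes    */
} dag_regs;

/* Post-modify the index by m bytes and wrap it back into [b, b + l).
   Assumes 0 < m <= l, so a single subtraction is enough. */
static void dag_post_modify(dag_regs *r, uint32_t m)
{
    assert(r->l > 0 && m > 0 && m <= r->l);
    r->i += m;
    if (r->i >= r->b + r->l)
        r->i -= r->l;   /* wrap without any software bounds check */
}
```

The point to notice is that only the index is corrected. Nothing in this update looks at how many bytes the access that used the old index actually touched.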
Let's consider a buffer which has the following attributes:
- Size = 8 bytes
- Base address = 0x02
Now the writes into the 8-byte circular buffer happened in the following order:
- Write 1: 2 bytes
- Write 2: 4 bytes
- Write 3: 4 bytes
Let's see what happens in such a scenario. Below is the Blackfin assembly code which does the writes.
/* Initialize the values of the registers for circular buffering */
i0.l = circular_buff; /*Buffer Address – 0x02*/
i0.h = circular_buff; /*Buffer Address – 0x02*/
l0 = 8; /*Length of the buffer */
b0 = i0; /*Base address */
r0.l = 0xFFFF; /*Dummy Value which will */
r0.h = 0xeeee; /*be written into the buffer */
w[i0++] = r0.l; /*Write 0xFFFF (2 bytes) at I0 = 0x02, I0 becomes 0x04 */
[i0++] = r0; /*Write 0xEEEEFFFF (4 bytes) at I0 = 0x04, I0 becomes 0x08 */
[i0++] = r0; /*Write 0xEEEEFFFF (4 bytes) at I0 = 0x08, I0 wraps to 0x04 */
/*At this point the overflow has happened: */
/*0xEEEE has been written to 0x0A and 0x0B, */
/*which is out of bounds for the array circular_buff */
w[i0] = r0.h; /*I0 has properly looped back to 0x04 */
As the comments show, the first write of 2 bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used locations 0x04 to 0x07.
After the first two writes we need to write 4 more bytes. Location 0x08 is 4-byte aligned and 4 bytes remain to be written, so a single 4-byte store is issued. Ideally, 2 bytes should be written to addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.
With the above code this won’t happen: we end up with a 2-byte overflow into 0x0A and 0x0B, yet the index register still loops back properly. It leaves the first 2 bytes of the buffer unwritten by this store and points to 0x04.
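The three writes can be replayed in a small C simulation to make the overflow visible. This is a sketch, not Blackfin code: the `machine` struct, `store_post_inc` and `replay_example` are hypothetical names, the memory is a 16-byte array, and the store is modelled as little-endian, with r0 holding 0xEEEEFFFF (r0.h = 0xEEEE, r0.l = 0xFFFF).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 16u   /* tiny flat memory, addresses 0x00 to 0x0F */

/* Hypothetical model: memory plus index, base and length registers. */
typedef struct {
    uint8_t  mem[MEM_SIZE];
    uint32_t i, b, l;
} machine;

/* Store `width` bytes of `value` (little-endian) at the index, then
   post-increment the index with circular wrap. The store itself never
   wraps: like the bus, it only sees the start address. */
static void store_post_inc(machine *m, uint32_t value, uint32_t width)
{
    assert(width <= 4 && m->i + width <= MEM_SIZE);
    for (uint32_t k = 0; k < width; k++)
        m->mem[m->i + k] = (uint8_t)(value >> (8 * k));
    m->i += width;
    if (m->i >= m->b + m->l)
        m->i -= m->l;
}

/* Replays the three writes of the example: 2, 4 and 4 bytes. */
static machine replay_example(void)
{
    machine m;
    memset(m.mem, 0, sizeof m.mem);
    m.i = m.b = 0x02;                     /* buffer starts at 0x02      */
    m.l = 8;                              /* and covers 0x02 to 0x09    */
    store_post_inc(&m, 0xFFFFu, 2);       /* w[i0++] = r0.l             */
    store_post_inc(&m, 0xEEEEFFFFu, 4);   /* [i0++] = r0                */
    store_post_inc(&m, 0xEEEEFFFFu, 4);   /* [i0++] = r0, spills over   */
    return m;
}
```

After `replay_example()`, the index is 0x04 and bytes 0x0A and 0x0B, which lie outside the buffer, hold 0xEE, while 0x02 and 0x03 still hold the data of the first write.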
This was seen on the StarCore when the following conditions were satisfied:
- Size of the buffer was a multiple of 8 bytes
- Start address of the buffer was aligned at a 4-byte boundary but not at an 8-byte boundary, so that the end address of the buffer minus 4 bytes gives an 8-byte aligned address.
When the above conditions were satisfied we ended up with a 4-byte overflow, because an 8-byte move issued at that address runs 4 bytes past the end of the buffer. This is understandable because a load or a store operation consists of the following steps, which all happen in sequence:
- Calculate the Load/Store address
- Send out the Load/Store address on the address bus
- Wait for the required number of wait cycles before writing the data onto the store bus or reading data from the load bus.
The circular buffering logic is implemented by the data address generator module in the core. The bus protocol which loads or stores a value is not aware of the wrap condition, so it goes ahead and writes or reads the value at consecutive memory addresses, and we get a corruption.
Making sure that the index register points to a valid address is done by evaluating a simple formula whenever the modified index goes past the end of the buffer:
Index-new = Index-old + Modifier - Length;
This happens during one of the later stages in the pipeline (most probably the write-back stage or just before it), while the address is generated at a much earlier stage and the data access starts around the same time. All this makes sense: the hardware is behaving exactly as designed when it overflows the buffer.
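So this is indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned at an 8-byte boundary. Since the buffer size is a multiple of 8, the end address is then 8-byte aligned too, so no naturally aligned move of up to 8 bytes can straddle it.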
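The effect of the fix can be checked with a short C sketch that scans every naturally aligned access starting inside a buffer and reports the worst overflow. The function name `max_overflow` and the address values used with it are illustrative assumptions, not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case number of bytes accessed past the end of a circular buffer
   [base, base + len) by any naturally aligned access of width 1, 2, 4, ...
   up to max_width that starts inside the buffer. Hypothetical helper. */
static uint32_t max_overflow(uint32_t base, uint32_t len, uint32_t max_width)
{
    uint32_t end = base + len;
    uint32_t worst = 0;

    assert(len > 0 && max_width > 0);
    for (uint32_t w = 1; w <= max_width; w *= 2) {
        for (uint32_t addr = base; addr < end; addr++) {
            /* Only accesses aligned to their own width are considered. */
            if (addr % w == 0 && addr + w > end && addr + w - end > worst)
                worst = addr + w - end;
        }
    }
    return worst;
}
```

With these assumptions, `max_overflow(0x02, 8, 4)` gives 2, matching the Blackfin example, `max_overflow(0x04, 16, 8)` gives 4, matching the StarCore case of a 4-byte aligned base, and `max_overflow(0x08, 16, 8)` gives 0 once the base is 8-byte aligned. In C11 such an alignment could be requested with `_Alignas(8)` on the buffer declaration; how the original code enforced it is not stated.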