Debug Modules: Introduction

Working with on-chip debug IP modules was a definite learning experience. It involved writing low-level software for setting breakpoints and watchpoints, configuring call trace, and maintaining a hardware-dependent, high-performance data-logging module. Such debug modules rarely get the appreciation they deserve; a considerable amount of chip area is dedicated to the hardware IPs which enable us to develop such complex embedded software.

Each processor comes with its own unique way of implementing debug modules; my understanding is based on experience with StarCore. Let's look at the features of the on-chip emulator which provides the infrastructure for all the debug activities. Below we can see the diagram of a typical JTAG-based debugging setup.

[Figure: JTAG-based debugging setup]

EOnCE (Enhanced On-Chip Emulator)

EOnCE is the debug module which adds all the above-mentioned capabilities to StarCore-based platforms. There are various points which we need to note about EOnCE.

  • EOnCE is accessible over JTAG as well as from the core, by addressing its memory-mapped registers.
  • On simultaneous access of EOnCE from JTAG and the core, the JTAG access gets prioritized over the StarCore access.
  • The debugger writes into EOnCE registers through the JTAG port and thereby configures the debugging-related features.
  • When we set a breakpoint or configure a trace using the debugger on the host PC, a message is sent to the EOnCE module on the target; this message ensures that our debug configuration is applied on the chip.
[Figure: EOnCE framework]
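Since EOnCE is memory mapped, the core itself can program the debug registers from software. Below is a minimal C sketch of that idea; the base address, offsets and register names are hypothetical placeholders, not the actual StarCore memory map.

#include <stdint.h>

/* Hypothetical EOnCE addresses, purely for illustration */
#define EONCE_BASE    0xFFFE0000u
#define EONCE_STATUS  (*(volatile uint32_t *)(EONCE_BASE + 0x00))
#define EONCE_CTRL    (*(volatile uint32_t *)(EONCE_BASE + 0x04))

/* Enable a debug feature from the core side and read back the status.
 * A concurrent JTAG access to the same registers would win, as noted above. */
static uint32_t eonce_enable(uint32_t feature_bits)
{
    EONCE_CTRL |= feature_bits;   /* core-side write to a debug register */
    return EONCE_STATUS;          /* confirm the configuration took effect */
}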

 

Downloading a binary using JTAG!

This is one of the most obvious reasons to use a JTAG debugger; we could pretty much never deliver anything on time if we had to flash the binary each time we compile and test it. Every module on the chip has a controller (TAP controller) associated with it which makes it JTAG compliant; in other words, this very TAP controller connects the JTAG port to the hardware module.

[Figure: TAP interfaced with the System JTAG Controller]

In the above picture we can see the DMA module with a TAP controller, interfaced with the System JTAG Controller (SJC), which is the main JTAG controller. As illustrated, every peripheral and every piece of core logic inside the chip has a TAP controller, and each one is accessed and configured through the SJC. Now let's see what steps are involved in downloading code into the target RAM.

  • First we need to place the processor (StarCore) in DEBUG mode; for that we send the DEBUG command to the SJC.
  • Next, to access the EOnCE, execute the command to select the EOnCE TAP controller through the SJC. From this point on we can directly control the EOnCE using JTAG.
  • The EOnCE module has the capability to make the StarCore execute instructions like MOVE and some of the arithmetic operations; this feature is used to transfer the program data into memory.
  • Send the program data to the EOnCE receive register; we then need to move this program data into StarCore-accessible memory, which requires executing a MOVE instruction.
  • Every command which needs to be issued to the StarCore through JTAG has to be written into a specific EOnCE command register.
  • Write a MOVE instruction into the core command register to transfer the program data from the receive register to a core register (because we can write data into memory only from a core register, not from an EOnCE register).
  • Send another core command to move the data from the core register to the memory location. This completes one transfer.
  • The above steps are repeated till the whole program code is written into memory.

Now that makes up the rather elaborate sequence for downloading contents into the RAM; a rough sketch of the whole loop is given below.
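The C sketch below just restates the sequence above in code. All of the helper functions, register identifiers and opcodes (jtag_send_command, EONCE_RX and friends) are hypothetical placeholders standing in for the actual SJC/EOnCE encodings.

#include <stdint.h>

/* Hypothetical JTAG primitives; in reality the debugger/probe supplies these */
extern void jtag_send_command(uint32_t sjc_command);
extern void jtag_select_tap(uint32_t tap_id);
extern void jtag_write_reg(uint32_t eonce_reg, uint32_t value);
extern uint32_t op_move_reg_to_mem(uint32_t address);  /* builds a MOVE opcode */

enum {  /* made-up identifiers, for illustration only */
    SJC_CMD_DEBUG = 0x01, TAP_EONCE = 0x02,
    EONCE_RX = 0x10, EONCE_CORE_CMD = 0x11,
    OP_MOVE_RX_TO_CORE_REG = 0x20
};

void download_to_ram(const uint32_t *image, uint32_t words, uint32_t dest)
{
    jtag_send_command(SJC_CMD_DEBUG);    /* place StarCore in DEBUG mode */
    jtag_select_tap(TAP_EONCE);          /* select the EOnCE TAP through the SJC */

    for (uint32_t i = 0; i < words; i++) {
        /* one word of program data into the EOnCE receive register */
        jtag_write_reg(EONCE_RX, image[i]);
        /* core command: MOVE receive register -> core register */
        jtag_write_reg(EONCE_CORE_CMD, OP_MOVE_RX_TO_CORE_REG);
        /* core command: MOVE core register -> destination memory word */
        jtag_write_reg(EONCE_CORE_CMD, op_move_reg_to_mem(dest + 4 * i));
    }
}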


Blackfin Core

This picture is right out of the BlackFin HRM.

[Figure: BlackFin core block diagram]

Without repeating what is already covered in the HRM, let us try to analyse the capabilities of the core from a different perspective.

Load Store Architecture means:

  • All the operations are done on registers.
  • That in turn means that all the computational units in the core need their inputs from the registers.
  • This mandates that there be internal buses which connect the computational units to the registers.
  • One set of buses acts as the input and another set carries the output back to the registers.

The diagram below depicts exactly what we need.

[Figure: Core modules connected by internal buses]

This is quite fascinating: we can now see exactly why some instructions work while others do not.

Consider the operation:

R0 = R1+ R2;

How does this work? There are two 32-bit buses running from the data register file to ALU0; the values inside R1 and R2 are transferred through them into ALU0, and once the operation is done the result comes back and gets written into R0. Perfect!

Now why does the following operation not work?

P0 = R0 + R1;

The answer lies again in the bus architecture. Even though R0 and R1 can be transferred to ALU0, we do not have a bus running from ALU0 to the pointer register file, hence there is no way this operation can succeed.

*Please note that by default ALU0 will be used; ALU1 comes into the picture only in the case of parallel operations, which can be discussed in another post.
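If the result genuinely has to end up in a pointer register, the bus constraint simply forces a two-step sequence. A sketch of the workaround: the add stays within the data register file, and a separate register-to-register move follows.

R2 = R0 + R1;    /* add happens in ALU0; the result lands back in the data register file */
P0 = R2;         /* separate move from a data register into the pointer register file */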

The idea here is to prod us into thinking from this perspective. When we understand the bus architecture, we understand how to write the best possible assembly code while remaining within the constraints of the system. All these data and instruction buses connecting memory, registers and computational units form the backbone of an architecture, and they inherently determine the strengths as well as the weaknesses of the processor.

Load Store Continued

The LS (Load Store) story has some more twists which I did not mention in the previous post, simply because it had already become way too long.

Stack Memory?

RISC machines always come with a sufficient number of registers, which in turn means that all the local variables get stored in registers themselves, with hardly any stack memory usage. Example code illustrating this phenomenon is given below.

Consider the following C code:


int add_on(int a, int b)
{
    int i = 0;

    for (i = 0; i < 10; i++) {
        a = a + b;
    }
    return a;
}

The corresponding BlackFin assembly code does not use a single stack variable. The register P0 and the loop counter register (LC0) take over the role of the stack variable “i”.

_add_on:
    P0 = 10;
    /* Initialize the zero-overhead loop registers */
    LSETUP(Loop_Starts, Loop_Ends) LC0 = P0;
    /* Single-instruction loop body */
Loop_Starts:
Loop_Ends:
    R0 = R0 + R1;
    RTS;
_add_on.end:

Zero What! Overhead Loop?

Compilers also reduce the overhead of context switching between function calls. How? Ideally, between calls, not all the registers need to be pushed onto the stack; only those registers which are used inside that particular function are saved and restored, because the rest anyway remain untouched, and the costly affair of stack memory usage and slow memory access is minimized. The result is lean stack usage; I have seen this feature in BlackFin compilers, but am yet to see it on StarCore!
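As a hypothetical illustration, a leaf function that clobbers only R7 would get a prologue and epilogue like the one sketched below, instead of a full push multiple of every callee-saved register:

_leaf_fn:
    [--sp] = r7;    /* save only the single callee-saved register we touch */
    /* ... function body using r7 ... */
    r7 = [sp++];    /* restore it on the way out */
    RTS;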

Bogus Store

Within the load-store structure, most of the time the store follows the load. But sometimes, even though in the code we might see a store written before another memory read, we really cannot guarantee the sequence in which they will execute. Let me quote the lucid BlackFin HRM.

“The relaxation of synchronization between memory access instructions and their surrounding instructions is referred to as weak ordering of loads and stores. Weak ordering implies that the timing of the actual completion of the memory operations—even the order in which these events occur—may not align with how they appear in the sequence of the program source code.”

“Because of weak ordering, the memory system is allowed to prioritize reads over writes. In this case, a write that is queued anywhere in the pipeline, but not completed, may be deferred by a subsequent read operation, and the read is allowed to be completed before the write. Reads are prioritized over writes because the read operation has a dependent operation waiting on its completion, whereas the processor considers the write operation complete, and the write does not stall the pipeline if it takes more cycles to propagate the value out to memory. This behavior could cause a read that occurs in the program source code after a write in the program flow to actually return its value before the write has been completed. This ordering provides significant performance advantages in the operation of most memory instructions. However, it can cause side effects that the programmer must be aware of to avoid improper system operation.”

When we get optimizations, we also get trade-offs. But let us keep the price limited to what we pay in die space, and not write buggy code and learn the hard way.

The side effect mentioned above comes into the picture when we are configuring registers, or any such memory, where a read after a write might give different results compared to a read before the write. Consider the following sequence:

1. Write Register 1

2. Read Register 2

What if the Register 2 value depends on the Register 1 write? In such situations we need to ensure strict ordering of reads and writes by using the sync instructions (CSYNC & SSYNC).
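A minimal sketch in BlackFin assembly, following the same _symbol convention as the earlier snippets; _reg1 and _reg2 are hypothetical memory-mapped registers where the second read depends on the first write:

p0.l = _reg1;    /* address of Register 1 (hypothetical MMR) */
p0.h = _reg1;
p1.l = _reg2;    /* address of Register 2, whose value depends on Register 1 */
p1.h = _reg2;

[p0] = R0;       /* 1. write Register 1 */
SSYNC;           /* force the write all the way out before continuing */
R1 = [p1];       /* 2. read Register 2, now guaranteed to observe the write */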


BlackFin Jargon

This post is not of much use on its own; it is meant to clarify some of the technical jargon mentioned in the other posts, and it will be updated as and when needed.

Multicycle Operations

Quote from the BlackFin HRM:

“Multi-cycle instructions behave as multiple single-cycle instructions being issued from the decoder over several clock cycles. For example, the Push Multiple or Pop Multiple instruction can push or pop from 1 to 14 DREGS and/or PREGS, and the instruction remains in the decode stage for a number of clock cycles equal to the number of registers being accessed.”

In other words, one single instruction behaves like a burst of simple instructions, occupying the decode stage for several cycles while being executed piece by piece.
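For instance, the single Pop Multiple below (syntax as in the HRM examples) restores four data registers and three pointer registers, and sits in the decode stage for roughly one cycle per register restored:

(r7:4, p5:3) = [sp++];    /* one instruction, seven register restores */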

Zero-Overhead Loop

Very much self-explanatory: whenever we have a loop in code, we always have two types of overhead,

  • the overhead of incrementing and checking a counter which decides the number of iterations of the loop, and
  • pipeline stalls because of branch mis-prediction at the end of an iteration.

Both of the above bottlenecks are eliminated by making the hardware aware of the loop: we specify the starting point, the ending point and the number of iterations to be executed.

Below I have given memcpy assembly code which uses zero-overhead looping.

/* Set up the loop which runs from the label Loop_starts
   to the label Loop_ends for P0 iterations */
LSETUP(Loop_starts, Loop_ends) LC0 = P0;
Loop_starts:
    r4 = b[p1++](z);    /* read from the source */
Loop_ends:
    b[p2++] = r4;       /* write to the destination */

Zero-overhead looping is implemented by a hardware module called the sequencer.


Why have a Load Store architecture?

Typically a RISC architecture is built on the fundamentals of the load-store methodology. There is a plethora of internet sources for understanding load-store; this link should definitely be helpful.

Considerable study has gone into the RISC architecture. Its instructions are relatively simple and do simple operations; in other words, a single CISC operation may be equivalent to a bunch of RISC operations. Let me quote a paragraph from the BlackFin Hardware Reference Manual.

“The Blackfin processor architecture supports the RISC concept of a Load/Store machine. This machine is the characteristic in RISC architectures whereby memory operations (loads and stores) are intentionally separated from the arithmetic functions that use the targets of the memory operations. The separation is made because memory operations, particularly instructions that access off-chip memory or I/O devices, often take multiple cycles to complete and would normally halt the processor, preventing an instruction execution rate of one instruction per cycle.

Separating load operations from their associated arithmetic functions allows compilers or assembly language programmers to place unrelated instructions between the load and its dependent instructions. If the value is returned before the dependent operation reaches the execution stage of the pipeline, the operation completes in one cycle.”

This theoretical possibility is illustrated by the example below, where we save 4 processor clock cycles by interleaving the load operations with unrelated stack operations.

Initial code BEFORE inserting unrelated operations between the loads:

(r7:2) = [sp++];     /* STACK POP (unrelated operation) */
(p5:4) = [sp++];     /* STACK POP (unrelated operation) */

/* Move in the addresses of the variables to be operated on */
p0.l = _oneByte;     /* Variable 1 */
p0.h = _oneByte;     /* Variable 1 */

p1.l = _twoBytes;    /* Variable 2 */
p1.h = _twoBytes;    /* Variable 2 */

R0 = B[p0](Z);       /* LOAD operation 1 */
R1 = W[p1](Z);       /* LOAD operation 2 */

R0 = R0 + R1;        /* actual add operation */

B[P0] = R0;          /* STORE operation */

The above code will behave “almost” like code running on a CISC machine; now let's look at the code after interleaving.

Code AFTER interleaving the STACK operations with the LOAD operations:

/* Move in the addresses of the variables to be operated on */
p0.l = _oneByte;     /* Variable 1 */
p0.h = _oneByte;     /* Variable 1 */

p1.l = _twoBytes;    /* Variable 2 */
p1.h = _twoBytes;    /* Variable 2 */

R0 = B[p0](Z);       /* LOAD operation 1 */
(r7:2) = [sp++];     /* STACK POP (unrelated operation) */

R1 = W[p1](Z);       /* LOAD operation 2 */
(p5:4) = [sp++];     /* STACK POP (unrelated operation) */

R0 = R0 + R1;        /* actual add operation */

B[P0] = R0;          /* STORE operation */

Now we know ONE of the reasons for the out-of-order code generation done by compilers when optimization is enabled. Having said all this, we definitely need to investigate how we ended up saving 4 clock cycles.

For the above code snippet, the following conditions are satisfied:

  • Code and data reside in the off-chip SDRAM.
  • A LOAD operation consumes 8 clock cycles within the execution stage of the pipeline.
  • The interleaved STACK POP operation is a “multicycle” operation and spends multiple cycles at the decode stage of the pipeline.
  • BlackFin does not execute an instruction until it has decoded it, and it executes as soon as it decodes. Hence the stalling of the load at the execution stage and the stalling of the stack operation at the decode stage happen in parallel, because of which the overall stall count appears to shrink, even though the number of stalls for each individual instruction remains the same.

Initial Pipeline Execution BEFORE interleaving:

STAGES   Fetch                  Decode              Execute             Write Back
CYCLE1   Stack POP              ------              ------              ------
CYCLE2   p0.l = _oneByte;       [sp++]              ------              ------
CYCLE3   p0.l = _oneByte; (*)   [sp++]              ------              ------
CYCLE4   p0.l = _oneByte; (*)   [sp++]              ------              ------
CYCLE5   p0.h = _oneByte;       p0.l = _oneByte;    [sp++]              ------
CYCLE6   p1.l = _twoBytes;      p0.h = _oneByte;    p0.l = _oneByte;    [sp++]

The entries marked with (*) are pipeline stalls; here the operation p0.l = _oneByte; is stalled because the decoder is still decoding the previous stack pop operation, unlike the normal scenario in which this fetch would have completed in a single cycle.

Pipeline Execution After Interleaving:

STAGES   Fetch                Decode            Execute            Write Back
CYCLE1   Stack POP            R0 = B[p0](Z);    p0.h = _oneByte;   p0.l = _oneByte;
CYCLE2   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     p0.h = _oneByte;
CYCLE3   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE4   R1 = W[p1](Z);       [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE5   p0.h = _oneByte;     [sp++]            R0 = B[p0](Z);     BUBBLE
CYCLE6   p0.h = _oneByte;     [sp++]            R0 = B[p0](Z);     BUBBLE

*Please note that a 4-stage pipeline is described here just for the sake of simplicity; BlackFin actually has 10 stages.

I would say that the above methodology makes optimal usage of the hardware resources: both the decode unit and the execution unit are utilized at the same time, while in the previous case the stalls within these units were happening separately and hence added up. Reordering simply made them happen at the same time.


BlackFin: Introduction

Let’s set the ball rolling on a rather simple note by mentioning the basic attributes of the BlackFin core and then slowly building up the complexity.

> Dual MAC signal processing engine.

Having a hardware multiplier is one of the most important features of any DSP. Two multipliers along with two accumulator registers assist in performing a range of fixed-point 16-bit multiply operations very efficiently.
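For example, BlackFin can issue both MACs in a single instruction. The line below uses the standard dual-MAC syntax to accumulate two independent 16-bit products into A1 and A0 in the same cycle:

A1 += R0.H * R1.H, A0 += R0.L * R1.L;    /* two 16-bit MACs per cycle */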

> RISC instruction set

The load-store architecture emphasizes the RISC nature of BlackFin, and the same helps in better instruction ordering, resulting in quick execution of critical DSP operations.

> SIMD Capabilities

It is common for DSP algorithms to do similar operations on a set of values, and to optimize such operations we have Single Instruction Multiple Data (SIMD) capabilities. The SIMD capabilities provided by the video ALUs are necessary for critical video-processing algorithms.
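As a small illustration, the dual 16-bit add below (standard BlackFin vector syntax) operates on the high halves and the low halves of R0 and R1 independently, in a single instruction:

R2 = R0 +|+ R1;    /* R2.H = R0.H + R1.H and R2.L = R0.L + R1.L, in parallel */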

The above three attributes form the fundamental DSP characteristics of the BlackFin core.

The existence of a minimal MMU and peripherals like PPI and SPI also makes BlackFin an efficient general-purpose microcontroller.