DMA Architecture

DMA hardware is integral to most ARM SoC designs. It is extensively used for driving peripheral controllers related to flash memory, LCD displays, USB, SD cards etc. Here we elaborate on the functioning of a typical ARM SoC DMA controller and briefly delve into its internal mechanisms. The intent is to explain the underlying architecture of a typical DMA module and then tie those elementary structural features to ARM’s PrimeCell® DMA Controller, which might initially come across as a more sophisticated piece of hardware.

Direct memory access is simply meant to let the processor offload memory transfer operations, so that the system can dedicate its valuable computational resources to more apt purposes. In fact, we can state that a DMA controller is like a processor module capable of executing only load-store operations, whereas a full-blown pipelined processor core possesses several other computational features and hence should rarely be tied up with rudimentary data transfers.

Ultimately, DMA is about transferring data across different hardware modules. Depending on the use case, the DMA source and destination might vary, and depending on the hardware, the specifics of the bus protocol employed for the load-store also vary. But just like a processor core, a DMA controller core must be compatible with the hardware bus protocol features related to burst mode, flow control etc. Otherwise it simply won’t be able to move data around.

 

Fig 1 : DMA v/s Processor

 

So, a DMA controller should simply be interpreted as a rudimentary logical block meant to implement data transfers. Similar to a processor core, this IP block also plugs into the larger data bus matrix involving SRAM and other peripherals. While a processor executes its object code from instruction memory, a DMA controller executes its operations by loading transfer configurations from a RAM segment. Such a configuration is usually termed a descriptor. This structure essentially describes all the details regarding the source/destination, burst size, length of the transfer etc. Eventually, such a configuration table has to be loaded into RAM by the DMA driver software running on the main processor, and the memory location of this descriptor table is usually communicated to the DMA controller via a shared register.

Fig 2 : Descriptor
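To make this concrete, here is a minimal C sketch of what such a descriptor might look like; the field names and widths are purely illustrative and not taken from any specific SoC.

#include <stdint.h>

/* Hypothetical transfer descriptor, mirroring the controller registers. */
struct dma_descriptor {
    uint32_t src_addr;    /* where the data is read from                */
    uint32_t dst_addr;    /* where the data is written to               */
    uint32_t length;      /* total number of bytes to move              */
    uint32_t burst_size;  /* bytes moved per bus burst                  */
    uint32_t next_desc;   /* RAM address of the next descriptor, if any */
    uint32_t flags;       /* e.g. interrupt-on-completion               */
};

The driver would fill such a structure in RAM and then write its address into the shared register mentioned above.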

A DMA controller is in effect a co-processor assisting the main core with various complex use cases involving heavy data transfers. For example, populating an LCD image bitmap or transferring contents to an SD card are DMA intensive features. Such a controller usually supports multiple DMA channels enabling simultaneous transfers; a controller with two or more channels should be able to drive an LCD and an SD card at the same time.

An interrupt signal is usually employed to synchronize these DMA transfers with the main processor operations. Once the requested data transfer is completed, the DMA hardware simply asserts an interrupt, which is then acknowledged by the processor.

Fig 3 : DMA Interrupt
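As a rough sketch, the driver side of this synchronization might look like the following C fragment; the register address, bit definition and handler name here are hypothetical.

#define DMA_STATUS_REG (*(volatile unsigned int *)0x40001004u) /* hypothetical address */
#define DMA_IRQ_CLEAR  (1u << 0)                               /* hypothetical bit     */

volatile int dma_done = 0;

/* Invoked when the DMA controller asserts its completion interrupt. */
void dma_irq_handler(void)
{
    DMA_STATUS_REG = DMA_IRQ_CLEAR;  /* acknowledge/clear the interrupt */
    dma_done = 1;                    /* let the waiting code resume     */
}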

 

Usually, the contents of a transfer descriptor mirror the DMA controller register structures; these memory-resident descriptor contents are essentially loaded verbatim into the actual hardware registers by the DMA controller itself. For example, a descriptor would include entries for the source and destination addresses which correspond directly to the DMA controller's source and destination address registers. Such a DMA controller block also contains a separate core logic which reads these registers and then implements the actual memory transfers, as illustrated by Fig 4 below.

Fig 4 : Generic DMA architecture

The transfer descriptor structures written to memory by the processor are loaded into the controller registers by the DMA hardware and eventually parsed and executed by its core load/store transfer logic. Naturally, any errors within these memory-resident transfer descriptors also propagate to the core transfer logic via these registers. Essentially, such a register set is meant to communicate all the relevant transfer configuration details to the DMA load/store engine: the source/destination, burst size, transfer length etc. The actual circuit implementing the load-store loop is already present within the DMA core.

Every hardware module has a primary interfacing mechanism: simple peripherals use registers, while complex processors employ binary object code. A typical DMA controller's interfacing mechanism, as elaborated above, involves a descriptor table reflecting its register set and transfer capabilities. PrimeCell is a DMA IP block from ARM, but unlike a typical DMA controller, its primary interfacing mechanism is neither a register set nor descriptor tables; it is an instruction set, very much resembling that of a simple processor.

Fig 5 : PrimeCell

PrimeCell integrates a rudimentary processor core which understands a limited instruction set, mostly meant for implementing load-store operations and other synchronization mechanisms related to the data bus and processor interrupts. So, here, instead of a descriptor table, we have object code describing the transfer: essentially a binary implementing memcpy, but written using the assembly instructions understood by the PrimeCell DMA processor. Naturally, such code also specifies the source/destination address, burst size etc. It is an actual binary object code.
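For a flavor of what such a transfer program might look like, below is a sketch using mnemonics from the PL330 (DMA-330) instruction set documented in ARM's TRM; the addresses, loop count and CCR fields here are purely illustrative.

DMAMOV  SAR, 0x20000000        ; source address (illustrative)
DMAMOV  DAR, 0x20004000        ; destination address (illustrative)
DMAMOV  CCR, SB4 SS32 DB4 DS32 ; burst length/width configuration (illustrative)
DMALP   16                     ; loop the body 16 times
DMALD                          ;   load a burst from the source
DMAST                          ;   store the burst to the destination
DMALPEND                       ; close the loop
DMASEV  0                      ; signal event 0, i.e. raise the completion interrupt
DMAEND                         ; end of program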

Fig 6 : PrimeCell Architecture

This DMA processor core is also multi-threaded, with one transfer channel associated with each thread. The threads possess the same scheduling priority, and the system implements a round-robin scheme. The internal FIFO is utilized as a buffer during DMA transfers and is shared across all the channels. Each channel is allowed access to the whole FIFO memory but, of course, at any instant the combined FIFO utilization across all running channels can never exceed its total size. Such a flexible shared-memory design adds to software complexity, but it can also lead to a more productive hardware resource allocation pattern.

Finally, the PrimeCell IP block also includes an instruction cache to optimize object code access, along with instruction abort handling and watchdog exceptions to avoid lockups. In other words, the module almost resembles a full-blown multitasking processor. Do note that such transfer-related or hardware-related exception triggers are present in a typical DMA controller as well. So it seems that with the PrimeCell IP, a software engineer is being exposed to the underlying machinery which already exists in other typical DMA hardware but is usually hidden behind the veil of register set abstractions. As always, nothing comes without a trade-off: the more complicated aspect of managing this module is definitely the object code generation, which makes a crude compiler out of an otherwise relatively simple DMA driver.


Man Versus Compiler

Man versus machine has always been an exhilarating contest, whether it's movie making (2001: A Space Odyssey, Terminator, Matrix), chess (Kasparov v/s Deep Blue) or maybe coding (compiler generated code v/s programmer generated code).

A Space Odyssey

The compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst written C code performed like a Formula One car. If you are writing system software, it seems futile to attempt increasing performance with direct assembly coding. There are of course some exceptions:

  • Certain parts of an operating system have to be written in assembly because they are platform dependent code (for example, the trap interrupt in Linux is generated by code written in assembly).
  • DSP algorithms which need to be optimized using SIMD capabilities.

For device driver coding and generic firmware development we hardly need to invest time in writing assembly; well written, modular C code will suffice.

It’s a popular misconception that critical code which needs to perform well has to be written in assembly. For example, I do remember some tech leads insisting that interrupt service routines be written in assembly; I am pretty sure that if they had done some empirical analysis of their code they would have got some rude shocks.

A programmer should be able to finish the assigned work in the most optimal way; there is no point investing weeks in writing and optimizing assembly to save a few cycles, only to discover later that the code's performance deteriorates for different inputs. The following steps are usually an optimal methodology:

  • Write the code in C.
  • Profile the code and identify bottlenecks.
  • Identify the critical parts of the code which should be improved.
  • Analyze the assembly generated for those critical parts and see if any optimizations can be done with minimal time investment and maximum results (for example, focusing our optimization on code which executes in a loop gives more returns).

In an ideal scenario, with a reasonable time investment, there is no way a programmer can beat the compiler in terms of performance optimization. So is there a way we can beat the compiler? One possible advantage a programmer has over the compiler is knowing the nature of the input; if the optimization strategy is focused on exploiting these nuances of the input, then he is bound to get amazing results.

Consider an example code which counts the number of odd and even elements in a list.

The C Code for the same is given below:

#include <stdbool.h>

/* Assumed globals: the input list and the result variables. */
extern int array[40];
int final_num;
int final_odd_num;

void count_list()
{
    int  i    = 0;
    bool temp = 0;
    int  even = 0;
    int  odd  = 0;

    for (i = 0; i < 40; i++)
    {
        temp = (array[i] & 0x00000001);   /* LSB decides odd/even */

        if (temp == 0)
        {
            even = even + 1;
        }
        else
        {
            odd = odd + 1;
        }
    }
    final_num     = even;
    final_odd_num = odd;
}

If we analyse the code, almost all the cycles are spent in the loop where we check for the odd and even elements.

Let's look at the assembly code generated for the loop.

/* Set up the zero overhead loop */

LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ;             /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ;      /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ;  /* Check the first bit and jump if set */
NOP ;                           /* NOP to prevent unintended increment */
R1 = R1 + R2 ;                  /* If branch not taken then increment even */
P0 += 4 ;                       /* Increment the address to be read */

….

..

R0 = R0 + R2 ;                  /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ;     /* Jump back to the start of the loop */

Now, can we optimize the above loop? The main bottleneck in the above code is the jump; each conditional jump misprediction costs us 8 core clocks.

How can we reduce the cost of this jump? Here we can analyse the input pattern of the array. Let's say the input array always consists of a majority of odd elements; in such a scenario we have a branch misprediction in the majority of the cases. Let's turn the tables here. Below I have given the optimized loop.

/* Set up the loop */

lsetup(Loop_starts,Loop_ends) LC0 = p0;
Loop_starts:
r1 = extract(r0,r2.l)(z) || r0 = [i0++]; /* Check bit 0 and read the next element at the same time */
cc = r1;                                 /* Assign LSB to CC flag */
if cc jump odd_num(bp);                  /* If CC is set then jump (note branch prediction) */
r5 += 1;                                 /* If branch not taken then increment even */
odd_num:
NOP;
Loop_ends:
r6 += 1;                                 /* If branch taken then increment odd */

The above code has two critical changes inside the loop.

  • Parallel instruction execution
  • Branch Prediction for conditional jump

Parallel instruction execution is an obvious advantage. Branch prediction reduced the penalty for odd numbers to 4 clock cycles but increased the penalty for even numbers to 8 clock cycles. Since the odd numbers are in the majority, on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to exploit the nuances present in the input pattern. The compiler is oblivious to such details and cannot generate code ideal for all kinds of inputs, but we, on the other hand, can tailor the assembly so that it caters to certain specific types of inputs.

Multi-Issue instructions and Multi-cycle instructions

BlackFin Pipeline

Execution of a machine instruction is divided into various steps, like fetching, decoding and execution. These steps play different roles and hence need different hardware units; dedicating different units in turn enables simultaneous execution. Essentially, an instruction pipeline! BlackFin is not a superscalar processor, but a certain amount of parallel execution can still happen within an execution unit.

Core Modules connected by internal Buses

If we observe the above architecture, we can broadly classify the BlackFin core into two units:

  • ALU + Multipliers + Shifter + Video ALUs
  • Data Address Generators

Herein lies the key for identifying the instructions which can execute in parallel: broadly, we should be able to execute computational unit operations and load/store operations in parallel.

r1 = extract(r0,r2.l)(z)||r0 = [i0++]||w[i1]=r5.l;

Given above is a multi-issue instruction; it combines a 32 bit instruction (extract) with two 16 bit instructions (a load and a store).

The extract instruction is executed by the barrel shifter hardware and the load/store instructions are executed by the data address generators. So we have both modules of the core working in parallel.

Looking at this from a "load-store" architecture point of view, we can add one more observation. Such parallel execution of a computational operation and a load/store operation is possible because the former does not access any memory: all the operands and destination registers are within the core, so there are no data bus accesses. The absence of this bus access is what makes it possible for the core to execute load and store operations in parallel with the ALU/multiplier operations.

r3.h=(a1+=r6.h*r4.l);

The above instruction is not a multi-issue or a multi-cycle instruction. We could say that if multi-issue instructions use the breadth of the processor, then this instruction uses its depth. I would leave it to the reader to guess how this one might be processed.

Multi-cycle instructions are those which take more than one cycle to execute. This is more of a CISC concept: one instruction gets decoded into multiple simple operations.

  • r3 *= r0;
  • [--sp] = (r7:5, p5:0);

The first operation given above is a 32 bit multiplication, which is not directly possible considering that the available multipliers are 16 bit. Hence this operation is achieved in hardware by using the same 16 bit multipliers but performing more than one multiplication operation.
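As an illustration of the idea (not the actual hardware sequence), a 32 x 32 -> 32 bit multiply can be decomposed into 16 bit partial products like this:

#include <stdint.h>

/* 32x32 -> 32 bit multiply built from 16 bit multiplies. The a_hi*b_hi
   product shifts out past bit 31, so only three partial products matter. */
uint32_t mul32(uint32_t a, uint32_t b)
{
    uint32_t lo    = (a & 0xFFFF) * (b & 0xFFFF);
    uint32_t cross = (a >> 16) * (b & 0xFFFF) + (a & 0xFFFF) * (b >> 16);
    return lo + (cross << 16);
}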

The second operation is a stack push. This instruction specifies that the registers r5 to r7 and p0 to p5 should be pushed onto the stack. The hardware sequentially pushes all the specified registers one by one, which makes this a multi-cycle operation.

All the multi-cycle operations are decoded over and over again!

Catch in Circular Buffering

Some of the features which distinguish a DSP processor are:

  • Multipliers
  • Video ALUs
  • Zero Overhead loop support
  • Circular Buffering support

There was some code which used the circular buffering capabilities of StarCore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly, and it was also written to ensure that the maximum number of bytes were copied per MOVE operation. In other words, if the source and destination were aligned at 8 byte boundaries, the assembly did a MOVE.64 instruction which copied 64 bits at a time.

The pseudo code of the assembly is given below.

memcopy(src, dest, size)

  1. Start of Loop:
  2. if (src%8 == 0) && (dest%8 == 0) && (size >= 8)
  3.     move.64 src,dest (instruction to move 8 bytes)
  4. else if (src%4 == 0) && (dest%4 == 0) && (size >= 4)
  5.     move.32 src,dest (instruction to move 4 bytes)
  6. else if (src%2 == 0) && (dest%2 == 0) && (size >= 2)
  7.     move.16 src,dest (instruction to move 2 bytes)
  8. else move.8 src,dest (byte copy)

The above method adds the overhead of checking the alignment each time, but the clock cycles saved by moving more bytes per move instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8, or at least a multiple of 4.
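A C rendering of the same dispatch, with the pseudo MOVE widths becoming plain typed stores, might look like this sketch (the casts and names are illustrative):

#include <stddef.h>
#include <stdint.h>

void memcopy(const char *src, char *dest, size_t size)
{
    while (size > 0) {
        if ((uintptr_t)src % 8 == 0 && (uintptr_t)dest % 8 == 0 && size >= 8) {
            *(uint64_t *)dest = *(const uint64_t *)src;   /* move.64 */
            src += 8; dest += 8; size -= 8;
        } else if ((uintptr_t)src % 4 == 0 && (uintptr_t)dest % 4 == 0 && size >= 4) {
            *(uint32_t *)dest = *(const uint32_t *)src;   /* move.32 */
            src += 4; dest += 4; size -= 4;
        } else if ((uintptr_t)src % 2 == 0 && (uintptr_t)dest % 2 == 0 && size >= 2) {
            *(uint16_t *)dest = *(const uint16_t *)src;   /* move.16 */
            src += 2; dest += 2; size -= 2;
        } else {
            *dest++ = *src++;                             /* move.8  */
            size -= 1;
        }
    }
}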

The circular buffering was implemented using the index, base, length and modifier registers. Let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0, it makes sure that I0 is reset back to B0. This way circular buffering is maintained without any additional software overhead of checking bounds.
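A minimal C model of this hardware behavior, using the register names above, would be:

/* Post-increment I0 by the modifier M0 and wrap it back into
   [B0, B0 + L0) the way the address generator hardware does. */
unsigned int next_index(unsigned int i0, unsigned int b0,
                        unsigned int l0, unsigned int m0)
{
    i0 += m0;             /* advance by the modifier       */
    if (i0 >= b0 + l0)    /* crossed the end of the buffer */
        i0 -= l0;         /* reset back into the buffer    */
    return i0;
}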

Let's consider a buffer which has the following attributes:

  • Size = 8 bytes
  • Base address = 0x02

Now the writes into this circular buffer of 8 bytes happened in the following order:

  • Write 1 : 2 bytes
  • Write 2 : 4 bytes
  • Write 3 : 4 bytes

Let's see what happens in such a scenario. Below I have given the BlackFin assembly code which does the writes.


_main:
/* Initialize the registers for circular buffering */
i0.l = circular_buff;  /* Buffer address = 0x02 */
i0.h = circular_buff;  /* Buffer address = 0x02 */
l0 = 8;                /* Length of the buffer  */
b0 = i0;               /* Base address          */

r0.l = 0xFFFF;         /* Dummy value which will     */
r0.h = 0xeeee;         /* be written into the buffer */

w[i0++] = r0.l;        /* Write 0xFFFF at address I0 = 0x02 */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x04     */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x08     */
/* At this point the overflow has happened:           */
/* 0xEEEE has been written to 0x0A,                   */
/* which is out of bounds for the array circular_buff */

w[i0] = r0.h;          /* I0 has properly looped back to 0x04 */

As you can see from the comments, the first write of two bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used up locations 0x04 to 0x07.

After the first two writes we need to write 4 more bytes. The location 0x08 is 4 byte aligned and 4 bytes have to be written, so ideally 2 bytes should be written into the addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.

With the above code this won't happen; we end up with a 2 byte overflow, yet the index register loops back properly, leaving the first 2 bytes of the buffer unwritten and pointing itself to 0x04.

This was seen on StarCore when the following conditions were satisfied:

  • The size of the buffer was a multiple of 8 bytes.
  • The start address of the buffer was aligned at a 4 byte boundary, so that the end address of the buffer minus 4 bytes gives an 8 byte aligned address.

When the above conditions were satisfied we ended up having a 4 byte overflow. This is understandable because a load or a store operation consists of the following steps, and they all happen in sequence:

  1. Calculate the load/store address.
  2. Send out the load/store address on the address bus.
  3. Wait for the required number of wait cycles before writing the data onto the store bus or reading data from the load bus.

The circular buffering logic is implemented by the data address generator module in the core, and the bus protocol which loads or stores a value is not aware of this wrap condition, because of which it goes ahead and writes or reads the value from the memory, and we end up with a corruption.

Making sure that the index register is pointing to a valid address is done by executing a simple formula like

Index_new = Index_old + Modifier - Length;

This happens during one of the later stages in the pipeline (most probably the write-back stage or just before it), while the address is generated at a much earlier stage and the data gets read or written at that time. All this makes sense; the hardware is perfectly right when it does the byte overflow.

So this is indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned at an 8 byte boundary.
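With GCC-style toolchains the fix can be as simple as forcing the alignment at the definition; a sketch:

/* Force the circular buffer base onto an 8 byte boundary so that a
   move.64 can never start closer than 8 bytes from the wrap point. */
static unsigned char circular_buff[8] __attribute__((aligned(8)));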

How do Onchip Breakpoints Work?

The internal specifics of onchip breakpoints on Freescale StarCore processors are easily understood if we are familiar with its emulator. So it's a good idea to read the following two posts before going through the details below.

Introduction to Debug modules

On Chip Emulator

 

On Chip Breakpoints:

A descriptive diagram of the whole debug infrastructure is given below.

EOnCE Controller
  • EDU: Event Detection unit
  • EC: Event Counter
  • TB unit: Trace Buffer unit
  • EEx & EED : General purpose control signals which can be configured for input or output. Usage of this is SOC specific (Derivative specific).

Before we configure the breakpoints, the processor core should be placed in DEBUG mode. This DEBUG mode can be invoked using two methods:

  • Send a Command to the controller
  • Assert the EEx signal as soon as the core comes out of reset

Using the second method requires configuring the registers accordingly. Once the core is in DEBUG mode, breakpoints can be configured via JTAG. Here we are not going to discuss the details of register settings, but rather give a design level overview of how this OnChip Emulator works.

The breakpoints/watchpoints should be enabled only after programming the EDU (Event Detection Unit) registers with the configuration which defines that breakpoint: namely the address in memory, the type of the breakpoint etc. These registers can be written via JTAG.

So the two methods employed for enabling an already configured breakpoint/watchpoint are:

  • Configure the breakpoint/watchpoint enabling register by writing into it using the JTAG port.
  • Use the EEx or the EED control lines to signal the EOnCE controller to enable or disable the debug functionalities.

Once we program the debug configuration, the host needs to be informed when an event happens or a breakpoint gets hit. EOnCE ensures that the core enters DEBUG mode by sending it a signal, but for informing the HOST we need a mechanism like an interrupt.

  • We can configure one out of the EEx signals to send an interrupt to the host once the core enters Debug mode.

From the above discussion it's clear that breakpoints are supported in hardware via the setting of a few registers, and they can be configured via EOnCE, which in turn responds to the JTAG interface. If multiple breakpoints are configured, there are mechanisms to identify which one was hit:

  • Read the EOnCE controller status or monitor registers, which will tell you which event got hit.
  • Configure the EEx or EED signals to assert an interrupt to the HOST; depending on which signal asserted the interrupt, the HOST can know which breakpoint got hit.
  • Read the PC breakpoint detection register from EOnCE; this gives the PC of the address which caused the event.

The PC breakpoint detection register should be used in combination with reading the EOnCE status register, because debug mode can be caused by other reasons as well. One important point to note here is that EOnCE gives many options for the usage of the EEx and EED signals, but the way in which they are used is SOC specific, and each platform may put them to different uses.

Similarly, if the event counter has to count "off core" events like cache hits or misses, then the external signals EC0 and EC1 need to be used; this is also SOC dependent.

Hardware Design

 

Logic for configuring breakpoints

Conclusion

It’s obvious that onchip breakpoints need hardware support, but we could still use some more clarity on how this support is provided.

  • EOnCE is closely integrated with the processor core because it needs to probe the address and data buses for the reference values configured in the breakpoint/watchpoint registers.
  • Whenever the processor core accesses an address, EOnCE sees this address value and compares it with the reference value. If they are the same then, depending on the event selector configuration, an event is raised.
  • The event can be a DEBUG exception, a DEBUG signal to the processor, the triggering of a trace, the disabling of a trace etc.

Similar logic can be employed for probing data values as well. In the above diagram we can see two comparators, A & B, attached to the two address buses for data access. If we do not know on which bus the data we want will be accessed, then we need to configure both comparators A & B with the reference value and set the condition to an OR-ing of the comparisons, so that even if only one comparison returns a success an event is raised.

MASK register: The way it works is pretty self explanatory; the sampled address bus values are masked with the MASK register value and then compared with the reference values.
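Putting the comparators and the MASK register together, the event condition can be modeled roughly like this in C (the names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* True when either address bus matches the masked reference value. */
bool edu_event_hit(uint32_t bus_a, uint32_t bus_b,
                   uint32_t reference, uint32_t mask)
{
    bool hit_a = (bus_a & mask) == (reference & mask);  /* Comparator A */
    bool hit_b = (bus_b & mask) == (reference & mask);  /* Comparator B */
    return hit_a || hit_b;  /* OR-ed so one match is enough to raise an event */
}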

The Trace Buffer unit can detect and record program flow related details and send them to an off-core module like Nexus; from Nexus the data can be sent to the host using a Nexus port on the board, or it can be saved in a circular buffer inside any on-chip memory.

EOnCE: The onchip emulator

EOnCE comprises six core components:

• EOnCE Controller
• Event Counter
• Event Detection Unit (EDU)
• Synchronizer
• Event Selector
• Trace Unit

EOnCE Controller:

Without this controller, JTAG won't be able to access or program EOnCE. Using this interface we can put the core into debug mode and also write into EOnCE registers; basically, this is the JTAG gateway into EOnCE.

Event Counter

Counters can count various events! These events can be configured by setting the counter's registers. For example, it can count the number of times a watchpoint was hit, the number of instructions executed, or external events like L1 cache hits, cache misses etc.

Event Detection Unit (EDU)

The EDU forms the core logic of EOnCE. This module can be configured to set breakpoints or watchpoints, or to monitor the addresses on the data bus etc. The EDU can also send a signal to the Event Counter so that it can count the number of times an address was referenced or a piece of data was accessed.

Synchronizer

This module is required to synchronize external signals with the internal clock. There is a set of general purpose signals which can be programmed to do various operations. For example, we can program the general purpose signal EE0 so that the EDU is configured whenever it is asserted through JTAG.

Event Selector

This is the last unit in the debugging chain; in other words, it decides what action needs to be taken when an event is generated by the EDU or the Event Counter. The action can be something like putting the core into debug mode, starting a program trace, or raising a debug exception.

Trace Unit

As the name suggests, this is the hardware which monitors the program counter and generates the call trace.

The picture below depicts the EOnCE controller and its component modules.

EOnCE framework

Debug Modules : Introduction

Working with on-chip debug IP modules was a definite learning experience. It included writing low level software for setting breakpoints and watchpoints, configuring call trace, and also maintaining a hardware dependent high performance data logging module. Such debug modules rarely get the appreciation they deserve; a considerable amount of chip space is dedicated to the hardware IPs which enable us to develop such complex embedded software.

Each processor comes with its own unique ways of implementing debug modules; my understanding is based on experience with StarCore. Let's look at the features of the onchip emulator which provides the infrastructure for doing all the debug activities. Below we can see the diagram of a typical JTAG based debugging setup.

JTAG based Debugging setup

EOnCE (Enhanced Onchip Emulator)

EOnCE is the debug module which adds all the above mentioned capabilities to StarCore based platforms. There are various points we need to note about EOnCE:

  • EOnCE is accessible using JTAG as well as from the core by addressing its memory mapped registers.
  • Simultaneous access of EOnCE from JTAG as well as the core will result in the JTAG access getting prioritized over the StarCore access.
  • JTAG writes into the EOnCE registers using the JTAG port and configures it for debugging related features.
  • When we set a breakpoint or configure a trace using the debugger on the host PC, a message is sent to the EOnCE module on the target; this message makes sure that our debug configurations are applied on the chip.
EOnCE framework

 

Downloading binary using JTAG!

This is one of the most obvious reasons to use a JTAG debugger; we could never deliver anything on time if we had to flash the binary each time we compile and test it. Every module on the chip has a controller (TAP controller) associated with it which makes it JTAG compliant; in other words, this very TAP controller connects the JTAG port to the hardware module.

TAP interfaced with the System JTAG Controller

In the above pic we can see the DMA module having a TAP controller which is interfaced with the System JTAG Controller (SJC), the main JTAG controller. As illustrated, every peripheral and core logic block inside the chip has a TAP controller, and these are accessed and configured using the SJC. Now let's see what steps are involved in downloading code into the target RAM.

  • First we need to place the processor (StarCore) in DEBUG mode; for that we send the DEBUG command to the SJC.
  • The next step is to access EOnCE: execute the command to select the EOnCE TAP controller through the SJC. From this point on we will be able to directly control EOnCE using JTAG.
  • The EOnCE module has the capability to make StarCore execute commands like MOVE and some of the arithmetic operations; this feature of EOnCE is used to transfer the program data into memory.
  • Now send the program data to the EOnCE receive register; next we need to move this program data into StarCore accessible memory, and for that we need to execute a MOVE instruction.
  • Every command which needs to be issued to StarCore through JTAG has to be written into a specific EOnCE command register.
  • So write a MOVE instruction into the core command register to transfer the program data from the receive register to any CORE register (because we can write data into memory only from a CORE register, not from any EOnCE register).
  • Now send a core command to move the data from the core register to the memory location. This completes one transfer.
  • The above steps are repeated till the whole program code is written into the memory.

That makes up the rather elaborate sequence for downloading contents into the RAM.
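The host-side loop therefore reduces to something like the following pseudo-C; every function name here is hypothetical, standing in for the corresponding SJC/EOnCE command sequences.

#include <stdint.h>

/* Hypothetical wrappers around the JTAG/EOnCE command sequences. */
extern void eonce_write_receive_reg(uint32_t word);
extern void eonce_core_cmd_move_ercv_to_reg(void);
extern void eonce_core_cmd_move_reg_to_mem(uint32_t addr);

/* Download 'len' words of program code into the target RAM via EOnCE. */
void jtag_download(const uint32_t *code, uint32_t len, uint32_t addr)
{
    for (uint32_t i = 0; i < len; i++) {
        eonce_write_receive_reg(code[i]);      /* stage one word over JTAG     */
        eonce_core_cmd_move_ercv_to_reg();     /* receive register -> core reg */
        eonce_core_cmd_move_reg_to_mem(addr);  /* core reg -> target memory    */
        addr += 4;                             /* next word                    */
    }
}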

Blackfin Core

This picture is right out of the BlackFin HRM.

core

Without repeating what is already covered in the HRM, let us try to analyse the capabilities of the core from a different perspective.

A load-store architecture means:

  • All the operations are done on registers.
  • That in turn means that all the computational units in the core need their inputs from the registers.
  • This mandates that there be internal buses which connect the computational units to the registers.
  • One set of buses acts as the input and another set transmits the output back to the registers.

The diagram below depicts exactly what we need.

Core Modules connected by internal Buses

This is quite fascinating; we can know exactly why some instructions work while others do not.

Consider the operation:

R0 = R1 + R2;

How does this work? There are two 32 bit buses running from the data register file to ALU0; the values inside R1 & R2 are transferred through them into ALU0, and once the operation is done the result comes back out and gets written into R0. Perfect!

Now why does the following operation not work?

P0 = R0 + R1;

The answer again lies in the bus architecture. Even though R0 and R1 can be transferred to ALU0, we do not have a bus running from ALU0 to the pointer register file, hence there is no way this operation can succeed.

*Please note that by default ALU0 is used; ALU1 comes into the picture only in the case of parallel operations, which can be discussed in another post.

The idea here is to prod us to start thinking from this perspective. When we understand the bus architecture, we understand how to write the best possible assembly code while remaining within the constraints of the system. All the data and instruction buses connecting memory, registers & computational units form the backbone of an architecture; they inherently determine the strengths as well as the weaknesses of the processor.

Load Store Continued

The LS (load-store) story has some more twists which I did not mention in the previous post, simply because it had already become way too long.

Stack Memory?

RISC machines always come with a sufficient number of registers, which in turn results in all the local variables being stored in registers themselves and hardly any stack memory usage. Example code illustrating this phenomenon is given below.

Consider the below given C Code:


int add_on(int a, int b)
{
    int i = 0;
    for (i = 0; i < 10; i++)
    {
        a = a + b;
    }
    return a;
}

The corresponding BlackFin assembly code does not use a single stack variable; the register P0 and the loop counter register (LC0) take care of the purpose of the stack variable "i".

_add_on:
P0 = 10;
/* Initialize the Zero Overhead Loop registers */
Lsetup(Loop_Starts,Loop_Ends) LC0 = P0;
/* Start of the loop */
Loop_Starts: Loop_Ends:
R0 = R0 + r1;
RTS;
_add_on.end:

Zero What! Overhead Loop?

Compilers also reduce the overhead of context switching between function calls. How? Ideally, between calls, not all the registers need to be pushed onto the stack; only those registers which are used inside that particular function are saved and restored, because the rest remain untouched anyway. Hence the costly affair of stack memory usage and slow memory access is minimized. The result is lean stack usage. I have seen this feature on BlackFin compilers, but am yet to see it on StarCore though!

Bogus Store

Within the load-store structure, most of the time the store follows the load. But sometimes, even though in the code we might see a store written before another memory read, we really cannot guarantee the sequence in which they execute. Let me quote the lucid BlackFin HRM.

“The relaxation of synchronization between memory access instructions and their surrounding instructions is referred to as weak ordering of loads and stores. Weak ordering implies that the timing of the actual completion of the memory operations—even the order in which these events occur—may not align with how they appear in the sequence of the program source code.”

“Because of weak ordering, the memory system is allowed to prioritize reads over writes. In this case, a write that is queued anywhere in the pipeline, but not completed, may be deferred by a subsequent read operation, and the read is allowed to be completed before the write. Reads are prioritized over writes because the read operation has a dependent operation waiting on its completion, whereas the processor considers the write operation complete, and the write does not stall the pipeline if it takes more cycles to propagate the value out to memory. This behavior could cause a read that occurs in the program source code after a write in the program flow to actually return its value before the write has been completed. This ordering provides significant performance advantages in the operation of most memory instructions. However, it can cause side effects that the programmer must be aware of to avoid improper system operation.”

When we get optimizations, we also get trade-offs. But let us keep the price limited to what we pay for in the die space, and not write buggy code and learn the hard way.

The side effect mentioned above comes into the picture when we are configuring any kind of registers, or any memory where a read after a write might give different results compared to a read before the write. Consider the following sequence:

1. Write Register 1

2. Read Register 2

What if the Register 2 value depends on the Register 1 write? In such situations we need to enforce strict ordering of reads and writes by using the SYNC instructions (CSYNC & SSYNC).
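A sketch of the fix in C, assuming memory mapped register pointers and an ssync() wrapper (intrinsic or inline asm) for the Blackfin SSYNC instruction; the addresses are hypothetical.

#include <stdint.h>

#define REG1 (*(volatile uint32_t *)0xFFC00000u) /* hypothetical address */
#define REG2 (*(volatile uint32_t *)0xFFC00004u) /* hypothetical address */

extern void ssync(void); /* assumed wrapper emitting the SSYNC instruction */

uint32_t configure_and_read(uint32_t value)
{
    REG1 = value;   /* 1. write register 1                        */
    ssync();        /* drain the write before the dependent read  */
    return REG2;    /* 2. read register 2, now seeing the write   */
}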

What?