Part 2 – Bluetooth Physical Layer

Content Index:

Introduction : Bluetooth
Part 1: Bluetooth Framework
Part 2: Physical Layer
Part 3: Logical Transports (Link Layer)

 
A bottom-up approach is ideal for understanding BT: grasping the physical medium clarifies how data bits are transmitted over the air and how the higher abstractions are built over this elementary framework. The two primary attributes that define the physical layer are explained below:

1. Frequency Usage

BT employs frequencies from 2400 MHz to 2483.5 MHz, the unlicensed 2.4 GHz ISM band. In a typical wireless communication scenario there can be other devices in the vicinity operating on the same frequencies, and this can cause interference on the BT channel. To overcome such noise, BT articulates a frequency hopping scheme in which the frequency of communication is switched every 625 µs (but never outside the 2.4 GHz unlicensed spectrum); this unique hopping pattern is mutually agreed by the devices during connection establishment.
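To make the hop timing concrete, below is a minimal C sketch of the idea: one channel per 625 µs slot, always inside the 79-channel map that starts at 2402 MHz. The hop_channel mixing function and the seed value are illustrative stand-ins; the actual hop-selection kernel in the spec mixes the master's address and clock in a more involved way.

#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS 79U   /* BR/EDR channels, 1 MHz wide */
#define BASE_MHZ     2402U /* channel 0 sits at 2402 MHz  */

/* Toy hop illustration: pick a pseudo-random channel for each
 * 625 us slot; the result never leaves the 2.4 GHz ISM band. */
static uint32_t hop_channel(uint32_t slot, uint32_t seed)
{
    return (slot * 0x9E3779B9U ^ seed) % NUM_CHANNELS;
}

int main(void)
{
    uint32_t seed = 0x5A5AU; /* stand-in for the pattern agreed at connection setup */
    for (uint32_t slot = 0; slot < 5; slot++)
        printf("slot %u -> %u MHz\n", slot, BASE_MHZ + hop_channel(slot, seed));
    return 0;
}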

2. Modulation 

Across all the possible frequencies there are multiple modulation schemes, and the scheme determines the speed of transmission over the air. BT supports the following modulations:

  • BDR scheme (Basic Data Rate, with an over-the-air speed of 1 Mbps), and
  • EDR scheme (Enhanced Data Rate, with over-the-air speeds of 2 and 3 Mbps).

A packet transmitted over the air can belong to either of these two modulation schemes, and its header tells the receiver which scheme is used; note that the header itself is always sent in the BDR scheme.

Physical Channel Transmission

We know how frequency and modulation schemes are used to encode data over the air, but there also has to be an efficient way of utilizing the medium. Ethernet has CSMA; BT works in a slightly different manner.

  1. Similar to wired protocols like SPI and I2C, there is a master-slave relationship between BT devices; it is pure hegemony, wherein the master controls the medium and decides when a slave can transmit.
  2. The physical medium is also divided into time slots of 625 µs, and each slot is reserved for either the master or the slave, as illustrated in the diagram below.
BT Time Slots

Master Tx always happens in an even-numbered slot and slave Tx in an odd one. In the above graphical illustration we can see the following transmissions:

  1. The master and slave exchange single-slot packets in the first 3 slots.
  2. The slave then sends a three-slot packet, to which the master responds with a five-slot packet.
  3. Finally, the slave sends another three-slot packet.

Each BT device has a 28-bit clock running in its controller which identifies the 625 µs boundaries and initiates a transmission (Tx) or a reception (Rx). Data is transmitted using different packet structures; some packet types occupy more than one slot, and needless to say every packet needs to be acknowledged by the receiving device. It is critical to note that a packet transmission always occupies an odd number of slots, which means the slot immediately following the last packet is reserved for the response from the receiver. Understanding the whole protocol stack is simpler if we start by studying its genetic constitution, in other words the physical layer, which exposes the strengths and weaknesses of the system in their most elementary form.
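A minimal sketch of how that 28-bit clock maps to slots and to Tx ownership; the helper names are mine, and the 312.5 µs tick (two ticks per 625 µs slot) is the BR/EDR clock definition:

#include <stdbool.h>
#include <stdint.h>

#define CLK_MASK 0x0FFFFFFFU /* 28-bit clock, one tick per 312.5 us */

/* Two ticks make one 625 us slot, so the slot number is CLK >> 1. */
static inline uint32_t slot_number(uint32_t clk)
{
    return (clk & CLK_MASK) >> 1;
}

/* Master Tx starts on even slots, slave Tx on odd ones. */
static inline bool master_owns_slot(uint32_t clk)
{
    return (slot_number(clk) & 1U) == 0;
}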

 


Part 1 – Bluetooth Framework

Content Index:

Introduction : Bluetooth
Part 1: Bluetooth Framework
Part 2: Physical Layer
Part 3: Logical Transports (Link Layer)

 

Bluetooth Protocol Layers

The core protocol stack of Bluetooth has the above three main layers:

Physical Layer: Decides how the bits are transmitted over the air, i.e. the modulation schemes and the packet structures used for communication.

Logical Layer: Bluetooth is a connection-oriented protocol, which means there is connection setup, maintenance and tear down; this layer handles those activities. Each type of connection has its own set of associated packets, and each connection type is tailored for a different purpose like music playback, hands-free calling etc.

L2CAP : L2CAP multiplexes data of various applications (or BT profiles) over the lower layer logical links and connections.

Bluetooth System

A typical Bluetooth architecture looks like the above diagram: L2CAP and the layers above it sit on the host, while the logical and physical layers sit on the controller. “Host” usually means the main processor, and the controller is a small BT chip sold by vendors like Broadcom and CSR. Host processors in laptops are usually Intel CPUs running Windows/Linux; in the handheld market the host is usually an ARM SoC running Android/iOS/WP7 or some other embedded OS.

There is usually a transport layer which connects the host to the BT controller, typically a high-speed UART interface. There is one more hardware block, the lowest of all the layers: the analog RF, which actually transmits the bits over the air, receives the response and passes it to the physical layer. This analog logic is usually outside the BT controller chip. An interesting aspect is that the main consumer of power in this system is the RF; the better the RF, the more optimized the power consumption.
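As a concrete taste of that transport, here is a hedged sketch of sending the standard HCI_Reset command over the common H4 UART framing; uart_write is a hypothetical platform primitive, not a real driver API.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical blocking write into the UART connected to the controller. */
extern void uart_write(const uint8_t *buf, size_t len);

/* HCI_Reset over H4: 0x01 marks a command packet, then the 16-bit
 * opcode (OGF 0x03, OCF 0x0003 -> 0x0C03) little-endian, then a
 * zero parameter length. */
void hci_send_reset(void)
{
    static const uint8_t pkt[] = { 0x01, 0x03, 0x0C, 0x00 };
    uart_write(pkt, sizeof pkt);
}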

 

Bluetooth

Working on Bluetooth (BT) for a few weeks proved to be quite beneficial; I got a reasonably good grip on its fundamentals, so exploring them here might prove to be a useful future reference, and might also help someone else who wants to get started on this connectivity mechanism.

BT was quite revealing, mostly because prior to this I had never worked on any wireless protocol. Even though the BT specification is a great piece of documentation (unlike the IEEE 802.11 spec), it does manage to abruptly throw jargon that raises more questions than it answers. To overcome this, one needs to wrestle through the full 3000 pages of the spec searching for explanations of the newfound keywords, which may spiral into more doubts. Hopefully, with these posts we can craft some clarity on the BT framework: just the elementary skeleton, backed by critical design information. This should act as a preface to the full-blown BT spec so that the latter can be read in a more orderly way.

Content Index:

Part 1: Bluetooth Framework

Part 2: Physical Layer

Part 3: Logical Transports (Link Layer)

NVIDIA Acquisition of Icera

What are the prospects for Nvidia having bought Icera?

The Icera acquisition is strategic for NVIDIA in many ways.

  • Considerable effort is required to develop a modem DSP core and RF from scratch, so acquisition seems the logically correct way to go. More importantly, this move complements their existing strength in application engines.
  • Every semiconductor company aims to offer OEMs a complete solution, which is profitable in terms of margin and much easier to manage. Currently NVIDIA has to integrate a separate third-party modem core sitting outside the application SoC, whereas companies like Qualcomm and STE follow a better design where both the modem and the application engine are built into the same SoC. This is better in terms of density of integration, power consumption and cost; the downside is more SoC complexity, which these companies can handle anyway.
  • NVIDIA is also aggressively moving toward a position where they want to deliver the complete system, and the Icera acquisition is a move in that same direction. I would expect more acquisitions from NVIDIA in the future, possibly in the connectivity domain. As far as I know NVIDIA has no expertise in technologies like WLAN, BT and NFC; these are very critical, and companies like CSR are ideal takeover candidates.

In general, the handset market is moving in a direction where only complete platform providers will remain. It looks inevitable that single-component manufacturers will either perish or get acquired.

NAND flash musings

It has been quite some time since the last post; life has been busy, thanks to the NAND flash chips from Toshiba and Samsung. Ironically enough, their seemingly naive data sheets introduce NAND as an angelic technology: simple protocols, an even simpler hardware interface. A totally reasonable requirement is placed on the driver to fix one-bit errors and detect two-bit errors (which are not supposed to happen, but for some unknown reason vendors mention this requirement anyway; I would be ecstatic to know why). A touch of complexity is felt only when bad blocks are encountered, which is totally fair considering the cost effectiveness of NANDs.

 

My initial impression of NAND as a fairly simple, hassle-free storage medium was progressively crushed to shreds during the last one year of NAND torments. I have worked only on SLC NANDs from Toshiba and Samsung, which are extensively used on mobile handset platforms, so MLC is an unknown inferno to me. Hopefully the points below will spare posterity from enduring the same crisis. Always remember to religiously follow the data sheet (henceforth referred to as “the book”) for NAND salvation.

  • Keep innovative operation sequences for hobby projects.
  1. Do not try stunts like issuing a NAND reset command while the NAND is busy unless the book clearly explains its effect on read, program and erase operations with a CLEAR timing diagram.
  2. Do NOT use read-back checks to detect bad blocks unless that is mentioned as one of the methods in the book.
  3. MORAL: Follow ONLY what is written in the book; do not infer, or even worse, assume.
  • Read wear leveling cannot prevent bit errors, nor can an erase refresh solve them.
  1. I have managed to induce bit errors on Samsung NAND flash by executing partial page writes beyond the maximum number specified for a page, and also by executing multiple partial page reads. Interestingly, even after continuous block erases, the single-bit read errors refused to disappear.
  2. Any deviation from the strict protocol mentioned in the book can result in strange symptoms manifesting.
  3. BTW: a deterministic read wear count is a myth unless it is mentioned in the book.
  4. MORAL: Symptoms and root causes never have a 1:1 ratio.
  • Never go back and correct mistakes within a block.
  1. Samsung NAND flashes “prohibit” going back to a lower-numbered page in a block and reprogramming it (e.g. do not program page 10 after programming page 20 within a block). The effect of such an operation is not documented, so you do not know what symptoms may incarnate, or in what form. A minimal software guard for this rule is sketched after this list.
  2. Go ahead and question the logic of any file system which does random page programming in a block to mark dirty pages!
  3. MORAL: Do not question what the book says, just blindly follow it.
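The guard mentioned above, as a minimal sketch; the struct, PAGES_PER_BLOCK and nand_program_page are hypothetical driver details, not from any vendor book.

#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64 /* typical for the SLC parts discussed here */

/* Track the last page programmed in an open block and refuse any
 * request that would program a lower-numbered page after a higher one. */
struct open_block {
    int last_programmed; /* -1 while the block is still empty */
};

extern int nand_program_page(uint32_t block, uint32_t page,
                             const uint8_t *data, size_t len);

int block_program(struct open_block *b, uint32_t block, uint32_t page,
                  const uint8_t *data, size_t len)
{
    if (page >= PAGES_PER_BLOCK || (int)page <= b->last_programmed)
        return -1; /* would violate the sequential-programming rule */

    int rc = nand_program_page(block, page, data, len);
    if (rc == 0)
        b->last_programmed = (int)page;
    return rc;
}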

 

Man Versus Compiler

Man versus machine has always been an exhilarating contest, whether it's movie making (2001: A Space Odyssey, Terminator, The Matrix), chess (Kasparov vs. Deep Blue) or maybe coding (compiler-generated code vs. programmer-generated code).

A Space Odyssey

The compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst-written C code performed like a Formula One car. If you are writing system software, it seems futile to attempt to increase performance with direct assembly coding. There are of course some exceptions:

  • Certain parts of an operating system have to be written in assembly because they are platform-dependent code (for example, the trap interrupt in Linux is generated by code written in assembly).
  • DSP algorithms which need to be optimized using SIMD capabilities.

For device driver coding and generic firmware development we hardly need to invest time in writing assembly; well-written modular C code will suffice.

It’s a popular misconception that critical code which needs to perform well has to be written in assembly. For example, I remember some tech leads insisting that interrupt service routines be written in assembly; I am pretty sure that if they had done some empirical analysis on their code they would have got some rude shocks.

A programmer should be able to finish the assigned work in the most optimal way; there is no point investing weeks in optimizing and writing code in assembly to save a few cycles, because you may later find that the code's performance deteriorates for different inputs. Usually the following steps constitute an optimal methodology:

  • Write the code in C
  • Do profiling of the code and identify bottlenecks
  • Identify those critical parts of the code which should be improved
  • Analyze the assembly code generated for those critical parts and see if any optimizations can be done with minimal time investment and maximum results (for example, if we focus our optimization on code which executes in a loop, we get more returns)

Now, in an ideal scenario with a reasonable time investment, there is no way a programmer can beat the compiler in terms of performance optimization. So is there a way we can beat the compiler? One possible advantage a programmer has over the compiler is knowing the nature of the input; if his optimization strategy is focused on exploiting these nuances of the input, then he is bound to get amazing results.

Consider an example code which counts the number of odd and even elements in a list.

The C Code for the same is given below:

#include <stdbool.h>

#define NUM_ELEMENTS 40

extern int array[NUM_ELEMENTS]; /* input list                     */
int final_num;                  /* result: count of even elements */
int final_odd_num;              /* result: count of odd elements  */

void count_list(void)
{
    int  i    = 0;
    bool temp = 0;
    int  even = 0;
    int  odd  = 0;

    for (i = 0; i < NUM_ELEMENTS; i++)
    {
        temp = (array[i] & 0x00000001); /* LSB decides odd/even */

        if (temp == 0)
        {
            even = even + 1;
        }
        else
        {
            odd = odd + 1;
        }
    }
    final_num     = even;
    final_odd_num = odd;
}

If we analyse the code, almost all the cycles are spent in the loop where we check for the odd and even elements.

Let's look at the assembly code generated for the loop.

/* Set up the zero overhead loop */

LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ;            /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ;     /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ; /* Check the first bit and jump if set */
NOP ;                          /* NOP to prevent unintended increment */
R1 = R1 + R2 ;                 /* If branch not taken then increment even */
P0 += 4 ;                      /* Increment the address to be read */

…

R0 = R0 + R2 ;                 /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ;    /* Jump back to the start of the loop */

Now, can we optimize the above loop? The main bottleneck is the jump: each conditional jump misprediction costs us 8 core clocks.

How can we reduce the cost of this jump? We can analyse the input pattern of the array; let's say the input array always consists of a majority of odd elements, so that the code above suffers a branch misprediction in the majority of cases. Let's turn the tables. Below is the optimized loop.

/* Set up the loop */

lsetup(Loop_starts,Loop_ends) LC0 = p0;
Loop_starts:
r1 = extract(r0,r2.l)(z) || r0 = [i0++]; /* Check bit 0 and read the next element at the same time */
cc = r1;                  /* Assign LSB to CC flag */
if cc jump odd_num(bp);   /* If CC is set then jump (note the branch prediction) */
r5 += 1;                  /* If branch not taken then increment even */
odd_num:
NOP;
Loop_ends:
r6 += 1;                  /* If branch taken then increment odd */

The above code has two critical changes inside the loop.

  • Parallel instruction execution
  • Branch Prediction for conditional jump

Parallel instruction execution is an obvious advantage. Branch prediction reduced the penalty for odd numbers to 4 clock cycles but increased the penalty for even numbers to 8 clock cycles. Since the odd numbers are in the majority, on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to exploit the nuances present in the input pattern; the compiler is oblivious to such details and cannot generate code ideal for all kinds of inputs, but we, on the other hand, can tailor the assembly to cater to certain specific types of inputs.
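For contrast, a C-level alternative is to remove the branch altogether instead of tuning its prediction. This sketch (function and parameter names are mine) accumulates the LSBs, so there is nothing to mispredict for any input mix, and compilers can typically vectorize the loop:

#include <stddef.h>

void count_list_branchless(const int *array, size_t n, int *even, int *odd)
{
    int odd_count = 0;

    for (size_t i = 0; i < n; i++)
        odd_count += array[i] & 1; /* LSB is 1 for odd elements */

    *odd  = odd_count;
    *even = (int)n - odd_count;
}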

Multi-Issue instructions and Multi-cycle instructions

BlackFin Pipeline

Execution of a machine instruction is divided into various steps, like fetching, decoding and execution. These steps play different roles and hence need different hardware units; dedicating different units also enables simultaneous execution. Essentially, an instruction pipeline! BlackFin is not a superscalar processor, but a certain amount of parallel execution can happen within an execution unit.

Core Modules connected by internal Buses

If we observe the above architecture, we can broadly classify the BlackFin core into two units:

  • ALU + Multipliers + Shifter + Video ALUs
  • Data Address Generators

Herein lies the key to identifying the instructions which can execute in parallel: broadly, we should be able to execute computational-unit operations and load/store operations in parallel.

r1 = extract(r0,r2.l)(z)||r0 = [i0++]||w[i1]=r5.l;

Given above is a multi-issue instruction; it combines a 32-bit instruction (extract) with two 16-bit instructions (a load and a store).

The extract instruction is executed by the barrel shifter hardware, and the load/store instructions are executed by the data address generators, so we have both modules of the core working in parallel.

Looking at this from a “load-store” architecture point of view, we can add one more observation: such parallel execution of a computational operation and load/store operations is possible because the former does not access any memory. All its operands and destination registers are within the core, so there are no data bus accesses. The absence of this bus access is what makes it possible for the core to execute load and store operations in parallel with the ALU/multiplier operations.

r3.h=(a1+=r6.h*r4.l);

The above instruction is neither a multi-issue nor a multi-cycle instruction; we could say that if multi-issue instructions use the breadth of the processor, then this instruction uses its depth. I will leave it to the reader to guess how it might be processed.

Multi-cycle instructions are those which take more than one cycle to execute; this is more like a CISC concept, where one instruction gets decoded into multiple simple operations.

  • r3 *= r0;
  • [--sp] = (r7:5, p5:0);

The first operation given above is a 32-bit multiplication, which is not directly possible given that the available multipliers are 16-bit; the operation is therefore achieved in hardware by using the same 16-bit multipliers but performing more than one multiplication operation, as the decomposition below illustrates.
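A rough C model of that decomposition using schoolbook 16-bit partial products (illustrative only, not necessarily the exact micro-sequence the hardware runs):

#include <stdint.h>

/* a*b mod 2^32 = ((a_hi*b_lo + a_lo*b_hi) << 16) + a_lo*b_lo,
 * where the a_hi*b_hi term shifts out of the low 32 bits entirely. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
    uint32_t cross = (a_hi * b_lo + a_lo * b_hi) & 0xFFFF;

    return (cross << 16) + a_lo * b_lo;
}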

The second operation is a stack push; this instruction specifies that registers r5 to r7 and p0 to p5 should be pushed onto the stack. The hardware pushes all the specified registers sequentially, one by one, which makes it a multi-cycle operation.
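A toy C model of how the hardware sequences such a push, one store per register; the names here are illustrative:

#include <stdint.h>

/* [--sp] = (r7:5, p5:0) expands into one pre-decremented stack store
 * per register, which is what makes the instruction multi-cycle. */
void push_range(uint32_t **sp, const uint32_t *regs, int count)
{
    for (int i = 0; i < count; i++)
        *--(*sp) = regs[i];
}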

All the multi-cycle operations are decoded over and over again!

Catch in Circular Buffering

Some of the features which distinguish a DSP processor are:

  • Multipliers
  • Video ALUs
  • Zero Overhead loop support
  • Circular Buffering support

There was a piece of code which used the circular buffering capabilities of StarCore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly, and it was written to copy the maximum number of bytes per MOVE operation. In other words, if the source and destination were aligned at 8-byte boundaries, the assembly did a MOVE.64 instruction which copied 64 bits at a time.

The pseudo code of the assembly is given below.

memcopy(src,dest,size)

  1. Start of Loop:
  2. if (src%8 == 0) &&(dest%8 == 0)&&(size >=8)
  3. move.64 src,dest (instruction to move 8 bytes)
  4. else if (src%4 == 0) &&(dest%4 == 0)&&(size >=4)
  5. move.32 src,dest (instruction to move 4 bytes)
  6. else if (src%2 == 0) &&(dest%2 == 0)&&(size >=2)
  7. move.16 src,dest (instruction to move 2 bytes)
  8. else move.8 src,dest  (byte copy)

The above method adds the overhead of checking the alignment on each iteration, but the clock cycles saved by moving more bytes per MOVE instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8, or at least of 4. A C rendering of this scheme is sketched below.
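This sketch follows the pseudo code, under the assumption that pointers failing the alignment checks simply fall through to narrower moves:

#include <stddef.h>
#include <stdint.h>

void wide_memcopy(uint8_t *dst, const uint8_t *src, size_t size)
{
    while (size) {
        if ((uintptr_t)src % 8 == 0 && (uintptr_t)dst % 8 == 0 && size >= 8) {
            *(uint64_t *)dst = *(const uint64_t *)src;   /* move.64 */
            src += 8; dst += 8; size -= 8;
        } else if ((uintptr_t)src % 4 == 0 && (uintptr_t)dst % 4 == 0 && size >= 4) {
            *(uint32_t *)dst = *(const uint32_t *)src;   /* move.32 */
            src += 4; dst += 4; size -= 4;
        } else if ((uintptr_t)src % 2 == 0 && (uintptr_t)dst % 2 == 0 && size >= 2) {
            *(uint16_t *)dst = *(const uint16_t *)src;   /* move.16 */
            src += 2; dst += 2; size -= 2;
        } else {
            *dst++ = *src++; size--;                     /* move.8  */
        }
    }
}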

The circular buffering was implemented using the index, base, length and modifier registers; let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0, I0 is reset to B0. This way circular buffering is maintained without the additional software overhead of bounds checking.

Let's consider a buffer with the following attributes:

  • Size = 8 bytes
  • Base address = 0x02

Now, the writes into this circular buffer of 8 bytes happened in the following order:

  • Write 1: 2 bytes
  • Write 2: 4 bytes
  • Write 3: 4 bytes

Let's see what happens in such a scenario. Below is the BlackFin assembly code which does the writes.


_main:
/* Initialize the registers for circular buffering */
i0.l = circular_buff;  /* Buffer address - 0x02 */
i0.h = circular_buff;  /* Buffer address - 0x02 */
l0 = 8;                /* Length of the buffer  */
b0 = i0;               /* Base address          */

r0.l = 0xFFFF;         /* Dummy value which will     */
r0.h = 0xeeee;         /* be written into the buffer */

w[i0++] = r0.l;        /* Write 0xFFFF at address I0 = 0x02 */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x04     */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x08     */
/* At this point the overflow has happened:             */
/* 0xEEEE has been written to 0x0A,                     */
/* which is out of bounds for the array circular_buff   */

w[i0] = r0.h;          /* I0 has properly looped back to 0x04 */

As the comments show, the first write of two bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used locations 0x04 to 0x07.

After the first two writes we need to write 4 more bytes. The location 0x08 is 4-byte aligned and 4 bytes are to be written, so ideally 2 bytes should be written to addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.

With the above code this won't happen: we end up with a 2-byte overflow, yet the index register loops back properly, leaves the first 2 bytes of the buffer unwritten, and points itself to 0x04.
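A toy C model of why this happens: the store goes out at the un-wrapped address, and only afterwards is the index wrapped. Running it reproduces the 2-byte overflow and the legal-looking loop back to 0x04:

#include <stdint.h>
#include <stdio.h>

/* Model of the DAG post-modify: write first, wrap the index later. */
static uint32_t dag_store(uint32_t i0, uint32_t b0, uint32_t l0, uint32_t nbytes)
{
    printf("store %u bytes at 0x%02X..0x%02X (buffer ends at 0x%02X)\n",
           nbytes, i0, i0 + nbytes - 1, b0 + l0 - 1);
    i0 += nbytes;                /* modifier added first...   */
    if (i0 >= b0 + l0)
        i0 -= l0;                /* ...then the circular wrap */
    return i0;
}

int main(void)
{
    uint32_t i0 = 0x02, b0 = 0x02, l0 = 8;
    i0 = dag_store(i0, b0, l0, 2); /* 0x02..0x03            */
    i0 = dag_store(i0, b0, l0, 4); /* 0x04..0x07            */
    i0 = dag_store(i0, b0, l0, 4); /* 0x08..0x0B: overflow! */
    return 0;
}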

This was seen on the StarCore when the following conditions were satisfied:

  • The size of the buffer was a multiple of 8 bytes.
  • The start address of the buffer was aligned at a 4-byte boundary, so that the end address of the buffer minus 4 bytes gave an 8-byte aligned address.

When the above conditions were satisfied we ended up with a 4-byte overflow. This is understandable, because a load or store operation consists of the following steps, all of which happen in sequence:

  1. Calculate the Load/Store address
  2. Send out the Load/Store address on the address bus
  3. Wait the required number of wait cycles before writing the data onto the store bus or reading data from the load bus.

The circular buffering logic is implemented by the data address generator module in the core, and the bus protocol which loads or stores a value is not aware of this wrap condition, because of which it goes ahead and writes or reads the value from memory, and we get a corruption.

Making sure that the index register points to a valid address is done by evaluating a simple formula like

Index-new = Index-old + Modifier – Length;

This happens during one of the later stages of the pipeline (most probably the write-back stage, or just before it), while the address is generated at a much earlier stage and the data gets read or written at that time. All this makes sense; the hardware is perfectly right when it does a byte overflow.

So this is indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned at an 8-byte boundary.

How does Onchip Breakpoints work?

The internal specifics of on-chip breakpoints on Freescale StarCore processors are easily understood if we are familiar with its emulator, so it's a good idea to read the following two posts before going through the details below.

Introduction to Debug modules

On Chip Emulator

 

On Chip Breakpoints:

A descriptive diagram of the whole debug infrastructure is given below.

EOnCE Controller
  • EDU: Event Detection unit
  • EC: Event Counter
  • TB unit: Trace Buffer unit
  • EEx & EED: General-purpose control signals which can be configured for input or output. Usage of these is SOC specific (derivative specific).

Before we configure the breakpoints, the processor core should be placed in DEBUG mode. This DEBUG mode can be invoked using two methods:

  • Send a Command to the controller
  • Assert the EEx signal as soon as the core comes out of reset

Using the second method requires configuring the relevant registers accordingly. Once the core is in DEBUG mode, breakpoints can be configured via JTAG. Here we are not going to discuss the details of the register settings, but rather a design-level overview of how this on-chip emulator works.

The breakpoints/watchpoints should be enabled only after programming the EDU (Event Detection Unit) registers with the configuration which defines that breakpoint, namely the address in memory, the type of the breakpoint etc. These registers can be written via JTAG.

So the two methods employed for enabling an already configured breakpoint/watchpoint are:

  • Configure the breakpoint/watchpoint enabling register by writing into it using the JTAG port.
  • Use the EEx or the EED control lines to signal the EOnCE controller to enable or disable the debug functionalities.

Once we program the debug configuration, the host needs to be informed when an event happens or when a breakpoint gets hit. EOnCE will ensure that the core enters DEBUG mode by sending it a signal, but for informing the HOST we need a mechanism like an interrupt.

  • We can configure one out of the EEx signals to send an interrupt to the host once the core enters Debug mode.

From the above discussion it's clear that breakpoints are supported in hardware via the setting of a few registers, and they can be configured via EOnCE, which in turn responds to the JTAG interface. If multiple breakpoints are configured, there are mechanisms to identify which one was hit:

  • Read the EOnCE controller status or monitor registers, which will tell you which event was hit.
  • Configure the EEx or EED signals to assert an interrupt to the HOST; depending on which signal asserted the interrupt, the HOST can know which breakpoint was hit.
  • Read the PC breakpoint detection register from EOnCE, which gives the PC of the address which caused the event.

The PC breakpoint detection register should be used in combination with reading the EOnCE status register, because debug mode can be caused by other reasons as well. One important point to note here is that EOnCE gives many options for the usage of the EEx and EED signals, but the way they are used is SOC specific, and each platform may put them to different uses.

Similarly, if the event counter has to count “off-core” events like cache hits or misses, the external signals EC0 and EC1 need to be used; this too is SOC dependent.

Hardware Design

 

Logic for configuring breakpoints

Conclusion

It’s obvious that onchip breakpoints need support in hardware, but we could still use some more clarity on how this support is provided.

  • EOnCE is closely integrated with the processor core because it needs to probe address and data buses for certain reference values configured in breakpoint/watchpoint registers.
  • Whenever the processor core accesses an address, EOnCE comes to know about this address value and compares it with the reference value. If they are the same then, depending on the event selector configuration, an event will be raised.
  • The event can be a DEBUG exception, DEBUG signal to the processor, triggering a trace, disabling a trace etc.

Similar logic can be employed for probing data values as well. In the above diagram we can see two comparators, A and B; these sit on the two address buses for data access. If we do not know on which bus the data we want will be accessed, then we need to configure both comparators A and B with the reference value and set the condition to an OR-ing of the comparisons, so that even if only one comparison returns a success an event will be raised.

MASK register: the way it works is pretty self-explanatory; the sampled address bus values are masked with the MASK register value and then compared with the reference values, as sketched below.
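That comparator-plus-mask behaviour in a small C sketch (struct and function names are mine, not EOnCE register names):

#include <stdbool.h>
#include <stdint.h>

struct comparator {
    uint32_t ref;  /* reference address/data value */
    uint32_t mask; /* MASK register value          */
};

static inline bool comp_match(const struct comparator *c, uint32_t bus_value)
{
    return (bus_value & c->mask) == (c->ref & c->mask);
}

/* OR-ing comparators A and B covers either data-address bus. */
static inline bool event_raised(const struct comparator *a,
                                const struct comparator *b, uint32_t bus_value)
{
    return comp_match(a, bus_value) || comp_match(b, bus_value);
}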

The trace buffer unit can detect and record program-flow related details and send them to an off-core module like Nexus; from Nexus they can be sent to the host using a Nexus port on the board, or saved in a circular buffer inside any on-chip memory.

EOnCE: The onchip emulator

EOnCE comprises six core components:

• EOnCE Controller
• Event Counter
• Event Detection Unit (EDU)
• Synchronizer
• Event Selector
• Trace Unit

EOnCE Controller:

Without this controller, JTAG won't be able to access or program EOnCE. Using this interface we can put the core into debug mode and also write into EOnCE registers; basically, this is the JTAG gateway into EOnCE.

Event Counter

Counters can count various events! The events can be configured by setting the counter's registers; for example, it can count the number of times a watchpoint was hit, the number of instructions executed, or external events like L1 cache hits, cache misses etc.

Event Detection Unit (EDU)

The EDU forms the core logic of EOnCE; this module can be configured to set breakpoints or watchpoints, or to monitor the addresses on the data bus etc. The EDU can also send a signal to the Event Counter so that it can count the number of times an address was referenced or a piece of data was accessed.

Synchronizer

This module is required to synchronize external signals with the internal clock. There is a set of general-purpose signals which can be programmed to do various operations; for example, we can configure the general-purpose signal EE0 so that asserting it through JTAG configures the EDU.

Event Selector

This is the last unit in the debugging chain; in other words, it decides what action needs to be taken when an event is generated by the EDU or the Event Counter. The action can be something like putting the core into debug mode, starting a program trace, or raising a debug exception.

Trace Unit

As the name suggests, this is the hardware which monitors the program counter and generates the call trace.

The picture below depicts the EOnCE controller and its component modules.

EOnCE framework