Part 2 – Bluetooth Physical Layer

Content Index:

Introduction : Bluetooth
Part 1: Bluetooth Framework
Part 2: Physical Layer
Part 3: Logical Transports (Link Layer)

 
A bottom-up approach is ideal for understanding BT: grasping the physical medium clarifies how data bits are transmitted over the air and how the higher abstractions are built over this elementary framework. The two primary attributes that define the physical layer are explained below:

1. Frequency Usage

BT employs frequencies from 2400 MHz to 2483.5 MHz, the unlicensed 2.4 GHz ISM band. In a typical wireless communication scenario there can be other devices in the vicinity operating on the same frequencies, and this can cause interference on the BT channel. To overcome such noise, BT articulates a frequency hopping scheme in which the frequency of communication is switched every 625 µs (but never outside the 2.4 GHz unlicensed spectrum); this unique hopping pattern is mutually agreed by the devices during connection establishment.
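To make the hop timing concrete, below is a minimal C sketch of the idea: one channel per 625 µs slot, always inside the 79-channel map that starts at 2402 MHz. The hop_channel mixing function and the seed value are illustrative stand-ins; the actual hop-selection kernel in the spec mixes the master's address and clock in a more involved way.

#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS 79U   /* BR/EDR channels, 1 MHz wide */
#define BASE_MHZ     2402U /* channel 0 sits at 2402 MHz  */

/* Toy hop illustration: pick a pseudo-random channel for each
 * 625 us slot; the result never leaves the 2.4 GHz ISM band. */
static uint32_t hop_channel(uint32_t slot, uint32_t seed)
{
    return (slot * 0x9E3779B9U ^ seed) % NUM_CHANNELS;
}

int main(void)
{
    uint32_t seed = 0x5A5AU; /* stand-in for the pattern agreed at connection setup */
    for (uint32_t slot = 0; slot < 5; slot++)
        printf("slot %u -> %u MHz\n", slot, BASE_MHZ + hop_channel(slot, seed));
    return 0;
}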

2. Modulation 

Across all the possible frequencies there are multiple modulation schemes, and the scheme determines the speed of transmission over the air. BT supports the following modulations:

  • BDR scheme (Basic Data Rate, with an over-the-air speed of 1 Mbps), and
  • EDR scheme (Enhanced Data Rate, with over-the-air speeds of 2 and 3 Mbps).

A packet transmitted over the air can belong to either of these two modulation schemes, and its header tells the receiver which scheme is used; note that the header itself is always sent in the BDR scheme.

Physical Channel Transmission

We know how frequency and modulation schemes are used to encode data over the air, but there also has to be an efficient way of utilizing the medium. Ethernet has CSMA; BT works in a slightly different manner.

  1. Similar to wired protocols like SPI and I2C, there is a master-slave relationship between BT devices; it is pure hegemony, wherein the master controls the medium and decides when a slave can transmit.
  2. The physical medium is also divided into time slots of 625 µs, and each slot is reserved for either the master or the slave, as illustrated in the diagram below.
BT Time Slots

Master Tx always happens in an even-numbered slot and slave Tx in an odd one. In the above graphical illustration we can see the following transmissions:

  1. The master and slave exchange single-slot packets in the first 3 slots.
  2. The slave then sends a three-slot packet, to which the master responds with a five-slot packet.
  3. Finally, the slave sends another three-slot packet.

Each BT device has a 28-bit clock running in its controller which identifies the 625 µs boundaries and initiates a transmission (Tx) or a reception (Rx). Data is transmitted using different packet structures; some packet types occupy more than one slot, and needless to say every packet needs to be acknowledged by the receiving device. It is critical to note that a packet transmission always occupies an odd number of slots, which means the slot immediately following the last packet is reserved for the response from the receiver. Understanding the whole protocol stack is simpler if we start by studying its genetic constitution, in other words the physical layer, which exposes the strengths and weaknesses of the system in their most elementary form.
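A minimal sketch of how that 28-bit clock maps to slots and to Tx ownership; the helper names are mine, and the 312.5 µs tick (two ticks per 625 µs slot) is the BR/EDR clock definition:

#include <stdbool.h>
#include <stdint.h>

#define CLK_MASK 0x0FFFFFFFU /* 28-bit clock, one tick per 312.5 us */

/* Two ticks make one 625 us slot, so the slot number is CLK >> 1. */
static inline uint32_t slot_number(uint32_t clk)
{
    return (clk & CLK_MASK) >> 1;
}

/* Master Tx starts on even slots, slave Tx on odd ones. */
static inline bool master_owns_slot(uint32_t clk)
{
    return (slot_number(clk) & 1U) == 0;
}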

 


Part 1 – Bluetooth Framework

Content Index:

Introduction : Bluetooth
Part 1: Bluetooth Framework
Part 2: Physical Layer
Part 3: Logical Transports (Link Layer)

 

Bluetooth Protocol Layers

The core protocol stack of Bluetooth has the above three main layers:

Physical Layer: Decides how the bits are transmitted over the air, i.e. the modulation schemes and the packet structures used for communication.

Logical Layer: Bluetooth is a connection-oriented protocol, which means there is connection setup, maintenance and tear down; this layer handles those activities. Each type of connection has its own set of associated packets, and each connection type is tailored for a different purpose like music playback, hands-free calling etc.

L2CAP : L2CAP multiplexes data of various applications (or BT profiles) over the lower layer logical links and connections.

Bluetooth System

A typical Bluetooth architecture looks like the above diagram: L2CAP and the layers above it sit on the host, while the logical and physical layers sit on the controller. “Host” usually means the main processor, and the controller is a small BT chip sold by vendors like Broadcom and CSR. Host processors in laptops are usually Intel CPUs running Windows/Linux; in the handheld market the host is usually an ARM SoC running Android/iOS/WP7 or some other embedded OS.

There is usually a transport layer which connects the host to the BT controller, typically a high-speed UART interface. There is one more hardware block, the lowest of all the layers: the analog RF, which actually transmits the bits over the air, receives the response and passes it to the physical layer. This analog logic is usually outside the BT controller chip. An interesting aspect is that the main consumer of power in this system is the RF; the better the RF, the more optimized the power consumption.
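As a concrete taste of that transport, here is a hedged sketch of sending the standard HCI_Reset command over the common H4 UART framing; uart_write is a hypothetical platform primitive, not a real driver API.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical blocking write into the UART connected to the controller. */
extern void uart_write(const uint8_t *buf, size_t len);

/* HCI_Reset over H4: 0x01 marks a command packet, then the 16-bit
 * opcode (OGF 0x03, OCF 0x0003 -> 0x0C03) little-endian, then a
 * zero parameter length. */
void hci_send_reset(void)
{
    static const uint8_t pkt[] = { 0x01, 0x03, 0x0C, 0x00 };
    uart_write(pkt, sizeof pkt);
}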

 

Bluetooth

Working on Bluetooth (BT) for a few weeks proved to be quite beneficial; I got a reasonably good grip on its fundamentals, so exploring them here might prove to be a useful future reference, and might also help someone else who wants to get started on this connectivity mechanism.

BT was quite revealing, mostly because prior to this I had never worked on any wireless protocol. Even though the BT specification is a great piece of documentation (unlike the IEEE 802.11 spec), it does manage to abruptly throw jargon that raises more questions than it answers. To overcome this, one needs to wrestle through the full 3000 pages of the spec searching for explanations of the newfound keywords, which may spiral into more doubts. Hopefully, with these posts we can craft some clarity on the BT framework: just the elementary skeleton, backed by critical design information. This should act as a preface to the full-blown BT spec so that the latter can be read in a more orderly way.

Content Index:

Part 1: Bluetooth Framework

Part 2: Physical Layer

Part 3: Logical Transports (Link Layer)

NVIDIA Acquisition of Icera

What are the prospects for Nvidia having bought Icera?

The Icera acquisition is strategic for NVIDIA in many ways.

  • Considerable effort is required to develop a modem DSP core and RF from scratch, so acquisition seems the logically correct way to go. More importantly, this move complements their existing strength in application engines.
  • Every semiconductor company aims to offer OEMs a complete solution, which is profitable in terms of margin and much easier to manage. Currently NVIDIA has to integrate a separate third-party modem core sitting outside the application SoC, whereas companies like Qualcomm and STE follow a better design where both the modem and the application engine are built into the same SoC. This is better in terms of density of integration, power consumption and cost; the downside is more SoC complexity, which these companies can handle anyway.
  • NVIDIA is also aggressively moving toward a position where they want to deliver the complete system, and the Icera acquisition is a move in that same direction. I would expect more acquisitions from NVIDIA in the future, possibly in the connectivity domain. As far as I know NVIDIA has no expertise in technologies like WLAN, BT and NFC; these are very critical, and companies like CSR are ideal takeover candidates.

In general, the handset market is moving in a direction where only complete platform providers will remain. It looks inevitable that single-component manufacturers will either perish or get acquired.

NAND flash musings

It has been quite some time since the last post; life has been busy, thanks to the NAND flash chips from Toshiba and Samsung. Ironically enough, their seemingly naive data sheets introduce NAND as an angelic technology: simple protocols, an even simpler hardware interface. A totally reasonable requirement is placed on the driver to fix one-bit errors and detect two-bit errors (which are not supposed to happen, but for some unknown reason vendors mention this requirement anyway; I would be ecstatic to know why). A touch of complexity is felt only when bad blocks are encountered, which is totally fair considering the cost effectiveness of NANDs.

 

My initial impression of NAND as a fairly simple, hassle-free storage medium was progressively crushed to shreds during the last one year of NAND torments. I have worked only on SLC NANDs from Toshiba and Samsung, which are extensively used on mobile handset platforms, so MLC is an unknown inferno to me. Hopefully the points below will spare posterity from enduring the same crisis. Always remember to religiously follow the data sheet (henceforth referred to as “the book”) for NAND salvation.

  • Keep innovative operation sequences for hobby projects.
  1. Do not try stunts like issuing a NAND reset command while the NAND is busy unless the book clearly explains its effect on read, program and erase operations with a CLEAR timing diagram.
  2. Do NOT use read-back checks to detect bad blocks unless that is mentioned as one of the methods in the book.
  3. MORAL: Follow ONLY what is written in the book; do not infer, or even worse, assume.
  • Read wear leveling cannot prevent bit errors, nor can an erase refresh solve them.
  1. I have managed to induce bit errors on Samsung NAND flash by executing partial page writes beyond the maximum number specified for a page, and also by executing multiple partial page reads. Interestingly, even after continuous block erases, the single-bit read errors refused to disappear.
  2. Any deviation from the strict protocol mentioned in the book can result in strange symptoms manifesting.
  3. BTW: a deterministic read wear count is a myth unless it is mentioned in the book.
  4. MORAL: Symptoms and root causes never have a 1:1 ratio.
  • Never go back and correct mistakes within a block.
  1. Samsung NAND flashes “prohibit” going back to a lower-numbered page in a block and reprogramming it (e.g. do not program page 10 after programming page 20 within a block). The effect of such an operation is not documented, so you do not know what symptoms may incarnate, or in what form. A minimal software guard for this rule is sketched after this list.
  2. Go ahead and question the logic of any file system which does random page programming in a block to mark dirty pages!
  3. MORAL: Do not question what the book says, just blindly follow it.
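The guard mentioned above, as a minimal sketch; the struct, PAGES_PER_BLOCK and nand_program_page are hypothetical driver details, not from any vendor book.

#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64 /* typical for the SLC parts discussed here */

/* Track the last page programmed in an open block and refuse any
 * request that would program a lower-numbered page after a higher one. */
struct open_block {
    int last_programmed; /* -1 while the block is still empty */
};

extern int nand_program_page(uint32_t block, uint32_t page,
                             const uint8_t *data, size_t len);

int block_program(struct open_block *b, uint32_t block, uint32_t page,
                  const uint8_t *data, size_t len)
{
    if (page >= PAGES_PER_BLOCK || (int)page <= b->last_programmed)
        return -1; /* would violate the sequential-programming rule */

    int rc = nand_program_page(block, page, data, len);
    if (rc == 0)
        b->last_programmed = (int)page;
    return rc;
}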

 

Man Versus Compiler

Man versus machine has always been an exhilarating contest, whether it's movie making (2001: A Space Odyssey, Terminator, The Matrix), chess (Kasparov vs. Deep Blue) or maybe coding (compiler-generated code vs. programmer-generated code).

A Space Odyssey

The compilers of the Analog Devices and Freescale DSPs I have been working on exhibited such maturity that even the worst-written C code performed like a Formula One car. If you are writing system software, it seems futile to attempt to increase performance with direct assembly coding. There are of course some exceptions:

  • Certain parts of an operating system have to be written in assembly because they are platform-dependent code (for example, the trap interrupt in Linux is generated by code written in assembly).
  • DSP algorithms which need to be optimized using SIMD capabilities.

For device driver coding and generic firmware development we hardly need to invest time in writing assembly; well-written modular C code will suffice.

It’s a popular misconception that critical code which needs to perform well has to be written in assembly. For example, I remember some tech leads insisting that interrupt service routines be written in assembly; I am pretty sure that if they had done some empirical analysis on their code they would have got some rude shocks.

A programmer should be able to finish the assigned work in the most optimal way; there is no point investing weeks in optimizing and writing code in assembly to save a few cycles, because you may later find that the code's performance deteriorates for different inputs. Usually the following steps constitute an optimal methodology:

  • Write the code in C
  • Do profiling of the code and identify bottlenecks
  • Identify those critical parts of the code which should be improved
  • Analyze the assembly code generated for those critical parts and see if any optimizations can be done with minimal time investment and maximum results (for example, if we focus our optimization on code which executes in a loop, we get more returns)

Now, in an ideal scenario with a reasonable time investment, there is no way a programmer can beat the compiler in terms of performance optimization. So is there a way we can beat the compiler? One possible advantage a programmer has over the compiler is knowing the nature of the input; if his optimization strategy is focused on exploiting these nuances of the input, then he is bound to get amazing results.

Consider an example code which counts the number of odd and even elements in a list.

The C Code for the same is given below:

#include <stdbool.h>

#define NUM_ELEMENTS 40

extern int array[NUM_ELEMENTS]; /* input list                     */
int final_num;                  /* result: count of even elements */
int final_odd_num;              /* result: count of odd elements  */

void count_list(void)
{
    int  i    = 0;
    bool temp = 0;
    int  even = 0;
    int  odd  = 0;

    for (i = 0; i < NUM_ELEMENTS; i++)
    {
        temp = (array[i] & 0x00000001); /* LSB decides odd/even */

        if (temp == 0)
        {
            even = even + 1;
        }
        else
        {
            odd = odd + 1;
        }
    }
    final_num     = even;
    final_odd_num = odd;
}

If we analyse the code, almost all the cycles are spent in the loop where we check for the odd and even elements.

Let's look at the assembly code generated for the loop.

/* Set up the zero overhead loop */

LSETUP ( 4 /*0xFFA0006C*/ , 14 /*0xFFA00076*/ ) LC0 = P1 ;
R3 = [ P0 + 0x0 ] ;            /* Read the element from the list */
CC = BITTST ( R3 , 0x0 ) ;     /* Read the first bit */
IF CC JUMP 30 /*0xFFA0008E*/ ; /* Check the first bit and jump if set */
NOP ;                          /* NOP to prevent unintended increment */
R1 = R1 + R2 ;                 /* If branch not taken then increment even */
P0 += 4 ;                      /* Increment the address to be read */

…

R0 = R0 + R2 ;                 /* Increment counter for odd */
JUMP.S -26 /*0xFFA00076*/ ;    /* Jump back to the start of the loop */

Now, can we optimize the above loop? The main bottleneck is the jump: each conditional jump misprediction costs us 8 core clocks.

How can we reduce the cost of this jump? We can analyse the input pattern of the array; let's say the input array always consists of a majority of odd elements, so that the code above suffers a branch misprediction in the majority of cases. Let's turn the tables. Below is the optimized loop.

/* Set up the loop */

lsetup(Loop_starts,Loop_ends) LC0 = p0;
Loop_starts:
r1 = extract(r0,r2.l)(z) || r0 = [i0++]; /* Check bit 0 and read the next element at the same time */
cc = r1;                  /* Assign LSB to CC flag */
if cc jump odd_num(bp);   /* If CC is set then jump (note the branch prediction) */
r5 += 1;                  /* If branch not taken then increment even */
odd_num:
NOP;
Loop_ends:
r6 += 1;                  /* If branch taken then increment odd */

The above code has two critical changes inside the loop.

  • Parallel instruction execution
  • Branch Prediction for conditional jump

Parallel instruction execution is an obvious advantage. Branch prediction reduced the penalty for odd numbers to 4 clock cycles but increased the penalty for even numbers to 8 clock cycles. Since the odd numbers are in the majority, on the whole we end up reducing the cycle consumption by almost 30%. So one of the most effective ways to beat the compiler is to exploit the nuances present in the input pattern; the compiler is oblivious to such details and cannot generate code ideal for all kinds of inputs, but we, on the other hand, can tailor the assembly to cater to certain specific types of inputs.
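For contrast, a C-level alternative is to remove the branch altogether instead of tuning its prediction. This sketch (function and parameter names are mine) accumulates the LSBs, so there is nothing to mispredict for any input mix, and compilers can typically vectorize the loop:

#include <stddef.h>

void count_list_branchless(const int *array, size_t n, int *even, int *odd)
{
    int odd_count = 0;

    for (size_t i = 0; i < n; i++)
        odd_count += array[i] & 1; /* LSB is 1 for odd elements */

    *odd  = odd_count;
    *even = (int)n - odd_count;
}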

Multi-Issue instructions and Multi-cycle instructions

BlackFin Pipeline

Execution of a machine instruction is divided into various steps, like fetching, decoding and execution. These steps play different roles and hence need different hardware units; dedicating different units also enables simultaneous execution. Essentially, an instruction pipeline! BlackFin is not a superscalar processor, but a certain amount of parallel execution can happen within an execution unit.

Core Modules connected by internal Buses

If we observe the above architecture, we can broadly classify the BlackFin core into two units:

  • ALU + Multipliers + Shifter + Video ALUs
  • Data Address Generators

Herein lies the key to identifying the instructions which can execute in parallel: broadly, we should be able to execute computational-unit operations and load/store operations in parallel.

r1 = extract(r0,r2.l)(z)||r0 = [i0++]||w[i1]=r5.l;

Given above is a multi-issue instruction; it combines a 32-bit instruction (extract) with two 16-bit instructions (a load and a store).

The extract instruction is executed by the barrel shifter hardware, and the load/store instructions are executed by the data address generators, so we have both modules of the core working in parallel.

Looking at this from a “load-store” architecture point of view, we can add one more observation: such parallel execution of a computational operation and load/store operations is possible because the former does not access any memory. All its operands and destination registers are within the core, so there are no data bus accesses. The absence of this bus access is what makes it possible for the core to execute load and store operations in parallel with the ALU/multiplier operations.

r3.h=(a1+=r6.h*r4.l);

The above instruction is neither a multi-issue nor a multi-cycle instruction; we could say that if multi-issue instructions use the breadth of the processor, then this instruction uses its depth. I will leave it to the reader to guess how it might be processed.

Multi-cycle instructions are those which take more than one cycle to execute; this is more like a CISC concept, where one instruction gets decoded into multiple simple operations.

  • r3 *= r0;
  • [--sp] = (r7:5, p5:0);

The first operation given above is a 32-bit multiplication, which is not directly possible given that the available multipliers are 16-bit; the operation is therefore achieved in hardware by using the same 16-bit multipliers but performing more than one multiplication operation, as the decomposition below illustrates.
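A rough C model of that decomposition using schoolbook 16-bit partial products (illustrative only, not necessarily the exact micro-sequence the hardware runs):

#include <stdint.h>

/* a*b mod 2^32 = ((a_hi*b_lo + a_lo*b_hi) << 16) + a_lo*b_lo,
 * where the a_hi*b_hi term shifts out of the low 32 bits entirely. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
    uint32_t cross = (a_hi * b_lo + a_lo * b_hi) & 0xFFFF;

    return (cross << 16) + a_lo * b_lo;
}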

The second operation is a stack push; this instruction specifies that registers r5 to r7 and p0 to p5 should be pushed onto the stack. The hardware pushes all the specified registers sequentially, one by one, which makes it a multi-cycle operation.
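A toy C model of how the hardware sequences such a push, one store per register; the names here are illustrative:

#include <stdint.h>

/* [--sp] = (r7:5, p5:0) expands into one pre-decremented stack store
 * per register, which is what makes the instruction multi-cycle. */
void push_range(uint32_t **sp, const uint32_t *regs, int count)
{
    for (int i = 0; i < count; i++)
        *--(*sp) = regs[i];
}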

All the multi-cycle operations are decoded over and over again!

Catch in Circular Buffering

Some of the features which distinguish a DSP processor are:

  • Multipliers
  • Video ALUs
  • Zero Overhead loop support
  • Circular Buffering support

There was a piece of code which used the circular buffering capabilities of StarCore to write into a common buffer. The code which copied data into the buffer was optimized by writing it in assembly, and it was written to copy the maximum number of bytes per MOVE operation. In other words, if the source and destination were aligned at 8-byte boundaries, the assembly did a MOVE.64 instruction which copied 64 bits at a time.

The pseudo code of the assembly is given below.

memcopy(src,dest,size)

  1. Start of Loop:
  2. if (src%8 == 0) &&(dest%8 == 0)&&(size >=8)
  3. move.64 src,dest (instruction to move 8 bytes)
  4. else if (src%4 == 0) &&(dest%4 == 0)&&(size >=4)
  5. move.32 src,dest (instruction to move 4 bytes)
  6. else if (src%2 == 0) &&(dest%2 == 0)&&(size >=2)
  7. move.16 src,dest (instruction to move 2 bytes)
  8. else move.8 src,dest  (byte copy)

The above method adds the overhead of checking the alignment on each iteration, but the clock cycles saved by moving more bytes per MOVE instruction far outweigh it, because the size of most of the data written into this circular buffer was a multiple of 8, or at least of 4. A C rendering of this scheme is sketched below.
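This sketch follows the pseudo code, under the assumption that pointers failing the alignment checks simply fall through to narrower moves:

#include <stddef.h>
#include <stdint.h>

void wide_memcopy(uint8_t *dst, const uint8_t *src, size_t size)
{
    while (size) {
        if ((uintptr_t)src % 8 == 0 && (uintptr_t)dst % 8 == 0 && size >= 8) {
            *(uint64_t *)dst = *(const uint64_t *)src;   /* move.64 */
            src += 8; dst += 8; size -= 8;
        } else if ((uintptr_t)src % 4 == 0 && (uintptr_t)dst % 4 == 0 && size >= 4) {
            *(uint32_t *)dst = *(const uint32_t *)src;   /* move.32 */
            src += 4; dst += 4; size -= 4;
        } else if ((uintptr_t)src % 2 == 0 && (uintptr_t)dst % 2 == 0 && size >= 2) {
            *(uint16_t *)dst = *(const uint16_t *)src;   /* move.16 */
            src += 2; dst += 2; size -= 2;
        } else {
            *dst++ = *src++; size--;                     /* move.8  */
        }
    }
}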

The circular buffering was implemented using the index, base, length and modifier registers; let's name them I0, B0, L0 and M0. The hardware behaves in such a way that as soon as the address in I0 reaches the value B0 + L0, I0 is reset to B0. This way circular buffering is maintained without the additional software overhead of bounds checking.

Let's consider a buffer with the following attributes:

  • Size = 8 bytes
  • Base address = 0x02

Now, the writes into this circular buffer of 8 bytes happened in the following order:

  • Write 1: 2 bytes
  • Write 2: 4 bytes
  • Write 3: 4 bytes

Let's see what happens in such a scenario. Below is the BlackFin assembly code which does the writes.


_main:
/* Initialize the registers for circular buffering */
i0.l = circular_buff;  /* Buffer address - 0x02 */
i0.h = circular_buff;  /* Buffer address - 0x02 */
l0 = 8;                /* Length of the buffer  */
b0 = i0;               /* Base address          */

r0.l = 0xFFFF;         /* Dummy value which will     */
r0.h = 0xeeee;         /* be written into the buffer */

w[i0++] = r0.l;        /* Write 0xFFFF at address I0 = 0x02 */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x04     */
[i0++] = r0;           /* Write 0xEEEEFFFF at I0 = 0x08     */
/* At this point the overflow has happened:             */
/* 0xEEEE has been written to 0x0A,                     */
/* which is out of bounds for the array circular_buff   */

w[i0] = r0.h;          /* I0 has properly looped back to 0x04 */

As the comments show, the first write of two bytes took up locations 0x02 and 0x03, and the second write of 4 bytes used locations 0x04 to 0x07.

After the first two writes we need to write 4 more bytes. The location 0x08 is 4-byte aligned and 4 bytes are to be written, so ideally 2 bytes should be written to addresses 0x08 and 0x09, and the third and fourth bytes should be written to the start of the buffer after a loop back.

With the above code this won't happen: we end up with a 2-byte overflow, yet the index register loops back properly, leaves the first 2 bytes of the buffer unwritten, and points itself to 0x04.
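A toy C model of why this happens: the store goes out at the un-wrapped address, and only afterwards is the index wrapped. Running it reproduces the 2-byte overflow and the legal-looking loop back to 0x04:

#include <stdint.h>
#include <stdio.h>

/* Model of the DAG post-modify: write first, wrap the index later. */
static uint32_t dag_store(uint32_t i0, uint32_t b0, uint32_t l0, uint32_t nbytes)
{
    printf("store %u bytes at 0x%02X..0x%02X (buffer ends at 0x%02X)\n",
           nbytes, i0, i0 + nbytes - 1, b0 + l0 - 1);
    i0 += nbytes;                /* modifier added first...   */
    if (i0 >= b0 + l0)
        i0 -= l0;                /* ...then the circular wrap */
    return i0;
}

int main(void)
{
    uint32_t i0 = 0x02, b0 = 0x02, l0 = 8;
    i0 = dag_store(i0, b0, l0, 2); /* 0x02..0x03            */
    i0 = dag_store(i0, b0, l0, 4); /* 0x04..0x07            */
    i0 = dag_store(i0, b0, l0, 4); /* 0x08..0x0B: overflow! */
    return 0;
}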

This was seen on the StarCore when the following conditions were satisfied:

  • The size of the buffer was a multiple of 8 bytes.
  • The start address of the buffer was aligned at a 4-byte boundary, so that the end address of the buffer minus 4 bytes gave an 8-byte aligned address.

When the above conditions were satisfied we ended up with a 4-byte overflow. This is understandable, because a load or store operation consists of the following steps, all of which happen in sequence:

  1. Calculate the Load/Store address
  2. Send out the Load/Store address on the address bus
  3. Wait the required number of wait cycles before writing the data onto the store bus or reading data from the load bus.

The circular buffering logic is implemented by the data address generator module in the core, and the bus protocol which loads or stores a value is not aware of this wrap condition, because of which it goes ahead and writes or reads the value from memory, and we get a corruption.

Making sure that the index register points to a valid address is done by evaluating a simple formula like

Index-new = Index-old + Modifier – Length;

This happens during one of the later stages of the pipeline (most probably the write-back stage, or just before it), while the address is generated at a much earlier stage and the data gets read or written at that time. All this makes sense; the hardware is perfectly right when it does a byte overflow.

So this is indeed a bug in the software, which we solved by making sure that the base address of the circular buffer is always aligned at an 8-byte boundary.

How does Onchip Breakpoints work?

The internal specifics of on-chip breakpoints on Freescale StarCore processors are easily understood if we are familiar with its emulator, so it's a good idea to read the following two posts before going through the details below.

Introduction to Debug modules

On Chip Emulator

 

On Chip Breakpoints:

A descriptive diagram of the whole debug infrastructure is given below.

EOnCE Controller
  • EDU: Event Detection unit
  • EC: Event Counter
  • TB unit: Trace Buffer unit
  • EEx & EED: General-purpose control signals which can be configured for input or output. Usage of these is SOC specific (derivative specific).

Before we configure the breakpoints, the processor core should be placed in DEBUG mode. This DEBUG mode can be invoked using two methods:

  • Send a Command to the controller
  • Assert the EEx signal as soon as the core comes out of reset

Using the second method requires configuring the relevant registers accordingly. Once the core is in DEBUG mode, breakpoints can be configured via JTAG. Here we are not going to discuss the details of the register settings, but rather a design-level overview of how this on-chip emulator works.

The breakpoints/watchpoints should be enabled only after programming the EDU (Event Detection Unit) registers with the configuration which defines that breakpoint, namely the address in memory, the type of the breakpoint etc. These registers can be written via JTAG.

So the two methods employed for enabling an already configured breakpoint/watchpoint are:

  • Configure the breakpoint/watchpoint enabling register by writing into it using the JTAG port.
  • Use the EEx or the EED control lines to signal the EOnCE controller to enable or disable the debug functionalities.

Once we program the debug configuration, the host needs to be informed when an event happens or when a breakpoint gets hit. EOnCE will ensure that the core enters DEBUG mode by sending it a signal, but for informing the HOST we need a mechanism like an interrupt.

  • We can configure one out of the EEx signals to send an interrupt to the host once the core enters Debug mode.

From the above discussion it's clear that breakpoints are supported in hardware via the setting of a few registers, and they can be configured via EOnCE, which in turn responds to the JTAG interface. If multiple breakpoints are configured, there are mechanisms to identify which one was hit:

  • Read the EOnCE controller status or monitor registers, which will tell you which event was hit.
  • Configure the EEx or EED signals to assert an interrupt to the HOST; depending on which signal asserted the interrupt, the HOST can know which breakpoint was hit.
  • Read the PC breakpoint detection register from EOnCE, which gives the PC of the address which caused the event.

The PC breakpoint detection register should be used in combination with reading the EOnCE status register, because debug mode can be caused by other reasons as well. One important point to note here is that EOnCE gives many options for the usage of the EEx and EED signals, but the way they are used is SOC specific, and each platform may put them to different uses.

Similarly, if the event counter has to count “off-core” events like cache hits or misses, the external signals EC0 and EC1 need to be used; this too is SOC dependent.

Hardware Design

 

Logic for configuring breakpoints

Conclusion

It’s obvious that onchip breakpoints need support in hardware, but we could still use some more clarity on how this support is provided.

  • EOnCE is closely integrated with the processor core because it needs to probe address and data buses for certain reference values configured in breakpoint/watchpoint registers.
  • Whenever the processor core accesses an address, EOnCE comes to know about this address value and compares it with the reference value. If they are the same then, depending on the event selector configuration, an event will be raised.
  • The event can be a DEBUG exception, DEBUG signal to the processor, triggering a trace, disabling a trace etc.

Similar logic can be employed for probing data values as well. In the above diagram we can see two comparators, A and B; these sit on the two address buses for data access. If we do not know on which bus the data we want will be accessed, then we need to configure both comparators A and B with the reference value and set the condition to an OR-ing of the comparisons, so that even if only one comparison returns a success an event will be raised.

MASK register: the way it works is pretty self-explanatory; the sampled address bus values are masked with the MASK register value and then compared with the reference values, as sketched below.
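That comparator-plus-mask behaviour in a small C sketch (struct and function names are mine, not EOnCE register names):

#include <stdbool.h>
#include <stdint.h>

struct comparator {
    uint32_t ref;  /* reference address/data value */
    uint32_t mask; /* MASK register value          */
};

static inline bool comp_match(const struct comparator *c, uint32_t bus_value)
{
    return (bus_value & c->mask) == (c->ref & c->mask);
}

/* OR-ing comparators A and B covers either data-address bus. */
static inline bool event_raised(const struct comparator *a,
                                const struct comparator *b, uint32_t bus_value)
{
    return comp_match(a, bus_value) || comp_match(b, bus_value);
}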

The trace buffer unit can detect and record program-flow related details and send them to an off-core module like Nexus; from Nexus they can be sent to the host using a Nexus port on the board, or saved in a circular buffer inside any on-chip memory.

EOnCE: The onchip emulator

EOnCE comprises six core components:

• EOnCE Controller
• Event Counter
• Event Detection Unit (EDU)
• Synchronizer
• Event Selector
• Trace Unit

EOnCE Controller:

Without this controller, JTAG won't be able to access or program EOnCE. Using this interface we can put the core into debug mode and also write into EOnCE registers; basically, this is the JTAG gateway into EOnCE.

Event Counter

Counters can count various events! The events can be configured by setting the counter's registers; for example, it can count the number of times a watchpoint was hit, the number of instructions executed, or external events like L1 cache hits, cache misses etc.

Event Detection Unit (EDU)

The EDU forms the core logic of EOnCE; this module can be configured to set breakpoints or watchpoints, or to monitor the addresses on the data bus etc. The EDU can also send a signal to the Event Counter so that it can count the number of times an address was referenced or a piece of data was accessed.

Synchronizer

This module is required to synchronize external signals with the internal clock. There is a set of general-purpose signals which can be programmed to do various operations; for example, we can configure the general-purpose signal EE0 so that asserting it through JTAG configures the EDU.

Event Selector

This is the last unit in the debugging chain; in other words, it decides what action needs to be taken when an event is generated by the EDU or the Event Counter. The action can be something like putting the core into debug mode, starting a program trace, or raising a debug exception.

Trace Unit

As the name suggests, this is the hardware which monitors the program counter and generates the call trace.

The picture below depicts the EOnCE controller and its component modules.

EOnCE framework