Real-time embedded software running on ASICs with limited memory and capabilities also tends to lack paging and virtual memory units, which in turn places constraints on dynamic memory allocation and stack memory usage. Run-time allocation of memory depends heavily on the code execution pattern, and manually analyzing all possible execution paths is not always feasible. An easier way is to have an extensive test suite with memory profiling.
The stack for each task in the system can be optimally sized only by profiling its usage. We are not going to detail the various profiling methods here, but rather jot down a strange observation where the stack allocation of a particular function was abnormally high.
Please take a look at the following example code.
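Here is a minimal sketch of the kind of function in question; apart from x, y & z, the “code” parameter, case ‘c’ and operation_add, the names below are placeholders of my own.

```c
int operation_add(int a, int b);      /* defined elsewhere */

void dummy_function(int code)
{
    int x = 0, y = 1, z = 2;          /* 3 ints: 3 * 4 = 12 bytes of stack */

    switch (code) {
    case 'a':
        x = operation_add(y, z);      /* nested call, needs stack of its own */
        break;
    case 'b':
        y = x + z;
        break;
    case 'c': {
        int p = x, q = y, r = z;      /* 3 ints:   3 * 4 = 12 bytes */
        short s = 1, t = 2, u = 3;    /* 3 shorts: 3 * 2 = 6 bytes  */
        z = p + q + r + s + t + u;    /* case 'c' alone needs 18 more bytes */
        break;
    }
    default:
        break;
    }
}
```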

As you can see, the above dummy function executes one of a bunch of switch cases depending on the “code” passed to it, and there are some stack variables allocated at the start of the function (namely x, y & z). Taking the size of the “int” type to be 4 bytes, we can comfortably say that the stack allocation at the start of the function should be 4 * 3 = 12 bytes. Instead, I noticed that the stack allocated by the ARM assembly code was 30 bytes. The assembly code looked something like what is given below.
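(A hand-written sketch of the pattern, not actual compiler output; in practice an AAPCS-conforming compiler would also round the frame up for 8-byte alignment, but the observed figure of 30 is kept here.)

```asm
dummy_function:
        PUSH    {r4, lr}
        SUB     sp, sp, #30     ; one-shot allocation: 12 + 18 = 30 bytes
        ;; ... switch dispatch and all case bodies share this frame ...
        ADD     sp, sp, #30     ; release the whole frame on exit
        POP     {r4, pc}
```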

It looks like the assembly code allocates the stack just once, after taking into account the maximum stack usage of the function; in other words, the first 3 integers (x, y & z) account for 12 bytes and the switch case ‘c’ accounts for the other 18 bytes (3 integers and 3 short integers, i.e. 12 + 6). So the total depth of the stack needs to be at least 30 bytes. Normally we would have expected a stack allocation inside each switch case, which means the assembly code might have had the following sequence included in it.
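(Again a hypothetical sketch: 12 bytes allocated up front, with the frame grown only on entry to case ‘c’.)

```asm
dummy_function:
        PUSH    {r4, lr}
        SUB     sp, sp, #12     ; only x, y & z allocated up front
        ;; ... dispatch on "code" ...
case_c:
        SUB     sp, sp, #18     ; grow the frame for the case 'c' locals
        ;; ... body of case 'c' ...
        ADD     sp, sp, #18     ; shrink it back before leaving the case
exit:
        ADD     sp, sp, #12
        POP     {r4, pc}
```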

The above sequence adds more code and more execution overhead, so it is not really the most optimal way; instead, the compiler decided to allocate the whole stack in one shot. Execution is faster and the code size is also marginally reduced. Everything looks good, except that the stack usage now takes a hit. How? The stack used inside the function “operation_add” would be allocated right after the 30-byte allocation, but with the “non-optimized” sequence above it would have been allocated after only 12 bytes. This means that the optimization in speed comes with a trade-off in the form of stack usage.
This was indeed an interesting observation; the code must have been compiled with the optimization setting tuned for speed instead of memory, hence the compiler conveniently decides not to care much about stack memory. Even function inlining will add the stack of the inlined function to that of the caller function. Again, this is a typical case of the speed/memory trade-off, as invariably all kinds of optimization are!
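To illustrate the inlining point with a hypothetical sketch (the names here are mine, not from the observed code):

```c
/* A helper with 16 bytes of locals of its own. */
static inline int scale(int v)
{
    int table[4] = { 1, 2, 4, 8 };    /* 16 bytes of stack */
    return v * table[v & 3];
}

int caller(int v)
{
    int acc = 0;                      /* 4 bytes of stack */
    /* Once scale() is inlined, its 16 bytes are carved out of
       caller's frame, so caller needs roughly 20 bytes instead of 4
       (ignoring whatever the compiler keeps in registers). */
    acc += scale(v);
    return acc;
}
```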