Often the best way to learn how to avoid something is to experience it. With this in mind, have you experienced a genuinely nasty stack overflow? I don’t mean a quick CPU fault exception. No, I mean a stack overflow that cripples the software gradually over thousands of CPU cycles. If you haven’t seen it, you need to watch video lesson #10:
Lesson 10 – Stack overflow and other pitfalls of functions
The video shows that the call stack contains not just the return addresses but also automatic variables (variables defined inside functions, such as the array foo). But unlike the file-scope variables, automatic variables are uninitialized and contain garbage. To help you remember that, I suggest you think of the stack as a pile of dirty dishes. It’s simply disgusting to use any part of the stack without first “washing it clean” by explicitly initializing every automatic variable.
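The contrast between the two kinds of variables can be sketched in a few lines of C. This is only an illustration, not code from the video: the names boot_count, get_boot_count() and sum_samples() are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* File-scope variable: static storage duration, so it starts out as zero
 * even without an explicit initializer. */
static uint32_t boot_count;

/* Hypothetical accessor, only here to make the zero start observable. */
uint32_t get_boot_count(void)
{
    return boot_count;
}

/* Hypothetical helper that adds up n samples. */
uint32_t sum_samples(const uint8_t *samples, size_t n)
{
    /* "Washing the dish": without "= 0U" the initial value of sum would be
     * whatever garbage happens to be on the stack at this moment. */
    uint32_t sum = 0U;

    for (size_t i = 0U; i < n; ++i) {
        sum += samples[i];
    }
    return sum;
}
```

Leaving out the initializer of sum would still compile, which is exactly what makes this pitfall easy to miss.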
The lifetime of a variable is the time during which the variable has valid memory. The lifetime of an automatic variable begins with the call to the corresponding function and ends when that function returns. Therefore, a function should never return a pointer to its automatic variable because, as shown in the video, after the function returns all its automatic variables lie outside the active part of the stack, where the next function call or interrupt is free to overwrite them.
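One common way to stay clear of this trap is to let the caller provide the storage. The sketch below assumes a hypothetical function make_ramp(), which is not from the video. The faulty variant appears only inside a comment, so that the example itself stays well-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* WRONG (kept as a comment on purpose):
 *
 *   uint8_t *make_ramp_bad(void) {
 *       uint8_t ramp[4] = {0U, 1U, 2U, 3U};
 *       return ramp;    // the lifetime of ramp ends right here
 *   }
 */

/* Better: the caller owns the buffer, so the buffer outlives this call.
 * Fills out[0..len-1] with 0, 1, 2, ... */
void make_ramp(uint8_t *out, size_t len)
{
    for (size_t i = 0U; i < len; ++i) {
        out[i] = (uint8_t)i;
    }
}
```

The caller would declare its own array, for instance uint8_t buf[4], and pass it as make_ramp(buf, 4U). The data then lives as long as the caller's variable does.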
On the other hand, you can pass pointers to automatic variables into the called functions, such as swap() called from main() in the video. The pointers remain valid because the lifetime of the automatic variables from main() lasts for the whole duration of the call to swap().
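A minimal sketch of this pattern follows. The exact signature of swap() in the video may differ, and swap_demo() is a made-up stand-in for main().

```c
/* Exchanges the two ints that x and y point to. */
void swap(int *x, int *y)
{
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

/* Stand-in for main(): a and b are automatic variables of the caller.
 * Returns 1 if the values were exchanged, 0 otherwise. */
int swap_demo(void)
{
    int a = 1;
    int b = 2;

    /* &a and &b stay valid for the whole call, because swap_demo()
     * has not returned yet while swap() is running. */
    swap(&a, &b);

    return (a == 2) && (b == 1);
}
```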
The most peculiar property of automatic variables is that there may be many instances of the same variable with overlapping lifetimes. That happens because every call of the function creates an entirely new set of automatic variables that persist until the function returns. You could see this in the video, where each call of the fact() function created a new instance of the foo array on the stack.
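The overlapping lifetimes can be made visible with a small experiment. The sketch below is not the fact() from the video: it replaces the foo array with a single variable named mine, and the bookkeeping (live, depth, instances_seen, MAX_DEPTH) is invented for the illustration. The deepest call looks back at the instances that belong to all its callers and counts how many of them still hold their own value.

```c
#include <stdint.h>

#define MAX_DEPTH 8U

static const uint32_t *live[MAX_DEPTH]; /* address of 'mine' in each active call */
static uint32_t depth;                  /* number of fact() calls currently active */
static uint32_t instances_seen;         /* filled in by the deepest call */

uint32_t fact(uint32_t n)
{
    uint32_t mine = n;   /* every call gets its own, brand-new instance */
    uint32_t result;

    if (depth < MAX_DEPTH) {
        live[depth] = &mine;
    }
    ++depth;

    if (n <= 1U) {
        /* Deepest call: none of the callers has returned yet, so their
         * instances of 'mine' are all still alive. Count those that hold
         * the value their own call started with. */
        instances_seen = 0U;
        for (uint32_t i = 0U; (i < depth) && (i < MAX_DEPTH); ++i) {
            if (*live[i] == *live[0] - i) {
                ++instances_seen;
            }
        }
        result = 1U;
    } else {
        result = n * fact(n - 1U);
    }

    --depth;
    return result;
}

/* Number of simultaneously live instances found during the last call. */
uint32_t fact_instances_seen(void)
{
    return instances_seen;
}
```

After fact(5U), fact_instances_seen() reports five instances, one per nested call. Each of them occupied its own place on the stack at the same time, which is also why deep recursion consumes stack so quickly.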
If you use a lot of automatic variables or deep call sequences, such as the recursion shown in the video, you can exceed the stack memory. That situation is called stack overflow and can have various consequences (usually very unpleasant).
If you are lucky, your program will crash quickly. You saw this in the video, where the sizable automatic variable foo promptly overflowed the stack and triggered the BusFault exception.
But it is also possible that the stack overflows gradually. You saw this in the video when the out-of-bounds indexing into the foo array corrupted the return address on the stack. In that case, the function fact() returned, but not to the point of the original call. Instead, due to some “coincidence” (admittedly prearranged to make it fun to watch), the CPU started “executing” the vector table. (You will learn about the vector table in future lessons about the startup code.) Interestingly, that didn’t trigger any CPU exceptions, and the execution continued through the main() function and into another call to the fact() function. The scenario repeated from that point, but the stack wasn’t quite restored to the original state, so it kept growing until it eventually overflowed.
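The root cause in this scenario was a write past the end of an automatic array. One simple defense is to route such writes through a helper that checks the index first. The sketch below is an assumption-laden illustration: FOO_LEN and foo_store() are invented names, and the array size is arbitrary rather than the one used in the video.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOO_LEN 6U   /* assumed size, for illustration only */

/* Stores value at foo[idx] only if idx is in range.
 * Returns true on success and false if the write was refused. */
bool foo_store(uint32_t foo[FOO_LEN], size_t idx, uint32_t value)
{
    if (idx >= FOO_LEN) {
        /* Refuse: this write would land on whatever sits next to the
         * array on the stack, possibly a saved return address. */
        return false;
    }
    foo[idx] = value;
    return true;
}
```

A check like this costs a compare and a branch, which is usually a small price next to a corrupted return address.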
The main point I’d like you to remember is that overflowing the stack can lead to “logic-defying” behavior. Experts know this, and whenever they encounter an “impossible” condition, the first thing they check is stack overflow. In fact, this was the conclusion of the investigation performed by Michael Barr in the famous Toyota unintended acceleration lawsuits [1,2].
If you are interested in more analysis of the Toyota case, I’ve written a blog post: Are We Shooting Ourselves in the Foot with Stack Overflow?, where I pointed out strategies to mitigate the risk of such failures. The crucial aspect is that it’s much better to fail quickly and regain control than to keep running in some crippled state, potentially causing more harm.
With this in mind, perhaps you’ve noticed in the video that all CPU fault exception handlers were endless loops in the standard startup code. This common practice might be convenient for debugging because you find the CPU spinning inside the exception handler. But to the device users, the behavior represents a denial of service. Therefore, all exception handlers in production code should perform appropriate damage control and most likely reset the system. But this is a separate topic for another time.
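As a rough sketch of what "damage control and reset" could look like, consider the code below. All names in it are hypothetical, and the two board hooks are stubs that only set flags, so that the example can run on a desktop machine. On a Cortex-M device with CMSIS, the reset request would typically be a call to NVIC_SystemReset(), and the function would be called from the actual fault handlers. After a stack overflow, a real handler may also need to restore the stack pointer before it can safely run any C code, which is not shown here.

```c
#include <stdbool.h>

/* Hypothetical board hooks, stubbed out for host execution. */
static bool outputs_safe;
static bool reset_requested;

static void board_outputs_to_safe_state(void)
{
    outputs_safe = true;     /* on target: e.g. switch off the actuators */
}

static void board_request_reset(void)
{
    reset_requested = true;  /* on target: would reset and never return */
}

/* Common body for fault handlers: damage control first, then reset,
 * instead of spinning forever in an endless loop. */
void fault_handler_common(void)
{
    board_outputs_to_safe_state();
    board_request_reset();
}

/* Test hook: true once both steps have been carried out. */
bool fault_was_handled(void)
{
    return outputs_safe && reset_requested;
}
```

The order matters: the outputs go to a safe state before the reset is requested, so the device does not keep driving its outputs in a crippled state while the reset takes effect.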
In the next lesson, I’ll tackle the subject of standard integer types, mixing integer types, and integer promotion in C. Stay tuned!
[1] Transcript of Michael Barr’s testimony in the Toyota Unintended Acceleration Lawsuit
[2] Slides presented by Michael Barr in his expert witness testimony
Programming embedded systems: Stack overflow and other pitfalls of functions – Embedded