Even calling uninitialized data “garbage” is misleading. You might expect the compiler to simply leave out the initialization code and compile the rest as usual, so the values would be “whatever was in memory previously”. But no - the compiler can (and absolutely will) assume the values are whatever is most convenient for optimization, even if that would be vanishingly unlikely or outright impossible.<p>As an example, consider this code (godbolt: <a href="https://godbolt.org/z/TrMrYTKG9" rel="nofollow">https://godbolt.org/z/TrMrYTKG9</a>):<p><pre><code> struct foo {
unsigned char a, b;
};
foo make(int x) {
foo result;
if (x) {
result.a = 13;
} else {
result.b = 37;
}
return result;
}
</code></pre>
At high enough optimization levels, the function compiles to “mov eax, 9485; ret”, which sets both a=13 and b=37 (9485 = 37*256 + 13, with a in the low byte and b in the next one) without testing the condition at all - as if both branches of the test were executed. This is perfectly legal because the lack of initialization means the values <i>could</i> already have been set that way (however unlikely), so the compiler just goes ahead and sets them that way. It’s faster!
That seems like a reasonable optimization, actually. If the programmer doesn’t initialize a variable, why not set it to a value that always works?<p>Good example of why uninitialized variables are not intuitive.
Indeed, UB is literally whatever the compiler feels like. A famous one [1] has the compiler deleting code that contains UB and falling through to the next function.<p>"But it's right there in the name!" Undefined behavior literally places <i>no restrictions</i> on the code generated or the behavior of the program. And the compiler is under no obligation to help you debug your (admittedly buggy) program. It can literally <i>delete your program and replace it with something else that it likes.</i><p>[1] <a href="https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html" rel="nofollow">https://kristerw.blogspot.com/2017/09/why-undefined-behavior...</a>
There are some even funnier cases like this one: <a href="https://gcc.godbolt.org/z/cbscGf8ss" rel="nofollow">https://gcc.godbolt.org/z/cbscGf8ss</a><p>The compiler sees that foo can only be assigned in one place (a function that isn't called locally, but could be called from other object files linked into the program) and that its address never escapes. Since dereferencing a null pointer is UB, it can legally assume that `*foo` is always 42 and optimize the variable away entirely.
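The pattern is roughly this (a hedged reconstruction; the linked godbolt may differ in details):<p><pre><code> static int *foo;          // file-local; only ever assigned below

 void set() {              // never called in this file, but another
     static int x = 42;    // object file linked in could call it
     foo = &x;
 }

 int get() {
     return *foo;          // dereferencing null is UB, so the compiler
 }                         // may assume set() ran and fold this to 42
</code></pre>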
To those who are just as confused as me:<p>Compilers can do whatever they want when they see UB, and accessing an unassigned and unassignable (file-local) variable is UB, therefore the compiler can just decide that *foo is in fact <i>always</i> 42, or never 42, or sometimes 42, and all would be just as valid options for the compiler.<p>(I know I'm just restating the parent comment, but I had to think it through several times before understanding it myself, even after reading that.)
> Compilers can do whatever they want when they see UB, and accessing an unassigned and unassignable (file-local) variable is UB, therefore the compiler can just decide that *foo is in fact always 42, or never 42, or sometimes 42, and all would be just as valid options for the compiler.<p>That's not exactly correct. It's not that the compiler sees that there's UB and decides to do something arbitrary: it's that it sees there's exactly one way for UB to <i>not</i> be triggered, and so it assumes that's what happens.
Although it should be noted that that’s not how compilers “reason”.<p>The way they work things out is to assume no UB happens (because otherwise your program is invalid, and you would not request compiling an invalid program, would you?) and then work from there.
If you don't initialise a variable, you're implicitly saying <i>any</i> value is fine, so this actually makes sense.
The difference is that it can behave as if it had multiple different values at the same time. You don't just get <i>any</i> value, you can get completely absurd paradoxical Schrödinger values where `x > 5 && x < 5` may be true, and on the next line `x > 5` may be false, and it may flip on Wednesdays.<p>This is because the code is executed symbolically during optimization. It's not running on your real CPU. It's first "run" on a simulation of an abstract machine from the C spec, which doesn't have registers or even real stack to hold an actual garbage value, but it does have magic memory where bits can be set to 0, 1, or this-can-never-ever-happen.<p>Optimization passes ask questions like "is x unused? (so I can skip saving its register)" or "is x always equal to y? (so I can stop storing it separately)" or "is this condition using x always true? (so that I can remove the else branch)". When using the value is an <i>undefined behavior</i>, there's no requirement for these answers to be consistent or even correct, so the optimizer rolls with whatever seems cheapest/easiest.
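A minimal sketch of the kind of thing this allows (hypothetical snippet; what actually happens depends on compiler and flags):<p><pre><code> int test() {
     int x;               // never initialized: reading x is UB
     int r = 0;
     if (x > 5) r += 1;   // the optimizer may treat this as taken...
     if (x < 5) r += 2;   // ...and this one as taken too
     return r;            // 0, 1, 2 or even 3 are all possible, and
 }                        // the answer may differ from call to call
</code></pre>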
"Your scientists were so preoccupied with whether they could, they didn't stop to think if they should."<p>With Optimizing settings on, the compiler should immediately treat unused variables as errors by default.
For some values of 'sense'.
Even the notion that uninitialized memory contain values is kind of dangerous. Once you access them you can't reason about what's going to happen at all. Behaviour can happen that's not self-consistent with any value at all: <a href="https://godbolt.org/z/adsP4sxMT" rel="nofollow">https://godbolt.org/z/adsP4sxMT</a>
Things can get even wonkier if the compiler keeps the values in registers, as two consecutive loads could use different registers depending, as you say, on what's most convenient for optimisation (register allocation, code density).
If I understand it right, in principle the compiler doesn't even need to do that.<p>It can just leave the result totally uninitialised. That's because both code paths have undefined behaviour: whichever of result.a or result.b is not set is still copied at "return result", which is undefined behaviour, so the overall function has undefined behaviour either way.<p>It could even just replace the function body with abort(), or omit the implementation entirely (even the ret instruction, allowing execution to just fall through to whatever memory happens to follow). Whether any compiler does that in practice is another matter.
> It can just leave the result totally uninitialised. That's because both code paths have undefined behaviour: whichever of result.a or result.b is not set is still copied at "return result", which is undefined behaviour, so the overall function has undefined behaviour either way.<p>That is incorrect, per the resolution of DR222 (partially initialized structures) at WG14:<p>> This DR asks the question of whether or not struct assignment is well defined when the source of the assignment is a struct, some of whose members have not been given a value. There was consensus that this should be well defined because of common usage, including the standard-specified structure struct tm.<p>As long as the caller doesn't read an uninitialised member, it's completely fine.
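Concretely, reusing the struct from the top comment (a sketch of the pattern the DR blesses, in C terms):<p><pre><code> struct foo { unsigned char a, b; };

 struct foo make(int x) {
     struct foo result;        // neither member initialized yet
     if (x) result.a = 13;
     else   result.b = 37;
     return result;            // whole-struct copy: well defined per DR222
 }

 int caller(int x) {
     struct foo f = make(x);
     return x ? f.a : f.b;     // fine: reads only the member that was
 }                             // written; reading the other one is the UB
</code></pre>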
How is this an "optimization" if the compiled result is incorrect? Why would you design a compiler that can produce errors?
It’s not incorrect.<p>The code says that if x is true then a=13, and if it is false then b=37.<p>This is the case. It's just that a=13 even if x is false - a thing that the code had nothing to say about, and so the compiler is free to do.
Ok, so you’re saying it’s “technically correct?”<p>Practically speaking, I’d argue that a compiler assuming uninitialized stack or heap memory is always equal to some arbitrary convenient constant is obviously incorrect, actively harmful, and benefits no one.
In this example, the human author clearly intended mutual exclusivity in the condition branches, and this optimization would in fact destroy that assumption. That said, (a) human intentions are not evidence of foolproof programming logic, and often miscalculate state, and (b) the author could possibly catch most or all errors here when compiling without optimizations during debugging phase.
Also, even without UB, under a naive translation a could just happen to be 13 by chance, so this behaviour isn't even an example of nasal demons.
Because a <i>could</i> be 13 even if x is false: initialisation of the struct doesn’t define what the initial values of a and b are.<p>Same for b. If x is true, b <i>could</i> be 37, no matter how unlikely that is.
It is not incorrect. The values are undefined, so the compiler is free to do whatever it wants with them, even assign values to them.
It's not incorrect. Where is the flaw?
There are a few problems with this post:<p><pre><code> 1 - In C++, a struct is no different than a class other than
     a default access of public instead of private.

 2 - The use of braces for property initialization in a
     constructor is malformed C++.

 3 - C++ is not C, as the author eventually concedes:

     At this point, my C developer spider senses are tingling:
     is Response response; the culprit? It has to be, right? In
     C, that's clear undefined behavior to read fields from
     response: The C struct is not initialized.
 </code></pre>
In short, if the author employed C++ instead of trying to use C techniques, all they would have needed is a zero-cost constructor definition such as:<p><pre><code> inline Response() : error(false), succeeded(false)
 {
 }</code></pre>
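Since C++11, the same effect can also be had with default member initializers at the declaration site (a sketch; the member names are taken from the post's Response struct):<p><pre><code> struct Response {
     bool error = false;      // applied by every constructor,
     bool succeeded = false;  // including the implicit default one
 };
</code></pre>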
I have bumped into this myself, too. It's really annoying. The biggest footgun isn't even discussed explicitly, and it might be how the error got introduced: when the struct goes from POD to non-POD or vice versa, the rules change, so a completely innocent change, like adding a string field, can suddenly create undefined behaviour in unrelated code that was previously correct.
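One way the rules can flip under you (a minimal sketch with hypothetical types, not the post's code): value-initialization with {} zeroes the members of an aggregate, but once someone adds a user-provided default constructor, the same syntax runs that constructor instead, and any member it forgets stays uninitialized.<p><pre><code> struct Plain {
     bool error;
     bool succeeded;
 };

 struct Evolved {
     Evolved() {}     // someone added a constructor and forgot the bools
     bool error;
     bool succeeded;
 };

 void demo() {
     Plain a{};       // value-initialization: both bools are false
     Evolved b{};     // runs Evolved(): both bools stay indeterminate
 }
</code></pre>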
Yeah, looks pretty straightforward to me, but I used to write C++ for a living. I mean, there are complicated cases in C++ starting with C++11; this one is not really one of them. Just init the fields to false. Most of these cases are just C++ trying to bring in new features without breaking legacy code; it has become pretty difficult to keep up with it all.
Even if you fixed the uninitialized data problem, this code is still a bug waiting to happen. It should be a single bool in the struct to handle the state for the function, as only two of the four states actually make sense:<p><pre><code> succeeded = true;
 error = true;
 // This makes no sense

 succeeded = false;
 error = false;
 // This makes no sense
</code></pre>Otherwise, if I'm checking a response, I am generally going to check just "succeeded" or "error" and miss one of the two states above that "shouldn't happen"; or, if I check both, it's a lot of awkward extra code and I'm left trying to output an error for a state that, again, makes no sense.
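A sketch of the collapsed version (a hypothetical refactor, not from the post):<p><pre><code> struct Response {
     bool succeeded = false;   // "error" is simply !succeeded;
 };                            // the two nonsense states can't exist
</code></pre>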
It happens often when the "error" field is not a bool but a string, aka error_message. It could be an empty string, or _null_, or even _undefined_ if we're in JS.<p>Then the obvious question is why we need _succeeded_ at all, if we can always check for _error_. Sometimes it can be useful, when the server itself doesn't know whether the operation succeeded (e.g. an IO/database operation timed out): it might have succeeded, but should still show an error message to the user.<p>Another possibility is that succeeded is not a bool but, say, a "succeeded_at" timestamp. In general, I've noticed that almost any boolean value in a database can be replaced with a timestamp or an error code.
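In that spirit, a hypothetical richer Response (field names are illustrative, not from the post):<p><pre><code> struct Response {
     const char* error_message;   // null when there is no error
     long long   succeeded_at;    // 0 if it never succeeded, else a timestamp
 };
</code></pre>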
To me the real horror is that the exact same syntax can be either a perfectly normal thing to do, or a horrible mistake that gives the compiler a license to kill, and this doesn't depend on something locally explicit, but on details of a definition that lives somewhere else and may have multiple layers of indirection.
Many years ago we had a customer complaint about undefined data changing value in Fortran 77. It turned out that the compiler never allocated storage for uninitialized variables, so they were aliased to something else.<p>The compiler was changed to allocate storage for any referenced variables.
I think UB doesn't have much to do with this bug after all.<p>The original code defined a struct with two bools that were not initialized. Therefore, when you instantiate one, the initial values of the two bools could be <i>anything</i>. In particular, they could be both true.<p>This is a bit like defining a local int and getting surprised that its initial value is not always zero. (Even if the compiler did nothing funny with UB, its initial value could be anything.)
> The original code defined a struct with two bools that were not initialized. Therefore, when you instantiate one, the initial values of the two bools could be anything. In particular, they could be both true.<p>Then reading from that struct, as in the OP, constitutes UB.
Well yes, that would be UB, but even if the C++ compiler had no concept of UB, it would still be wrong code.
If the two fields in the struct are expected to be false unless changed, then initialize them as such. Nothing is gained by leaving it to the compiler, and a lot is lost.
I think the point is that <i>sometimes</i> variables are defined by the language spec as initialized to zero, and <i>sometimes</i> they aren't.<p>Perhaps what you mean is, "Nothing is to be gained by relying on the language spec to initialize things to zero, and a lot is lost"; I'd agree with that.
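For instance (standard C++ rules, not specific to the post):<p><pre><code> int g_counter;        // static storage duration: guaranteed to start at 0

 void f() {
     int l_counter;    // automatic storage: indeterminate value;
 }                     // reading it before assignment is UB
</code></pre>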
tldr; the UB was reading uninitialized data in a struct. The C++ rules for when default initialization occurs are crazy complex.<p>I think a sanitizer probably would have caught this, but IMHO this is the language's fault.<p>Hopefully future versions of C++ will mandate default initialization for all cases that are UB today and we can be free of this class of bug.
Yeah... but I wouldn't characterize the bug itself (in its essential form) as UB.<p>Even if the implementation specified that the data would be indeterminate depending on what existed in that memory location previously, the bug would still exist.<p>Even if you hand-coded this in assembly, the bug would still exist.<p>The essence of the bug is uninitialized data being garbage. That's always gonna be a latent bug, regardless of whether the behavior is defined in an ISO standard.
Yeah, I agree. This is a classic “uninitialized variable has a garbage memory value” bug, not an “undefined nasal demons behavior” bug.<p>That said, we all learn this one! I spent like two weeks debugging a super rare desync bug in a multiplayer game with a P2P lockstep synchronous architecture.<p>Suffice it to say I am now a zealot about providing default values all the time. Thankfully it’s a lot easier since C++11 came out and lets you define default values at the declaration site!
In C++26, reading an uninitialized variable is by default Erroneous Behaviour, which means your compiler is encouraged to diagnose it (it's an error), but if it happens anyway (perhaps because the compiler can't tell before runtime) there's a specified behaviour; it isn't Undefined Behaviour. If the compiler can't just diagnose that what you wrote was nonsense, the variable has some arbitrary value chosen by the implementation, perhaps configurable or perhaps described in your compiler's documentation.<p>So these variables will be more or less what the current "defanged" Rust std::mem::uninitialized() function gets you. A bit slower than "truly" uninitialized variables, but not instant death in most cases if you made a mistake, because you're human.<p>Those C++ people who feel they actually <i>need</i> uninitialized variables can tell the compiler explicitly [for that particular variable] in C++26 that they opt out of this safeguard. They get the same behaviour you've seen described in this thread today: arbitrary Undefined Behaviour if you read the uninitialized variable. This is similar to modern Rust's MaybeUninit::uninit().assume_init() - you are explicitly telling the compiler it's OK to set fire to everything; you should probably not do this, but we did warn you.
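Sketched out (following the adopted P2795 design, where [[indeterminate]] is the standard opt-out attribute; compiler support may vary):<p><pre><code> int demo() {
     int a;                    // C++26: reading this is erroneous
                               // behaviour - some implementation-chosen
                               // value, diagnosable, but defined
     int b [[indeterminate]];  // explicit opt-out: reading b is real UB
     return a;                 // never nasal demons; b is another story
 }
</code></pre>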
For now, best strategy is to initialize everything explicitly.
Great post. It was both funny and humble. Of course, it probably wasn't at all funny at the time.
Symbian's way of avoiding this was to use a class called CBase to derive from. CBase would memset the entire allocated memory for the object to binary zeros, thus zeroizing any member variable.<p>And by convention, all classes derived from CBase would start their name with C, so something like CHash or CRectangle.
I'm afraid that's still not defined behaviour in many cases. For example, a pointer and a bool can be initialized with `=0`, but that doesn't mean the binary representation in memory has to be all zeroes, so initializing with memset would still be wrong. (Even if it works with all compilers I know of.)<p>Also, how does CBase know the size of its allocated memory?
The symbian source code is available. Looks like it uses a class-specific operator new() overload.<p><a href="https://github.com/SymbianSource/oss.FCL.sf.os.kernelhwsrv/blob/0c3208650587ac0230aed8a74e9bddb5288023eb/kernel/eka/include/e32base.h#L29" rel="nofollow">https://github.com/SymbianSource/oss.FCL.sf.os.kernelhwsrv/b...</a><p>2. Initialisation of the CBase derived object to binary zeroes through a specific CBase::operator new() - this means that members, whose initial value should be zero, do not have to be initialised in the constructor. This allows safe destruction of a partially-constructed object.
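A minimal sketch of the mechanism (not the actual Symbian source, which is linked above). The size question answers itself: operator new receives the size of the most-derived object as its first argument.<p><pre><code> #include &lt;cstddef&gt;   // std::size_t
 #include &lt;cstring&gt;   // std::memset
 #include &lt;new&gt;       // ::operator new

 class CBase {
 public:
     // Receives the full size of the most-derived object, so the base
     // class can zero memory whose layout it knows nothing about.
     static void* operator new(std::size_t size) {
         void* p = ::operator new(size);
         std::memset(p, 0, size);   // every member starts as binary zeroes
         return p;
     }
     virtual ~CBase() {}
 };

 class CRectangle : public CBase {
     int iWidth;    // zeroed before the constructor runs
     int iHeight;
 };
</code></pre>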