A few years back I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free, so that uninitialized allocations contain nothing interesting.<p>We expected this to hurt performance, but we were unable to measure any impact in practice.<p>Everyone still working in memory-unsafe languages should really just do this IMO. It would have mitigated this Mongo bug.
> OpenBSD uses 0xdb to fill newly allocated memory and 0xdf to fill memory upon being freed. This helps developers catch "use-before-initialization" (seeing 0xdb) and "use-after-free" (seeing 0xdf) bugs quickly.<p>Looks like this is the default in OpenBSD.
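A quick way to see the pattern in action (a sketch; reading uninitialized memory is undefined behavior, so this is purely a demonstration of what a junk-filling allocator does):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stdlib.h&gt;

 int main(void) {
     unsigned char *p = malloc(16);
     if (!p) return 1;
     // Under OpenBSD's malloc, a fresh allocation is filled with 0xdb,
     // so a use-before-initialization read is immediately recognizable.
     printf("first byte of fresh allocation: %02x\n", p[0]);
     free(p);
     // After free() the bytes are refilled with 0xdf, so a dangling
     // pointer inspected in a debugger shows that pattern right away.
     return 0;
 }</code></pre>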
Recent macOS versions zero out memory on free, which improves the efficacy of memory compression. Apparently it’s a net performance gain in the average case.
<i>A few years back I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free, so that uninitialized allocations contain nothing interesting.</i><p>Note that many malloc implementations will do this for you given an appropriate environment, e.g. setting MALLOC_CONF=junk:free (jemalloc's opt.junk option) will do this on FreeBSD.
FYI, at least in C/C++, the compiler is free to throw away assignments to any memory pointed to by a pointer if said pointer is about to be passed to free(), so depending on how you did this, the lack of a perf impact could be because your compiler removed the assignment entirely. This even affects calls to memset().<p>See here: <a href="https://godbolt.org/z/rMa8MbYox" rel="nofollow">https://godbolt.org/z/rMa8MbYox</a>
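For reference, the pattern in that godbolt link is roughly the following (the function name is illustrative); at -O2, GCC and Clang typically delete the memset as a dead store:<p><pre><code> #include &lt;stdlib.h&gt;
 #include &lt;string.h&gt;

 void scrub_and_free(void *p, size_t n) {
     // The compiler knows free()'d memory can never be legally read
     // again, so these stores are "dead" and may be removed entirely.
     memset(p, 0, n);
     free(p);
 }</code></pre>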
I patched the free() implementation itself, not the code that calls free().<p>I did, of course, test it, and anyway we now run into the "freed memory" pattern regularly when debugging (yes including optimized builds), so it's definitely working.
However, if you call memset through a volatile function pointer, the compiler will keep it:<p><pre><code> #include &lt;stdlib.h&gt;
 #include &lt;string.h&gt;

 void not_free(void* ptr);

 void test_with_free(char* ptr) {
     ptr[5] = 6;
     // Calling memset through a volatile function pointer keeps the
     // compiler from recognizing the call and deleting it as dead.
     void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
     memset_v(ptr + 2, 3, 4);
     free(ptr);
 }

 void test_with_other_func(char* ptr) {
     ptr[5] = 6;
     void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
     memset_v(ptr + 2, 3, 4);
     not_free(ptr);
 }</code></pre>
That code is not guaranteed to work. Declaring memset_v as volatile means that the variable has to be read, but does not imply that the function must be called; the compiler is free to compile the function call as "tmp = memset_v; if (tmp != memset) tmp(...)" relying on its knowledge that in the likely case of equality the call can be optimized away.
Whilst the C standard doesn't guarantee it, both LLVM and GCC _do_. They have documented, as implementation-defined behaviour, that it will work, so they are not free to optimise it away.<p>[0] <a href="https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics" rel="nofollow">https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics</a><p>[1] <a href="https://gitweb.git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=lib/memset_explicit.c;hb=refs/heads/stable-202301" rel="nofollow">https://gitweb.git.savannah.gnu.org/gitweb/?p=gnulib.git;a=b...</a>
Newer versions of C (and C++, apparently) have functions for this, e.g. memset_explicit in C23, so the volatile trick isn't necessary ( <a href="https://en.cppreference.com/w/c/string/byte/memset.html" rel="nofollow">https://en.cppreference.com/w/c/string/byte/memset.html</a> ).
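A minimal sketch using the C23 function (assuming a C23 toolchain; C11 Annex K's memset_s serves a similar purpose):<p><pre><code> #include &lt;stdlib.h&gt;
 #include &lt;string.h&gt;  // memset_explicit() is new in C23

 void wipe_and_free(void *p, size_t n) {
     // Unlike plain memset(), memset_explicit() is specified to take
     // effect even if the memory is never read again.
     memset_explicit(p, 0, n);
     free(p);
 }</code></pre>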
Zeroing memory should absolutely be the default behavior for any generic allocator in 2025.<p>If you need better performance, write your own allocator optimized for your specific use case — it's not that hard.<p>Besides, if you don't need to clear old allocations, there are likely other optimizations you'll be able to find which would never fly in a system allocator.
You know, I never even considered doing that, but it makes sense; whatever overhead is incurred by writing that static byte pattern is almost certainly minuscule compared to the overhead of something like a garbage collector.
IMO the important tradeoff here is that a few microseconds spent sanitizing memory saves millions of dollars of headache when memory-unsafe languages fail (which happens regularly)
I agree. I almost feel like this should be a flag on `free`: if you pass 1 or something as a second argument (or maybe a separate `free_safe` function), it would automatically `memset` whatever it's freeing with 0's, and then do the normal freeing.
Alternatively, just make free do that by default, adding a fast_and_furious_free which doesn't do it, for the few hotspots where that tiny bit of performance is actually needed.
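A sketch of what such a wrapper could look like (the name is hypothetical; malloc_usable_size() is a glibc extension, and the volatile-qualified pointer keeps the stores from being optimized out, per the memset subthread above):<p><pre><code> #include &lt;stdlib.h&gt;
 #include &lt;malloc.h&gt;  // malloc_usable_size() is glibc-specific

 void free_scrubbed(void *p) {
     if (!p) return;
     // Writing through a volatile pointer prevents the compiler from
     // treating the stores as dead and deleting them.
     volatile unsigned char *b = p;
     size_t n = malloc_usable_size(p);
     for (size_t i = 0; i &lt; n; i++)
         b[i] = 0xDF;  // OpenBSD-style "freed memory" pattern
     free(p);
 }</code></pre>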
<a href="https://news.ycombinator.com/item?id=46417221">https://news.ycombinator.com/item?id=46417221</a>
Non-deterministic latency is a drawback, but garbage collection is not inherently slower than manual memory management/reference counting/etc. Depending on the usage pattern it can be faster. It's a set of trade-offs.
Is this the same as enabling `init_on_free=1` in the kernel?
The author seems to be unaware that Mongo internally develops in a private repo and commits are published later to the public one with <a href="https://github.com/google/copybara" rel="nofollow">https://github.com/google/copybara</a>. All of the confusion around dates is due to this.
The author of this post is incorrect about the timeline. Our Atlas clusters were upgraded days before the CVE was announced.
How often are mongo instances exposed to the internet? I'm more of an SQL person, and for those I know it's pretty uncommon, but it does happen.
From my experience, MongoDB's entire raison d'être is "laziness".<p>* Don't worry about a schema.<p>* Don't worry about persistence or durability.<p>* Don't worry about reads or writes.<p>* Don't worry about connectivity.<p>This is basically the entire philosophy, so it's not surprising at all that users would also not worry about basic security.
Although interestingly, for all the mongo deployments I managed, the first time I saw a cluster publicly exposed without SSL was postgres :)
To the extent that any of this was ever true, it hasn’t been true for at least a decade. After the WiredTiger acquisition they really got their engineering shit together. You can argue it was several years too late but it did happen.
Not only that, but authentication is much harder than it needs to be to set up (and is off by default).
I'm sure there are publicly exposed MySQLs too
Most of your points are wrong. Maybe only the first is valid-ish.
Ultimate webscale!
A highly cited reason for using mongo is that people would rather not figure out a schema. (N=3/3 for “serious” orgs I know using mongo).<p>That sort of inclination to push off doing the right thing now probably overlaps with “let’s just make the db publicly exposed” instead of doing the upfront work of setting up an internal network to save yourself a headache down the line.
> A highly cited reason for using mongo is that people would rather not figure out a schema.<p>Which is such a cop out, because there is <i>always</i> a schema. The only questions are whether it is designed, documented, and where it's implemented. Mongo requires some very explicit schema decisions, otherwise performance will quickly degrade.
Fowler describes it as Implicit vs Explicit schema, which feels right.<p>Kleppmann chooses "schema-on-read" vs "schema-on-write" for the same concept, which I find harder to grasp mentally, but it describes when schema validation needs to occur.
I would have hoped that there would be no important data in MongoDB.<p>But now we can at least rest assured that the important data in MongoDB is just very hard to read, given the lack of schemas.<p>Probably all of that nasty "schema" work and tech debt will finally be done by hackers trying to make use of that information.
There is a surprising amount of important data in various Mongo instances around the world. Particularly within high finance, with multi-TB setups sprouting up here and there.<p>I suspect that this is in part due to historical inertia and exposure to SecDB designs.[0] Financial instruments can be <i>hideously complex</i> and they certainly are ever-evolving, so I can imagine a fixed schema for essentially constantly shifting time series universe would be challenging. When financial institutions began to adopt the SecDB model, MongoDB was available as a high-volume, "schemaless" KV store, with a reasonably good scaling story.<p>Combine that with the relatively incestuous nature of finance (they tend to poach and hire from within their own ranks), the average tenure of an engineer in one organisation being less than 4 years and you have an osmotic process of spreading "this at least works in this type of environment" knowledge. Add the naturally risk-averse nature of finance[ß] and you can see how one successful early adoption will quickly proliferate across the industry.<p>0: This was discussed at HN back in the day too: <a href="https://calpaterson.com/bank-python.html" rel="nofollow">https://calpaterson.com/bank-python.html</a><p>ß: For an industry that loves to take financial risks - with other people's money of course, they're not stupid - the players in high finance are remarkably risk-averse when it comes to technology choices. Experimentation with something new and unknown carries a potentially unbounded downside with limited, slowly emerging upside.
I'd argue that there's a schema; it's just defined dynamically by the queries themselves. Given how much of the industry seems fine with dynamic typing in languages, it's always been weird to me how diehard people seem to be about this with databases. There have been plenty of legitimate reasons to be skeptical of mongodb over the years (especially in the early days), but this one really isn't any more of a big deal than using Python or JavaScript.
Yes there's a schema, but it's hard to maintain. You end up with 200 separate code locations rechecking that the data is in the expected shape. I've had to fix too many such messes at work after a project ground to a halt. Ironically, some people will do schemaless but use a statically typed lang for regular backend code, which doesn't buy you much. I'd totally do dynamic there. But a DB schema is so little effort for the strong foundation it sets for your code.<p>Sometimes it comes from a misconception that your schema should never have to change as features are added, and so you need to cover all cases with 1-2 omni tables. Often named "node" and "edge."
> Ironically some people will do schemaless but use a statically typed lang for regular backend code, which doesn't buy you much. I'd totally do dynamic there.<p>I honestly feel the opposite, at least if you're the only consumer of the data. I'd never really go out of my way to use a dynamically typed language, and I'm already going to have to do something to get the data into my own language's types, at which point it doesn't make a huge difference what format it used to be in. When there are a variety of clients being used, though, this logic might not apply.
We just sit a data persistence service in front of mongo, so we can enforce some controls for everything there if we need them, but quite often we don't.<p>It's probably better to check what you're working on than blindly assume this thing you've gotten from somewhere is the right shape anyway.
The adage I always tell people is that in any successful system, the data will far outlive the code. People throw away front ends and middle layers all the time. This becomes so much harder to do if the schema is defined across a sprawling middle layer like you describe.
As someone who has done a lot of Ruby coding, I would say using a statically typed database is almost a must when using a dynamically typed language. The database enforces the data model, and the Ruby code was mostly just glue on top of that data model.
What's weird to me is when dynamic typers don't acknowledge the tradeoff of quality vs upfront work.<p>I never said mongodb was wrong in that post, I just said it accumulated tech debt.<p>Let's stop feeling attacked over the negatives of tradeoffs
It's possible you didn't intend it, but your parent comment definitely came off as snarky, so I don't think you should be surprised that people responded in kind. You're honestly doing it again with the "let's stop feeling attacked" bit; whether you mean it or not, your phrasing comes across as pretty patronizing, and overall combined with the apparent dislike of people disagreeing with you after the snark it comes across as passive-aggressive. In general it's not going to go over well if you dish out criticism but can't take it.<p>In any case, you quite literally said there was a "lack of schemas", and I disagreed with that characterization. I certainly didn't feel attacked by it; I just didn't think it was the most accurate way to view things from a technical perspective.
Whatever horrors there are with mongo, it's still better than the shitshow that is Zope's ZODB.
The article links to a shodan scan reporting 213K exposed instances <a href="https://www.shodan.io/search?query=Product%3A%22MongoDB%22" rel="nofollow">https://www.shodan.io/search?query=Product%3A%22MongoDB%22</a>
It could be because when you leave an SQL server exposed, it often turns into much worse things. For example, without additional configuration, PostgreSQL defaults to a configuration that can be used to own the entire host machine. There is probably some obscure feature that allows system process management, uploading a shell script, or something else that isn't disabled by default.<p>The end result is that "everyone" kind of knows that if you put a PostgreSQL instance up publicly facing without a password or with a weak/default password, it will be popped in minutes, and you'll find out about it because the attackers are lazy and just running crypto-mining malware, etc.
My university has one exposed to the internet, and it's still not patched. Everyone is on holiday and I have no idea who to contact.
For a long time, the default install had it binding to all interfaces and with authentication disabled.
Often. Lots of data leaks have happened because of this. People spin it up in a cloud VM and forget it has a public IP all the time.
Because nobody uses mongo for the reasons you listed. They use redis, dynamo, scylla or any number of enriched KV stores.<p>Mongo has spent its entire existence pretending to be a SQL database by poorly reinventing everything you get for free in postgres or mysql or cockroach.
False. Mongo never pretended to be a SQL database. But some dimwits insisted on using it for transactions, for whatever reason, and so it got transactional support way later in life, and only for non-sharded clusters in the initial release. People who know what they are doing have been using MongoDB for reliable horizontally-scalable document storage basically since 3.4. With proper complex indexing.<p>Scylla! Yes, it will store and fetch your simple data very quickly with very good operational characteristics. Not so good for complex querying and indexing.
Yeah fair, I was being a bit lazy here when writing my comment. I've used nosql professionally quite a bit, but always set up by others. When working on personal projects I reach for SQL first because I can throw something together and don't need ideal performance. You're absolutely right that they both have their place.<p>That being said the question was genuine - because I don't keep up with the ecosystem, I don't know it's ever valid practice to have a nosql db exposed to the internet.
What they wrote was pretty benign. They just asked how common it is for Mongo to be exposed. You seem to have taken that as a completely different statement
In something like a database zeroing or poisoning on free is probably a good idea. (These days probably all allocators should do it by default.)<p>Allocators are an interesting place to focus on for security. Chris did amazing work there for Blink that eventually rolled out to all of Chromium. The docs are a fun read.<p><a href="https://blog.chromium.org/2021/04/efficient-and-safe-allocations-everywhere.html" rel="nofollow">https://blog.chromium.org/2021/04/efficient-and-safe-allocat...</a><p><a href="https://chromium.googlesource.com/chromium/src/+/master/base/allocator/partition_allocator/PartitionAlloc.md" rel="nofollow">https://chromium.googlesource.com/chromium/src/+/master/base...</a>
I'm still thinking about the optimism brought by the OWASP Top 10, hoping that major flaws would eventually be solved. Buffer overflow has been on that list since the beginning... in 2003.
Why is anyone using mongo for literally anything?
Easy replication. I suppose it's faster than Postgres's JSONB, too.<p>I would rather not use it, but I see that there are legitimate cases where MongoDB or DynamoDB is a technically appropriate choice.
because it is "web scale"<p>ref: <a href="https://www.youtube.com/watch?v=b2F-DItXtZs" rel="nofollow">https://www.youtube.com/watch?v=b2F-DItXtZs</a>
Right? When they came out, it was all about NoSQL, which then turned out to just mean key-value databases, which are plentiful.
This is a nasty ad repositorium datorum argumentation which I cannot tolerate.
> On Dec 24th, MongoDB reported they have no evidence of anybody exploiting the CVE<p>Absence of evidence is not evidence of absence...
Related:<p><i>MongoBleed</i><p><a href="https://news.ycombinator.com/item?id=46394620">https://news.ycombinator.com/item?id=46394620</a>
> In C/C++, this doesn’t happen. When you allocate memory via `malloc()`, you get whatever was previously there.<p>What would break if the allocator zeroed it first? Do programs rely on malloc() giving them the data that was there before?
Is it true that Ubisoft got hacked and 900GB of data from their database was leaked due to MongoBleed? I am seeing a lot of posts on social media under the #ubisoft tags today. Can someone on HN confirm?
This has many similarities to the Heartbleed vulnerability: it involves trusting lengths supplied by an attacker, leading to unauthorized disclosure of data.
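The bug class, sketched in C (illustrative only, not the actual MongoDB or OpenSSL code):<p><pre><code> #include &lt;stdint.h&gt;
 #include &lt;string.h&gt;

 typedef struct {
     unsigned char payload[64];
     size_t payload_len;        // bytes actually received
 } request_t;

 // BUG: the reply is sized by the attacker-supplied claimed_len rather
 // than the real payload_len; when claimed_len &gt; payload_len, memcpy
 // reads past the payload and leaks adjacent heap memory to the client.
 void build_reply(const request_t *req, uint16_t claimed_len,
                  unsigned char *out) {
     memcpy(out, req-&gt;payload, claimed_len);
 }</code></pre>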
MongoDB has always sucked... But it's webscale (sic)<p>Do yourself a favour, use ToroDB instead (or even straight PostgreSQL's JSONB).
Have all Atlas clusters been auto-updated with a fix?
"MongoBleed Explained by an LLM"
If it is, it's less fluffy and empty than most of the LLM prose we're usually fed. It's well explained and has enough detail not to be overwhelming.<p>Honestly, aside from the "<emoji> impact" section that really has an LLM smell (but remember that some people legitimately do this, since it's in the LLM training corpus), this feels more like LLM-assisted (translated? reworded? grammar-checked?) than a pure "explain this" prompt.
I didn't use AI in writing the post.<p>I did some research with it, and used it to help create the ASCII art a bit. That's about it.<p>I was afraid that adding the emoji would trigger someone to think it's AI.<p>In any case, nowadays I basically always get at least one comment calling me an AI on a post that's relatively popular. I assume it's more a sign of the times than the writing...
Thank you for the clarification! I'm sorry for engaging in the LLM hunt, I don't usually do that. Please keep writing, this was a really good breakdown!<p>In hindsight, I would not even have thought about it if not for the comment I replied to. LLM prose fails to keep me reading whole paragraphs, and I find myself skipping roughly the second half of every paragraph, which was definitely not the case for your article. I did somewhat skip at the emoji heading, not because of LLMs, but because of a saturation of emojis in some contexts that don't really need them.<p>I should have written "this could be LLM assisted" instead of "this more feels like LLM assisted", but, well, words.<p>Again, sorry, and don't get discouraged by the LLM witch hunt.
I’m about ready to start flagging every comment that complains about the source material being LLM-generated. It’s tiresome, pointless, and adds absolutely nothing useful to the discussion.<p>If the material is wrong, explain why. Otherwise, shut up.
Though the source article was human written, the public exploit was developed with an LLM.<p><a href="https://x.com/dez_/status/2004933531450179931" rel="nofollow">https://x.com/dez_/status/2004933531450179931</a>