Why does C have the best file API

(maurycyz.com)

101 points by maurycyz 11 hours ago

27 comments

teo_zero 14 minutes ago
What a bizarre conclusion to draw! It's like saying that cars are the best means of transportation because you can travel to the Grand Canyon in them and the Grand Canyon is the best landscape in the world, and yes you could use other means to get there, but cars are what everybody's using.
If the real goal of TFA was to praise C's ability to reinterpret a chunk of memory (possibly mapped to a file) as another datatype, it would have been more effective to do so using C functions and not OS-specific system calls. For example:
<pre><code>FILE *f = fopen(...);
uint32_t *numbers;
fread(numbers, ..., f);
/* access numbers[...] */
fwrite(numbers, ..., f);
fclose(f);</code></pre>
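A fuller version of the pseudocode above might look like the following. It is only a sketch under assumptions that are not in the comment: the function name `increment_u32_file` is made up, the file is assumed to be a flat array of `uint32_t` in the host's byte order, and the whole file is read into one heap buffer (the pseudocode never says where `numbers` points).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: add 1 to every uint32_t stored in the file at `path`.
 * Assumes a flat array of host-endian uint32_t values.
 * Returns the number of values processed, or -1 on error. */
long increment_u32_file(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r+b");
    if (f == NULL)
        return -1;

    /* Find the file size so we know how many values to read. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return -1; }
    rewind(f);

    size_t count = (size_t)size / sizeof(uint32_t);
    uint32_t *numbers = malloc(count ? count * sizeof *numbers : 1);
    if (numbers == NULL) { fclose(f); return -1; }

    /* Read the whole array into the staging buffer. */
    if (fread(numbers, sizeof *numbers, count, f) != count) {
        free(numbers);
        fclose(f);
        return -1;
    }

    for (size_t i = 0; i < count; i++)
        numbers[i] += 1;

    /* An update stream needs a positioning call between reading and writing. */
    rewind(f);
    size_t written = fwrite(numbers, sizeof *numbers, count, f);
    free(numbers);
    if (fclose(f) != 0 || written != count)
        return -1;
    return (long)count;
}
```

Everything here is standard C, so it should build on platforms without `mmap`, at the cost of the staging buffer and the explicit write-back.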
- m-schuetz 1 minute ago
 This is way more cumbersome than mmap if you need to out-of-core process the file in non-sequential patterns. Way way more cumbersome, since you need to deal with intermediate staging buffers, and reuse them if you actually want to be fast. mmap, on the other hand, is absolutely trivial to use, like any regular buffer pointer. And at least on Windows, the mmap counterpart can be faster when processing the file with multiple threads, compared to fread.
 But I agree that it's a bizarre article, since mmap is not part of the C standard and relies on platform-dependent operating system APIs.
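For comparison, the "regular buffer pointer" style described here could be sketched as below. This assumes a POSIX system; the function name `mmap_increment_u32_file` is hypothetical, and the file is again assumed to hold host-endian `uint32_t` values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: add 1 to every uint32_t in the file, in place,
 * through a shared mapping. Returns the count, or -1 on error. */
long mmap_increment_u32_file(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; } /* a zero-length mmap fails */

    size_t len = (size_t)st.st_size;
    uint32_t *numbers = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (numbers == MAP_FAILED)
        return -1;

    /* No staging buffer: the file contents are addressed directly. */
    size_t count = len / sizeof(uint32_t);
    for (size_t i = 0; i < count; i++)
        numbers[i] += 1;

    /* Ask for the dirty pages to be written back before unmapping. */
    int rc = msync(numbers, len, MS_SYNC);
    if (munmap(numbers, len) != 0)
        rc = -1;
    return rc == 0 ? (long)count : -1;
}
```

Note that an I/O failure during the loop would surface as a signal rather than a return value, which is the trade-off other comments in this thread discuss.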
amluto 1 hour ago
I can’t entirely tell what the article’s point is. It seems to be trying to say that many languages can mmap bytes, but:
> (as far as I'm aware) C is the only language that lets you specify a binary format and just use it.
I assume they mean:
<pre><code>struct foo { fields; };
struct foo *data = mmap(…);</code></pre>
And yes, C is one of relatively few languages that let you do this without complaint, because it’s a terrible idea. And C doesn’t even let you specify a binary format; it lets you write a struct that will correspond to a binary format in accordance with the C ABI on your particular system.
If you want to access a file containing a bunch of records using mmap, and you want a well defined format and good performance, then use something actually intended for the purpose. Cap’n Proto and FlatBuffers are fast but often produce rather large output; protobuf and its ilk are more space efficient and very widely supported; Parquet and Feather can have excellent performance and space efficiency if you use them for their intended purposes. And everything needs to deal with the fact that, if you carelessly access mmapped data that is modified while you read it in any C-like language, you get UB.
- gopalv 32 minutes ago
 > correspond to a binary format in accordance with the C ABI on your particular system.
 We're so deep in this hole that people are fixing this on a CPU with silicon.
 The Graviton team made a little-endian version of ARM just to allow lazy code like this to migrate away from Intel chips without having to rewrite struct unpacking (and IBM did the same with ppc64le).
 Early in my career, I spent a lot of my time reading Java bytecode into little endian to match all the bytecode interpreter enums I had, and completely hating how 0xCAFEBABE would literally say BE BA FE CA (jokingly referred to as "be bull shit") in a (gdb) x view.
- dvt 42 minutes ago
 Had the same thought. Also confused at the backhanded compliment that pickle got:
 > Just look at Python's pickle: it's a completely insecure serialization format. Loading a file can cause code execution even if you just wanted some numbers... but still very widely used because it fits the mix-code-and-data model of python.
 Like, are they saying it's bad? Are they saying it's good? I don't even get it. While I was reading the post, I was thinking about pickle the whole time (and how terrible that idea is, too).
- Negitivefrags 42 minutes ago
 Why is it such a terrible idea?
 No need to add complexity, dependencies and reduced performance by using these libraries.
 - amluto 36 minutes ago
 Lots of reasons:
 The code is not portable between architectures.
 You can’t actually define your data structure. You can pretend with your compiler’s version of “pack” with regrettable results.
 You probably have multiple kinds of undefined behavior.
 Dealing with compatibility between versions of your software is awkward at best.
 You might not even get amazing performance. mmap is not a panacea. Page faults and TLB flushing are not free.
 You can’t use any sort of advanced data types; you get exactly what C gives you.
 Forget about enforcing any sort of invariant at the language level.
 - Negitivefrags 30 minutes ago
 I've written a lot of code using that method, and never had any portability issues. You use types that have the number of bits in their names.
 Hell, I've slung C structs across the network between 3 CPU architectures. And I didn't even use htons!
 Maybe it's not portable to some ancient architecture, but none that I have experienced.
 If there is undefined behavior, it's certainly never been a problem either.
 And I've seen a lot of talk about TLB shootdown, so I tried to reproduce those problems, but even with over 32 threads, mmap was still faster than fread into memory in the tests I ran.
 Look, obviously there are use cases for libraries like that, but a lot of the time you just need something simple, and writing some structs to disk can go a long way.
 - lifthrasiir 36 minutes ago
 No defined binary encoding, no guarantee about concurrent modifications, performance trade-offs (mmap is NOT always faster than sequential reads!) and more.
 - jraph 35 minutes ago
 Because a struct might not serialize the same way from one CPU architecture to another.
 The size of int can be different, the padding can be different, etc.
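One way to sidestep the padding and byte-order differences mentioned here is to write each field out byte by byte instead of dumping the struct. The sketch below is illustrative only: `struct record`, its fields, and the 5-byte little-endian layout are all made up for the example, not taken from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative in-memory record. On typical ABIs the compiler inserts
 * 3 padding bytes after `tag`, so sizeof(struct record) is often 8. */
struct record {
    uint8_t  tag;
    uint32_t value;
};

/* On-disk size chosen for this example: 1 byte tag + 4 bytes value. */
enum { RECORD_WIRE_SIZE = 5 };

/* Encode into exactly RECORD_WIRE_SIZE bytes, value in little-endian
 * order regardless of the host's endianness or struct padding. */
void record_encode(const struct record *r, uint8_t out[RECORD_WIRE_SIZE])
{
    assert(r != NULL && out != NULL);
    out[0] = r->tag;
    out[1] = (uint8_t)(r->value);
    out[2] = (uint8_t)(r->value >> 8);
    out[3] = (uint8_t)(r->value >> 16);
    out[4] = (uint8_t)(r->value >> 24);
}

/* Decode the same 5-byte layout back into the in-memory struct. */
void record_decode(const uint8_t in[RECORD_WIRE_SIZE], struct record *r)
{
    assert(in != NULL && r != NULL);
    r->tag = in[0];
    r->value = (uint32_t)in[1]
             | (uint32_t)in[2] << 8
             | (uint32_t)in[3] << 16
             | (uint32_t)in[4] << 24;
}
```

The encoded bytes are the same on every architecture, at the price of a copy per record, which is exactly the cost the "just mmap the struct" approach is trying to avoid.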
seba_dos1 9 hours ago
mmap is not a C feature, but POSIX. There are C platforms that don't provide mmap, and on those that do you can use mmap from other languages (there's an mmap module in Python's standard library, for example).
- ajross 8 hours ago
 I think this is sort of missing the point, though. Yes, mmap() is in POSIX[1] in the sense of "where is it specified".
 But mmap() was implemented in C because C is the natural language for exposing Unix system calls and mmap() is a syscall provided by the OS. And this is true up and down the stack. Best language for integrating with low level kernel networking (sockopts, routing, etc...)? C. Best language for async I/O primitives? C. Best language for SIMD integration? C. And it goes on and on.
 Obviously you can do this stuff (including mmap()) in all sorts of runtimes. But it always appears first in C and gets ported elsewhere. Because no matter how much you think your language is better, if you have to go into the kernel to plumb out hooks for your new feature, you're going to integrate and test it using a C rig before you get the other ports.
 [1] Given that the pedantry bottle was opened already, it's worth pointing out that you'd have gotten more points by noting that it appeared in 4.2BSD.
 - nickelpro 8 hours ago
 If we're going to be pedantic, mmap is a syscall. It happens that the C version is standardized by POSIX.
 The underlying syscall doesn't use the C ABI, you need to wrap it to use it from C in the same way you need to wrap it to use it from any language, which is exactly what glibc and friends do.
 Moral of the story is mmap belongs to the platform, not the language.
 - a-dub 8 hours ago
 it also appears in operating systems that aren't written in c. i see it as an operating system feature, categorically.
 - projektfu 8 hours ago
 Why does Ada have the best file API?
 https://github.com/AdaCore/florist/blob/master/libsrc/posix-memory_mapping.ads
Dwedit 8 hours ago
Using mmap means that you need to be able to handle memory access exceptions when a disk read or write fails. Examples of disk access that fails include reading from a file on a Wi-Fi network drive, a USB device with a cable that suddenly loses its connection when the cable is jiggled, or even a removable USB drive where all disk reads fail after it sees one bad sector. If you're not prepared to handle a memory access exception when you access the mapped file, don't use mmap.
- justmedep 8 hours ago
 You can even mmap a socket on some systems (iOS and macOS via GCD). But doing that is super fragile. Socket errors are swallowed.
 My interpretation always was that mmap should only be used for immutable and local files. You may still run into issues with those types of files but it’s very unlikely.
 - kelnos 4 hours ago
 mmap is also good for passing shared memory around. (You still need to be careful, of course.)
 - usefulcat 35 minutes ago
 It’s also great for when you have a lot of data on local storage, and a lot of different processes that need to access the same subset of that data concurrently.
 Without mmap, every process ends up caching its own private copy of that data in memory (think fopen, fread, etc). With mmap, every process accesses the same cached copy of that data directly from the FS cache.
 Granted this is a rather specific use case, but for this case it makes a huge difference.
- tremon 5 hours ago
 C doesn't have exceptions, do you mean signals? If not, I don't see how that is any different from having to handle I/O errors from write() and/or open() calls.
 - lights0123 5 hours ago
 Yes, it’s the SIGBUS signal.
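What "handling SIGBUS" can look like in practice is sketched below, assuming a POSIX system. The helper names `guarded_read_byte` and `demo_truncated_mapping` are hypothetical, the jump buffer is a single global (so this is not thread-safe), and the demo relies on Linux-style behaviour where touching a mapped page that lies beyond the file's new end raises SIGBUS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jmp, 1); /* unwind back into guarded_read_byte */
}

/* Hypothetical helper: read one byte from a mapping, returning -1
 * instead of crashing if the access raises SIGBUS. Not thread-safe. */
int guarded_read_byte(const volatile unsigned char *p, unsigned char *out)
{
    assert(p != NULL && out != NULL);

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    int rc;
    if (sigsetjmp(bus_jmp, 1) == 0) {
        *out = *p;  /* may fault if the backing file went away */
        rc = 0;
    } else {
        rc = -1;    /* we arrived here from the signal handler */
    }
    sigaction(SIGBUS, &old, NULL); /* restore the previous handler */
    return rc;
}

/* Demo: map a one-page temp file, truncate it, then read through the
 * stale mapping. Returns 0 if the first read worked and the second
 * one was caught as SIGBUS, -1 otherwise. */
int demo_truncated_mapping(void)
{
    char path[] = "/tmp/sigbus_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0) { close(fd); return -1; }

    unsigned char *m = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { close(fd); return -1; }

    unsigned char b = 0xff;
    int before = guarded_read_byte(m, &b);  /* file is intact here */
    int truncated = ftruncate(fd, 0);       /* pull the data away */
    int after = guarded_read_byte(m, &b);   /* access past new EOF */

    munmap(m, (size_t)page);
    close(fd);
    return (before == 0 && truncated == 0 && after == -1) ? 0 : -1;
}
```

Compared with checking the return value of `read()`, this needs process-wide signal state and a non-local jump for every guarded access, which is the awkwardness being pointed out in this subthread.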
- phoronixrly 7 hours ago
 Ah, reminds me of 'Are You Sure You Want to Use MMAP in Your Database Management System? (2022)' https://db.cs.cmu.edu/mmap-cidr2022/
Const-me 8 hours ago
I think the C# standard library is better. You can write the same unsafe code as in C: call the SafeBuffer.AcquirePointer method, then access the memory directly. Or you can take the safer and slightly slower route of calling the Read or Write methods of MemoryMappedViewAccessor.
All these methods are in the standard library, i.e. they work on all platforms. The C code is specific to POSIX; Windows supports memory mapped files too but the APIs are quite different.
- SvenL 8 hours ago
 I think you don’t need to be unsafe, they have a normal API for it.
 https://learn.microsoft.com/en-us/dotnet/standard/io/memory-mapped-files
 - Const-me 6 hours ago
 Indeed, but these normal APIs have runtime costs for bounds checking. For some use cases, unsafe can be better. For instance, the last time I used a memory-mapped file was for a large immutable Bloom filter. I knew the file should be exactly 4GB, validated that in the constructor, and then, since each query tests 12 bits from random locations of the mapped file, I opted for unsafe code.
zahlman 8 hours ago
Aside from what https://news.ycombinator.com/item?id=47210893 said, mmap() is a low-level design that makes it easier to work with files that don't fit in memory and fundamentally represent a single homogeneous array of some structure. But it turns out that files commonly do fit in memory (nowadays you commonly have on the order of ~100x as much disk as memory, but millions of files); and you very often want to read them in order, because that's the easiest way to make sense of them (and tape is not at all the only storage medium historically that had a much easier time with linear access than random access); and you need to parse them because they don't represent any such array.
When I was first taught C formally, they definitely walked us through all the standard FILE* manipulators and didn't mention mmap() at all. And when I first heard about mmap() I couldn't imagine personally having a reason to use it.
- wahern 8 minutes ago
 > But it turns out that files commonly do fit in memory
 The difference between slurping a file into malloc'd memory and just mmap'ing it is that the latter doesn't use up anonymous memory. Under memory pressure, the mmap'd file can just be evicted and transparently reloaded later, whereas if it was copied into anonymous memory it either needs to be copied out to swap or, if there's not enough swap (e.g. if swap is disabled), the OOM killer will be invoked to shoot down some (often innocent) process.
 If you need an entire file loaded into your address space, and you don't have to worry about the file being modified (e.g. have to deal with SIGBUS if the file is truncated), then mmap'ing the file is being a good citizen in terms of wisely using system resources. On a system like Linux that aggressively buffers file data, there likely won't be a performance difference if your memory usage assumptions are correct, though you can use madvise & friends to hint to the kernel. If your assumptions are wrong, then you get graceful performance degradation (back pressure, effectively) rather than breaking things.
 Are you tired of bloated software slowing your systems to a crawl because most developers and application processes think they're special snowflakes that will have a machine all to themselves? Be part of the solution, not part of the problem.
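The "map it read-only and hint the kernel" pattern described above might be sketched like this. It assumes a POSIX system and uses `posix_madvise`, the portable counterpart of `madvise`; the function name `sum_file_bytes` and the byte-summing workload are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sum all bytes of a file through a read-only
 * mapping. Returns 0 on success, -1 on error. */
int sum_file_bytes(const char *path, uint64_t *sum)
{
    assert(path != NULL && sum != NULL);
    *sum = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; } /* nothing to map */

    size_t len = (size_t)st.st_size;
    const unsigned char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED)
        return -1;

    /* Advisory only: say we will read front to back. The kernel is
     * free to ignore it, so the return value is not treated as fatal. */
    (void)posix_madvise((void *)data, len, POSIX_MADV_SEQUENTIAL);

    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];

    munmap((void *)data, len);
    *sum = total;
    return 0;
}
```

Because the pages are clean and file-backed, the kernel can drop them under memory pressure and fault them back in later, rather than pushing a private copy out to swap.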
- nickelpro 8 hours ago
 mmap is also relatively slow (compared to modern solutions, io_uring and friends), and immensely painful for error handling.
 It's simple, I'll give it that.
 - nixon_why69 3 hours ago
 Page faults are slower than being deliberate about your I/O, but mapped memory is no faster or slower than "normal" memory; it's the same mechanism.
okanat 7 hours ago
I guess the author didn't use that many other programming languages or OSes. You can do the same even in garbage collected languages like Java and C# and on Windows too.
https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html
https://learn.microsoft.com/en-us/dotnet/api/system.io.memorymappedfiles.memorymappedfile?view=net-10.0
https://learn.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-object
Memory mapping is very common.
- dekhn 7 hours ago
 mmap is a built-in module in Python! Also true for Perl.
- paulddraper 3 hours ago
 And since Java 4 no less.
alanfranz 8 hours ago
Well... I'm not sure what the author really wants to say. mmap is available in many languages (e.g. Python) on Linux (and many other *nix I suppose). C provides you with raw memory access, so using mmap is sort-of-convenient for this use case.
But if you use Python then, yes, you'll need a bytearray, because Python doesn't give you raw access to such memory - and I'm not sure you'd want to mmap a PyObject anyway?
Then, writing and reading this kind of raw memory can be kind of dangerous and non-portable - I'm not really sure that the pickle analogy even makes sense. I very much suppose (I've never tried) that if you mmap-read malicious data in C, a vulnerability would be _quite_ easy to exploit.
- okanat 7 hours ago
 Creating memory mapped files has been a very common OS feature since the 90s. Many high-level languages expose it in an OS-agnostic way, POSIX or not.
 - racingmars 6 hours ago
 > very common OS feature since 90s
 And if you want to go farther back, even if it wasn't called "mmap" or a specific function you had to invoke -- there were operating systems that used a "single-level store" (notably MULTICS and IBM's AS/400..err OS/400... err i5 OS... err today IBM i [seriously, IBM, pick a name and stick with it]) where the interface to disk storage on the platform is that the entire disk storage/filesystem is always mapped into the same address space as the rest of your process's memory. Memory-mapped files were basically the only interface there was, and the operating system "magically" persisted certain areas of your memory to permanent storage.
- bvrmn 8 hours ago
 Actually, in Python you can recast a bytearray (zero-copy) as another primitive C type, or even as an arbitrary structure, using the ctypes module.
Perenti 50 minutes ago
It may have a tidy mmap api, but Smalltalk has a much better file api through its Streams hierarchy IMHO. You can create a stream on a diskfile, you can create a stream on a byteArray, you can create a stream on standard Unix streams, you can create a stream on anything where "next" makes sense.
nickelpro 8 hours ago
> Why does C have the best file API
> Look inside
> Platform APIs
Ok.
I agree platform APIs are better than most generic language APIs at least. I disagree on mmap being the "best".
castral 8 hours ago
I think OP and I have very divergent opinions on what makes a file API "best". This may have been the best 30 years ago. The world has moved on.
ibejoeb 8 hours ago
The article only touches on `open` and `close` and doesn't deal with any of the realities of file access. Not a particularly compelling write-up.
srean 9 hours ago
> However, in other most languages, you have to read() in tiny chunks, parse, process, serialize and finally write() back to the disk. This works, but is verbose and needlessly limited
C has those too and I am glad that it does. This is what allows one to do other things while the buffer gets filled, without the need for multithreading.
Yes, easier standardized portable async interfaces would have been nice; I'm not sure how well supported they are.
- general_reveal 8 hours ago
 Wouldn’t we need to implement all of that extra stuff if we really wanted to work with text from files? I have a use case where I do need extra fast text input/output from files. If anyone has thoughts on this, I’d love it.
 - srean 8 hours ago
 The standard way is to use libraries like libevent or libuv that wrap system calls such as epoll, kqueue etc.
 The other palatable way is to register consumer coroutines on a system-provided event loop. In C one does so with macro magic, or using stack switching with the help of a tiny bit of inline assembly.
 Take a look at Simon Tatham's page on coroutines in C.
 To get really fast you may need to bypass the kernel. Or have more control over the event loop / scheduler. Database implementations would be the place to look.
chuckadams 8 hours ago
It has the best API for the author, that's for sure. One size does not fit all: believe it or not, different files have different uses. One does not mmap a pipe or /dev/urandom.
mmastrac 9 hours ago
"best file API" and the man page for the O_ flags disagree.
FrankWilhoit 9 hours ago
A file API is not the same thing as a filesystem API. The holy grail is still a universal but high(-enough)-level filesystem API.
kelnos 4 hours ago
mmap() is useful for some narrow use-cases, I think, but error-handling is a huge pain. I don't want to have to deal with SIGBUS in my application.
I agree that the model of mmap() is amazing, though: being able to treat a file as just a range of bytes in memory, while letting the OS handle the fussy details, is incredibly useful. (It's just that the OS doesn't handle all of the fussy details, and that's a pain.)
charcircuit 8 hours ago
C's API does not include mmap, nor does it contain any API to deal with file paths, nor does it contain any support for opening up a file picker. This, paired with C's bad string support, results in it being one of the worst file APIs.
Also, using mmap is not as simple as the article lays out. For example, what happens when another process modifies the file and now your process's mapped memory consists of parts of 2 different versions of the file at the same time? You also need to build a way to know how to grow the mapping if you run out of room. You also want to be able to handle failures to read or write. This means you pretty much will need to reimplement fread and fwrite, going back to the approach the author didn't like: "This works, but is verbose and needlessly limited to sequential access." So it turns out "It ends up being just a nicer way to call read() and write()" is only true if you ignore the edge cases.
andersmurphy 8 hours ago
mmap is nice. But I find sqlite is a better filesystem API [1]. If you are going to use mmap, why not take it further and use LMDB? Both have bindings for most languages.
[1] - https://sqlite.org/fasterthanfs.html
teunispeters 7 hours ago
technically yes, because there's a failure path for every single failure that an OS knows about. And most others aren't so resilient. However, mmap bypasses a lot of that....
koakuma-chan 8 hours ago
How do you handle read/write errors with mmap?
- Falcondor 8 hours ago
 File I/O errors on an mmap'd region manifest as signals (for example SIGBUS or SIGSEGV).
 So if you wanted to handle file read/write errors you would need to implement signal handlers.
 https://stackoverflow.com/questions/6791415/how-do-memory-mapped-files-handle-i-o-errors
 - koakuma-chan 8 hours ago
 ... which is not great for an API.
- icedchai 7 hours ago
 In my experience, having worked with a large system that used almost exclusively mmap for I/O, you don’t. The process segfaults and is restarted. In practice it almost never happened.
TZubiri 6 hours ago
Here's a negative signal I'm seeing often:
When a developer that usually consumes the language starts critiquing the language.
I could go on as to why it's a bad signal, psychologically, but let's just say that empirically it usually doesn't come from a good place, it's more like a developer raising the stakes of their application failing and blaming the language.
Sure one out of a thousand devs might be the next Linus Torvalds and develop the next Rust. But the odds aren't great.
nice_byte 7 hours ago
mmap is not a language feature. it is also full of its own pitfalls that you need to be aware of. recommended reading: https://db.cs.cmu.edu/mmap-cidr2022/
jccx70 7 hours ago
This is like: I discovered the wheel and want to let you know!
userbinator 8 hours ago
> It still works if the file doesn't fit in RAM
No it doesn't. If you have a file that's 2^36 bytes and your address space is only 2^32, it won't work.
On a related digression, I've seen so many cases of programs that could've handled infinitely long input in constant space instead implemented as some form of "read the whole input into memory", which unnecessarily puts a limit on the input length.
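As a small illustration of the constant-space style mentioned in this comment, the sketch below counts newlines in a stream of any length using one fixed buffer. The function name `count_newlines` and the 4 KiB buffer size are arbitrary choices for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: count '\n' bytes in a stream using a fixed
 * 4 KiB buffer, so memory use does not grow with the input length.
 * Returns the count, or -1 on a read error. */
long long count_newlines(FILE *in)
{
    assert(in != NULL);

    unsigned char buf[4096];
    long long lines = 0;
    size_t n;

    /* Each iteration reuses the same buffer for the next chunk. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                lines++;
        }
    }
    return ferror(in) ? -1 : lines;
}
```

This works on pipes and other non-seekable inputs as well, which neither a whole-file read nor a mapping can handle.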
- dekhn 7 hours ago
 Address space size and RAM are two different things.
 - dataflow 5 hours ago
 What they said is correct regardless of that though?
 - conradludgate 1 hour ago
 The point the article makes is that a 32GB file can be mmapped even if you only have 8GB of memory available - it wasn't talking about the address space. So the response is irrelevant even if technically correct.
- actionfromafar 8 hours ago
 You can mmap with an offset for that case. Just FYI, in case anyone thought it was a hard limit.
- jiggawatts 7 hours ago
 All memory map APIs support moveable “windows” or views into files that are much larger than either physical memory or the virtual address space.
 I’ve seen otherwise competent developers use compile time flags to bypass memmap on 32-bit systems even though this always worked! I dealt with database engines in the 1990s that used memmap for files tens of gigabytes in size.
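A moveable window of the kind described here could be sketched with POSIX `mmap` as follows. The helper name `read_byte_via_window` is hypothetical, and a real implementation would keep a larger window mapped and reuse it rather than mapping one page per access.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the byte at `offset` by mapping only the
 * page that contains it. Returns 0 on success, -1 on error or if the
 * offset lies beyond the end of the file. */
int read_byte_via_window(int fd, off_t offset, unsigned char *out)
{
    assert(fd >= 0 && offset >= 0 && out != NULL);

    struct stat st;
    if (fstat(fd, &st) != 0 || offset >= st.st_size)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* The file offset passed to mmap must be page-aligned, so round
     * down and remember how far into the window the target byte is. */
    off_t window_start = offset - (offset % page);
    size_t delta = (size_t)(offset - window_start);

    unsigned char *window = mmap(NULL, (size_t)page, PROT_READ,
                                 MAP_PRIVATE, fd, window_start);
    if (window == MAP_FAILED)
        return -1;

    *out = window[delta];
    munmap(window, (size_t)page);
    return 0;
}
```

Only one page of address space is used at a time, so the file itself can be far larger than the address space as long as `off_t` is wide enough to express the offset.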