Before BlueStore, we ran Ceph on ZFS with the ZFS Intent Log on NVDIMM (basically non-volatile RAM backed by a battery). The performance was extremely good. Today we run BlueStore on ZVOLs on the same setup, and if the zpool is a "hybrid" pool we put the Ceph OSD databases on an all-NVMe zpool. The Ceph WAL wants a disk slice for each OSD, so we skip a separate Ceph WAL and instead consolidate incoming writes on the ZIL/SLOG on NVDIMM.
Also known as:<p>Write! No, fsync! No, really fsync I mean it!<p>Wait, why is my disk throughput so low? And why am I out of file descriptors?
The article is focused on Ceph, where the FS is a frontend to the storage backend(s). Now read the title again...
> Wait, why is my disk throughput so low?<p>Because many filesystems do fsync wrong, for reasons that are not inherent to filesystems in general.
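For anyone wondering what "really fsync, I mean it" tends to look like from the application side, here is a minimal sketch, assuming a POSIX system. The function name `durable_write` is made up for illustration; the pattern is write to a temp file, fsync it, rename it into place, then fsync the containing directory so the rename itself is persisted.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` and try to make it durable.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it to disk
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the new directory entry is persisted too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Every fsync here is a synchronous round trip to the device, which is a big part of why throughput drops when you do this per write instead of batching.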
It's easier to write the system's front end while paying little attention to the backend and "just" letting a local filesystem do a lot of the work for you, but it doesn't work well. The interesting question is whether the result is also a frontend-to-backend communication abstraction that is good enough to let you replace the backend with a better solution. I'm not familiar enough with Ceph and BlueStore to have a conclusion on that.<p>I happen to work for a distributed file-system company, and while I don't do the filesystem part itself, the old saying "it takes software 10 years to mature" is so true in this domain.
See also "Hierarchical File Systems are Dead" by Margo Seltzer and Nicholas Murphy: https://www.usenix.org/legacy/events/hotos09/tech/full_papers/seltzer/seltzer.pdf
No mention of LATCH theory? (Location, Alphabet, Time, Category, and Hierarchy)<p>Oddly, no matter how they are organized, their indices will always be a hierarchy (tree).<p>Personally, I think human brains just have a built-in hierarchical approach to categorization, so while other methods are definitely useful, they are an add-on, not a replacement.
Lots of these issues seem not to be specific to distributed systems and also affect local single-node systems. Notable examples are the PostgreSQL fsyncgate, or how mail servers struggled in the past (IIRC that was one of the cases where ReiserFS shined).
Noooo, really?<p>It all depends on what you want to do. For things that are already in files, like all the data that DeepSeek and other models train on (and for which DeepSeek open-sourced their own distributed file system), it makes sense to go with a distributed file system.<p>For OLTP you need a database with appropriate isolation levels.<p>I know someone will build a distributed file system on top of FoundationDB if they haven’t yet.
Around 2006 I built a FUSE filesystem that used MySQL as a backend, kept hashes of all files (not blocks, just whole files) and did deduplication. Good old times.
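For illustration, whole-file deduplication of that kind can be sketched roughly as below. This is not the original code: the class and method names are hypothetical, SHA-256 is an assumed hash choice, and two in-memory dicts stand in for the MySQL tables.

```python
import hashlib


class DedupStore:
    """Toy whole-file dedup store (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self.blobs = {}  # content digest -> file bytes (stored once)
        self.paths = {}  # file path -> content digest

    def put(self, path: str, data: bytes) -> str:
        # Hash the whole file, not individual blocks.
        digest = hashlib.sha256(data).hexdigest()
        # Store the content only if this digest has not been seen yet.
        self.blobs.setdefault(digest, data)
        self.paths[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        return self.blobs[self.paths[path]]


store = DedupStore()
store.put("/a/report.txt", b"same bytes")
store.put("/b/copy.txt", b"same bytes")
# Two paths now point at a single stored blob.
```

A real implementation would also need reference counting so a blob can be dropped once the last path pointing at it is deleted.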
Isn't the Cassandra file system something like that?
They did it atop Cassandra.
They have, at Exoscale. My officemate leads the team doing it.
Just use hypercore with hyperdrive. And be free!
It really is true. I spent years of my life wrangling a massive GlusterFS cluster and it was awful. You basically can't do any kind of file system operation on it that isn't CRUD on well-known, specific paths. Anything else (traversal, moving/copying, linking, updating permissions) would just hang forever. You're also at the mercy of the kernel driver, which does hate you personally. You will have nightmares about uninterruptible sleep. Migrating it all to S3 over Ceph was a beautiful thing.