From the article:<p>> I was later surprised all the real world find implementations I examined use tree-walk interpreters instead.<p>I’m not sure why this would be surprising. The find utility is totally dominated by disk IOPS. The interpretation performance of find conditions is totally swamped by reading stuff from disk. So, keep it simple and just use a tree-walk interpreter.
The assumption that "find" performance is dominated by disk IOPS is not generally valid.<p>For instance, I normally compile big software projects in RAM disks (Linux tmpfs), because I typically use computers with no less than 64 GB of DRAM.<p>Such big software projects may have a very large number of files and subdirectories, and their build scripts may use "find".<p>In such a case there are no SSD or HDD I/O operations; everything is done in main memory, so the intrinsic performance of "find" may matter.
Is it truly simpler to do that? A separate “command line to byte codes” module would be way easier to test than one that also does the work, including making any necessary syscalls.<p>Also, decreasing CPU usage may not speed up <i>find</i> (much), but it would leave more time for running other processes.
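To make the "separate module" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the instruction names (NAME, TYPE, JMP_FALSE, JMP_TRUE), the helper names compile_expr and run, and the tiny supported subset (-name, -type, implicit AND, -o). It is not how any real find is implemented.

```python
from fnmatch import fnmatchcase

def compile_expr(args):
    """Compile tokens such as ['-type', 'f', '-name', '*.c'] into a flat
    instruction list. Hypothetical subset: -name PAT, -type T, implicit AND, -o."""
    code = []
    chain_jumps = []  # JMP_FALSE slots to patch with the end of the current AND chain
    end_jumps = []    # JMP_TRUE slots to patch with the end of the whole program
    have_test = False
    i = 0
    while i < len(args):
        tok = args[i]
        if tok == "-o":
            if not have_test:
                raise ValueError("-o needs a test on its left")
            # A failed AND chain lands here and falls through to the next chain.
            for slot in chain_jumps:
                code[slot] = ("JMP_FALSE", len(code))
            chain_jumps = []
            end_jumps.append(len(code))
            code.append(None)  # placeholder, patched after the loop
            have_test = False
            i += 1
            continue
        if tok not in ("-name", "-type") or i + 1 >= len(args):
            raise ValueError("unsupported or incomplete test: " + tok)
        if have_test:
            # Implicit AND: skip the rest of the chain once a test has failed.
            chain_jumps.append(len(code))
            code.append(None)
        code.append((tok[1:].upper(), args[i + 1]))
        have_test = True
        i += 2
    if args and not have_test:
        raise ValueError("-o needs a test on its right")
    for slot in chain_jumps:
        code[slot] = ("JMP_FALSE", len(code))
    for slot in end_jumps:
        code[slot] = ("JMP_TRUE", len(code))
    return code

def run(code, name, ftype):
    """Evaluate compiled code against one entry. No syscalls happen here."""
    result = True  # an empty expression matches everything
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == "NAME":
            result = fnmatchcase(name, arg)
        elif op == "TYPE":
            result = (ftype == arg)
        elif op == "JMP_FALSE" and not result:
            pc = arg
        elif op == "JMP_TRUE" and result:
            pc = arg
    return result
```

With this sketch, compile_expr(["-name", "*.c", "-o", "-type", "d"]) yields [("NAME", "*.c"), ("JMP_TRUE", 3), ("TYPE", "d")], so the compiler can be checked by comparing plain lists, without touching a filesystem.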
If it was easier to interpret byte codes, nobody would use a tree-walk interpreter. There’s no performance reason to use a tree-walk interpreter. They all do it because it’s easy. You basically already have the expression in tree form, regardless of where you end up. So, stop processing the tree and just interpret it.
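For comparison, a tree-walk evaluator over the same kind of expression can be sketched in a few lines of Python. The node classes (Name, Type, And, Or, Not) and the evaluate function are hypothetical names chosen for illustration, and the entry is reduced to a name and a type letter rather than a real stat result.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Hypothetical AST node types for a tiny subset of find expressions.
@dataclass
class Name:
    pattern: str

@dataclass
class Type:
    kind: str

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

@dataclass
class Not:
    operand: object

def evaluate(node, name, kind):
    """Walk the tree recursively; `and`/`or` give short-circuiting for free."""
    if isinstance(node, Name):
        return fnmatchcase(name, node.pattern)
    if isinstance(node, Type):
        return kind == node.kind
    if isinstance(node, And):
        return evaluate(node.left, name, kind) and evaluate(node.right, name, kind)
    if isinstance(node, Or):
        return evaluate(node.left, name, kind) or evaluate(node.right, name, kind)
    if isinstance(node, Not):
        return not evaluate(node.operand, name, kind)
    raise TypeError("unknown node: " + repr(node))

# Roughly: ( -type f -name '*.c' ) -o ! -type f
tree = Or(And(Type("f"), Name("*.c")), Not(Type("f")))
```

The parser already has to produce something like `tree`, so the evaluator is just one recursive function on top of it, with no jump patching or instruction encoding to get right.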
File operations are a good candidate for testing with side effects since they ship with every OS and are not very expensive in a tmpfs, but you don't have to let the evaluator perform side effects. You could pass the eval function a delegate whose methods it calls to perform side effects, and pass in a mocked delegate during testing.
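A rough Python sketch of that delegate idea follows. The class names (RealActions, RecordingActions), the eval_entry function, and the two supported actions are all assumptions made up for this example, not taken from any actual find implementation.

```python
import os
from fnmatch import fnmatchcase

class RealActions:
    """Delegate that performs the actual side effects."""
    def print_path(self, path):
        print(path)

    def delete(self, path):
        os.remove(path)

class RecordingActions:
    """Mock delegate for tests: records calls instead of touching anything."""
    def __init__(self):
        self.calls = []

    def print_path(self, path):
        self.calls.append(("print", path))

    def delete(self, path):
        self.calls.append(("delete", path))

def eval_entry(pattern, actions, path, delegate):
    """If the basename matches `pattern`, run each action through `delegate`.

    `actions` is a list such as ['-print', '-delete'] (hypothetical subset).
    Returns True when the entry matched.
    """
    if not fnmatchcase(os.path.basename(path), pattern):
        return False
    for action in actions:
        if action == "-print":
            delegate.print_path(path)
        elif action == "-delete":
            delegate.delete(path)
        else:
            raise ValueError("unsupported action: " + action)
    return True

# In a test, nothing is printed or removed; the calls are just recorded.
mock = RecordingActions()
eval_entry("*.tmp", ["-print", "-delete"], "/build/cache/a.tmp", mock)
```

Production code would pass RealActions() instead, so the evaluator itself never needs to know whether it is running for real or under test.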
Yeah that's basically what was discussed here: <a href="https://lobste.rs/s/xz6fwz/unix_find_expressions_compiled_bytecode" rel="nofollow">https://lobste.rs/s/xz6fwz/unix_find_expressions_compiled_by...</a><p>And then I pointed to this article on databases: <a href="https://notes.eatonphil.com/2023-09-21-how-do-databases-execute-expressions.html" rel="nofollow">https://notes.eatonphil.com/2023-09-21-how-do-databases-exec...</a><p>Even MySQL, DuckDB, and CockroachDB apparently use tree-walking to evaluate expressions, not bytecode!<p>Probably for the same reason - many parts are dominated by I/O, so the work on optimization goes elsewhere.<p>And MySQL is a super-mature codebase.
I was just reading a paper about compiling SQL queries (actually about a fast compilation technique that allows for full compilation to machine code that is suitable for SQL and WASM): <a href="https://dl.acm.org/doi/pdf/10.1145/3485513" rel="nofollow">https://dl.acm.org/doi/pdf/10.1145/3485513</a><p>Sounds like many DBs do some level of compilation for complex queries. I suspect this is because SQL has primitives that actually compute things (e.g. aggregations, sorts, etc.). But find does basically none of that. Find is completely IO-bound.
Virtually all databases compile queries in one way or another, but they vary in the nature of their approaches. SQLite for example uses bytecode, while Postgres and MySQL both compile it to a computation tree which basically takes the query AST and then substitutes in different table/index operations according to the query planner.<p>SQLite talks about the reasons for each variation here: <a href="https://sqlite.org/whybytecode.html" rel="nofollow">https://sqlite.org/whybytecode.html</a>