In mid-2015, I spent months optimizing Apache ORC's compression models over TPC-H.<p>It was easy enough to beat Parquet's defaults - ORC+zlib seemed to top out around the same as the default in this paper (~178 GB for the 1 TB dataset, from the Hadoop conf slides).<p>We got a lot of good results, but the hard lesson we learned was that scan rate matters more than size. A 16 KB read and a 48 KB read took about the same time, while the CPU was busy in other parts of the SQL engine; IO wasn't the bottleneck we thought it was.<p>And scan rate is not always "how fast can you decode" - a lot of it comes from encouraging data skipping (see the Capacitor paper from the same era).<p>For example, when organized correctly, the entire l_shipdate column took ~90 bytes for millions of rows.<p>Similarly, the notes column was never read at all, so dictionaries etc. were useless.<p>Then I learned the ins & outs of another SQL engine, which kicked the ass of every other format I'd ever worked with, without too much magical tech.<p>Most of what I can repeat is that SQL engines don't care what order the rows in a file are in & neither should the format writer - also that DBAs don't know which filters are the most useful to organize around & are often wrong.<p>Re-ordering at the row level beats any other trick in lossless columnar compression, because if you can skip a row entirely (say, with an FSST for LIKE or CONTAINS against index values[1] instead of raw bytes), that is a nearly infinite improvement in scan rate and IO.<p>[1] - <a href="https://github.com/amplab/succinct-cpp" rel="nofollow">https://github.com/amplab/succinct-cpp</a>
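<p>The row-reordering effect is easy to demonstrate on synthetic data - this is just a hedged sketch with Python's stdlib zlib standing in for a real columnar codec, and a made-up two-column table standing in for TPC-H, but the mechanism is the same: sorting rows by a low-cardinality column turns that column into long runs, which any general-purpose compressor crushes.

```python
import random
import zlib

# Synthetic table: (low-cardinality key, payload) - a stand-in for
# something like (l_shipdate, the rest of the row). Not ORC/Parquet.
random.seed(42)
rows = [(random.randrange(30), random.randrange(10**6))
        for _ in range(100_000)]

def compressed_column_size(rows):
    # Encode the first column column-wise, as a columnar writer would,
    # then compress it with a generic codec.
    col = bytes(r[0] for r in rows)
    return len(zlib.compress(col))

unsorted_size = compressed_column_size(rows)
# Re-order rows by the key column before writing - the file format
# doesn't care about row order, so the writer is free to do this.
sorted_size = compressed_column_size(sorted(rows))

print(unsorted_size, sorted_size)
```

On this toy data the sorted column compresses more than an order of magnitude smaller than the unsorted one, and with min/max statistics per chunk the sorted layout also lets a reader skip whole ranges outright.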