  • gopalv 1 hour ago
    Mid 2015, I spent months optimizing Apache ORC's compression models over TPC-H.

    It was easier to beat Parquet's defaults - ORC+zlib seemed to top out around the same as the default in this paper (~178 GB for the 1 TB dataset, from the Hadoop conf slides).

    We got a lot of good results, but the hard lesson we learned was that scan rate is more important than size. A 16 KB read and a 48 KB read took about the same time, but CPU was used by other parts of the SQL engine; IO wasn't the bottleneck we thought it was.

    And scan rate is not always "how fast can you decode" - a lot of it was encouraging data skipping (see the Capacitor paper from the same era).

    For example, when organized correctly, the entire l_shipdate column took ~90 bytes for millions of rows.

    Similarly, the notes column was never read at all, so dictionaries etc. were useless.

    Then I learned the ins & outs of another SQL engine, which kicked the ass of every other format I'd ever worked with, without too much magical tech.

    Most of what I can repeat is that SQL engines don't care what order rows in a file are in - and neither should the format writer. Also, DBAs don't know which filters are the most useful to organize around, and they are often wrong.

    Re-ordering at the row level beats any other trick with lossless columnar compression, because if you can skip a row (with, say, an FSST for LIKE or CONTAINS into index values [1] instead of bytes), that is a nearly infinite improvement in the scan rate and IO.

    [1] - https://github.com/amplab/succinct-cpp
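    The data-skipping point above can be sketched with a toy min/max "zone map" scan. This is a generic illustration, not the actual ORC or Capacitor machinery - the chunk size, the integer date encoding, and the function names are all made up for the example - but it shows why row re-ordering, not decode speed, drives scan rate:

    ```python
    # Toy zone-map scan: each chunk (row group) records its min/max, and a
    # range predicate skips any chunk whose [min, max] cannot match.
    # Re-ordering rows by the filtered column clusters matches into few
    # chunks, so almost everything else is skipped without being read.
    import random

    def zone_map(values, chunk_size):
        """Split a column into chunks and record each chunk's min/max."""
        chunks = [values[i:i + chunk_size]
                  for i in range(0, len(values), chunk_size)]
        return [(min(c), max(c), c) for c in chunks]

    def scan(zmap, lo, hi):
        """Count rows in [lo, hi], skipping chunks the zone map rules out."""
        rows_read, hits = 0, 0
        for cmin, cmax, chunk in zmap:
            if cmax < lo or cmin > hi:
                continue  # whole chunk skipped: no decode, no IO
            rows_read += len(chunk)
            hits += sum(lo <= v <= hi for v in chunk)
        return hits, rows_read

    # Hypothetical l_shipdate-like column: dates as YYYYMMDD integers.
    random.seed(0)
    ship_dates = [random.randint(19920101, 19981231) for _ in range(100_000)]

    query = (19950101, 19950131)  # one month out of ~7 years
    unsorted_hits, unsorted_read = scan(zone_map(ship_dates, 1000), *query)
    sorted_hits, sorted_read = scan(zone_map(sorted(ship_dates), 1000), *query)

    assert unsorted_hits == sorted_hits  # same answer either way
    print("rows read, unsorted layout:", unsorted_read)
    print("rows read, sorted layout:  ", sorted_read)
    ```

    In the unsorted layout nearly every chunk straddles the queried month, so the zone map skips little; after sorting, the matching rows land in a handful of chunks and the rest of the column is never touched - the "nearly infinite" improvement the comment describes.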
  • pella 1 hour ago
    Thanks!

    Looks similar to OpenZL ( https://openzl.org/ ): "OpenZL takes a description of your data and builds from it a specialized compressor optimized for your specific format."