Removing newlines in FASTA file increases ZSTD compression ratio by 10x

(log.bede.im)

279 points by bede148 days ago

21 comments

jefftk145 days ago
The FASTA format looks like:<pre><code> > title bases with optional newlines > title bases with optional newlines ... </code></pre> The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
- LeifCarrotson145 days ago
 In case "bases with optional newlines" wasn't obvious to anyone else, a specific example (from Wikipedia) is:<pre><code> ;LCBO - Prolactin precursor - Bovine MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* </code></pre> where "SS...EM", HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.Or, for an even simpler example:<pre><code> ; plaintext GATTAC AGATTA CAGATT ACCAGA TTACAG ATTACA </code></pre> becomes, on disk, something like<pre><code> ; plaintext\r\nGATTAC\r\nAGATTA\r\nCAGATT\r\nACCAGA\r\nTTACAG\r\nATTACA\r\n </code></pre> which is hard to compress, while<pre><code> ; plaintext\r\nGATTACAGATTACAGATTACCAGATTACAGATTACA </code></pre> is just<pre><code> "; plaintext\r\n" + "GATTACA" * 7 </code></pre> and then, if you want, you can reflow the text when it's time to render to the screen.
 - tgtweak145 days ago
 Feels like it could be an extension to the compression lib (and would retain newlines as such) vs requiring external document tailoring. Also feels like a very specific use case but this optimization might have larger applications outside this narrow field/format.
 - Terr_145 days ago
 Huh, so in other words: "If you don't arbitrarily interrupt continuous sequences of data with cosmetic noise, they compress better."
 - spatoa144 days ago
 When working with data, I definitely prefer the UI to adapt to the data. I never save anything for the display back.
- bede145 days ago
 Thank you for clarifying this – yes the non-semantic nature of these particular line breaks is a key detail I omitted.
 - tialaramex145 days ago
 It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking like PNG - the PNG compression is "just" zlib but for RGBA that wouldn't do a great job, however there's a (per row) filter step first, so e.g. we can store just the difference from the row above, now big areas of block colour or vertical stripes are mostly zeros and those compress well.Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.I can imagine a filter layer which says this textual data comes in 78 character blocks punctuated with \n so we're going to strip those out, then compress and in the opposite direction we decompress then put back the newlines.For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.
 - dekhn145 days ago
 For one approach to compressing FASTQ (which is a series 3 distinct lines), we broke the distinct lines each into their own streams and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case, we simply combined "blocks" of streams together with a record separator.I'm still pretty amazed that periodic newlines hurt compression ratios so much, given the compressor can use both a huffman coding and a lookback dictionary.The best rule in sequence data storage is to store as little of it as possible.
- AndrewOMartin145 days ago
 The compression ratio will likely skyrocket if you sorted the list of bases.
 - shellfishgene145 days ago
 You're joking, but a few bioinformatics tools use the Burrows-Wheeler transform to save memory, which is a bit like sorting the bases.
 - jefftk145 days ago
 You can also improve compression by reordering the sequences within the FASTA file, as long as you're using it as a dictionary and not a list of title-sequence pairs.
- mylons145 days ago
 this is also the insight that the bwa developer had, to use the burrows-wheeler transform which is part of bzip2 due to it's compression properties being particularly good for genomic sequences.
 - dekhn145 days ago
 I once had the distinct pleasure of hosting the author of BWA (R. Durbin) at Google, and pointing out "That's Mike Burrows, over there, next to Jeff Dean and Sanjay Ghemawat". That led to an interesting discussion between Durbin and Dean on DNA sequence compression. It's not the first time I've been in a room with a bunch of geniuses and simply kept my mouth shut so nobody would know I'm an idiot.
 - mylons144 days ago
 fwiw Heng Li was "the author." He was a postdoc under Durbin's professorship. I was around when bwa was developed and was working in a collaboration with Heng Li (I was working on SOLiD R&D). Any development emails were between Heng and our team, we never spoke with Durbin.first author on a paper like this indicates the most significant contributions <a href="https://academic.oup.com/bioinformatics/article/25/14/1754/225615?login=false" rel="nofollow">https://academic.oup.com/bioinformatics/article/25/14/1754/2...</a>
 - nerpderp82144 days ago
 But you're the genius I keep my mouth shut around!
- amelius145 days ago
 This was a question that I thought was interesting enough to test ChatGPT with.Surprisingly, it gave an answer along the lines of the parent comment.However, it seems it didn't figure this out by itself but it says:> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
felixhandte145 days ago
This is because Zstd's long-distance matcher looks for matching sequences of 64 bytes [0]. Because long matching sequences of the data will likely have the newlines inserted in different offsets in the run, this totally breaks Zstd's ability to find the long-distance match.Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.[0] <a href="https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zstd_ldm.c#L20" rel="nofollow">https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs...</a>(I am one of the maintainers of Zstd.)
- nerpderp82145 days ago
 That is fascinating. I wonder if you could layer a Levenshtein State Machine on the strings so you can apply n-edits to the text to get longer matches.I absolutely adore ZSTD, it has worked so well for me compressing json metadata for a knowledge engine.
 - felixhandte145 days ago
 Zstd has a similar-ish capability called "repetition codes" [0].The first stage of Zstd does LZ77 matching, which transforms the input into "sequences", a series of instructions each of which describes some literals and one match. The literals component of the instruction says "the next L bytes of the message are these L bytes". The match component says "the next M bytes of the input are the M bytes N bytes ago".If you want to construct a match between two strings that differ by one character, rather than saying "the next N bytes are the N bytes M bytes ago except for this one byte here which is X instead", Zstd just breaks it up into two sequences, the first part of the match, and then a single literal byte describing the changed byte, and then the rest of the match, which is described as being at offset 0. The encoding rules for Zstd define offset 0 to mean "the previously used match offset". This isn't as powerful as a Levenshtein edit, but it's a reasonable approximation.The big advantage of this approach is that it doesn't require much additional machinery on the encoder or decoder, and thus remains very fast. Whereas implementing a whole edit description state machine would (I think) slow down decompression and especially compression enormously.[0] <a href="https://datatracker.ietf.org/doc/html/rfc8878#name-repeat-offsets" rel="nofollow">https://datatracker.ietf.org/doc/html/rfc8878#name-repeat-of...</a>
 - bede144 days ago
 Fascinating, thank you.
mfld145 days ago
<pre><code> Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression reducing compatibility somewhat. </code></pre> Interesting. Any idea why this can't be stored in the metadata of the compressed file?
- lifthrasiir145 days ago
 It is stored in the metadata [1], but anything larger than 8 MiB is not guaranteed to be supported. So there has to be an out-of-band agreement between compressor and decompressor.[1] <a href="https://datatracker.ietf.org/doc/html/rfc8878#name-window-descriptor" rel="nofollow">https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...</a>
 - mfld145 days ago
 Thanks! So the --long essentially signals the decoder "I am willing to accept the potentially large memory requirements implied by the given window size"
 - pbronez145 days ago
 Seems useful for games marketplaces like Steam and Xbox. You control the CDN and client, so you can use tricky but effective compression settings all day long.
 - chronogram145 days ago
 For internal use like that you can also use the library feature. The downside of using long=31 is increased memory usage, which might not be desirable for customer facing applications like Steam.
- nolist_policy145 days ago
 It uses more memory (up to +2gb) during decompression as well -> potential DoS.
 - Aachen145 days ago
 Sending a .zip filled with all zeroes, so it compresses extremely well, is a well-known DoS historically (zip bomb, making the server run out of space in trying to read the archive)You always need resource limits when dealing with untrusted data. RAM is one of the obvious ones. They could introduce a memory limit parameter; require passing --long with a value equal to or greater than what the stream requires to successfully decompress; require seeking support for the input stream so they can look back that way (TMTO); fall back to using temp files; or interactively prompt the user if there's a terminal attached. Lots of options, each with pros and cons of course, that would all allow a scenario where the required information for the decoder is stored in the compressed data file
 - dzaima145 days ago
 The decompressed output needn't be in-memory (or even on-disk; it could be directly streamed to analysis) all at the same time, at which point resource limits aren't a problem at all. And I believe --long already is a "grater than or equal to" value, and should also be effectively a memory limit (or pretty close to one at least).Seeking back in the input might theoretically work, but I feel like that could easily get very bad (aka exponential runtime); never mind needing actual seeking.
ashvardanian145 days ago
Nice observation!Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
- jltsiren145 days ago
 Basic text formats persist, because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general purpose formats, but the people proposing them rarely agree which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.
 - vintermann145 days ago
 Also for historical reasons I think, since Perl used to be the big bioinformatics language, and it is surprisingly hard to compete with in string handling.
 - lazide145 days ago
 Perl+strings really is one of those ‘unreasonably effective’ combinations.It feels like Benzene in some ways. Use it correctly and gdamn. Just don’t huff it - i mean - use it for your enterprise backend - and it’s worth it.
- bede145 days ago
 Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)I even confused myself about this while writing :-)
 - chrchang523145 days ago
 Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
Aachen145 days ago
I've also noticed this. Zstandard doesn't see very common patternsFor me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytesOf course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issuesBzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but also decompressing is an exercise in patience so I've moved to Zstandard at the standard thing to use
- pajko145 days ago
 Bzip2 performs exactly better because it rearranges the input to achieve better pattern matches: <a href="https://en.m.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform" rel="nofollow">https://en.m.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_tran...</a>
 - vintermann144 days ago
 A number of identical copies of a string, but with random mutations propagating through it like a word ladder puzzle, is pretty close to best-case for BWT-based compressors.But Bzip2 is also a pretty bad BWT-based compressor. Not only does it use block sizes from a time when 8mb memory was a lot, it does silly things which doesn't help compression at all.
semiinfinitely145 days ago
FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
- boothby145 days ago
 Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.
 - dwattttt145 days ago
 This is a presentation problem, or possibly a lack of tooling problem.A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.Undocumented formats are a pain, whether they're in text or binary.
 - boothby144 days ago
 > or possibly a lack of tooling problem.It's a lack of tooling problem. Because if you're a bioinformatics researcher, you want to devote your time, money and energy towards bioinformatics. You don't want to spend weeks getting tooling written to handle an arcane file format, nor pay for that tooling, nor hire a "brilliant" programmer. That tooling needs to be written, packaged and maintained for perhaps dozens of programming languages.Instead, you want to use the format that can be read and written by a rank novice with a single programming course under their belt, because that's what makes the field approachable by dewey-eyed undergrads eager to get their feet wet. Giving those folks an easy on-ramp is how you grow the field.And then you want to compress that format with bog-standard compression algorithms, and you might get side-tracked investigating how to improve that process without exploding your bioinformatics-focused codebase. Which is an interesting show of curiosity, not a reason to insult a class of scientists.There's also a distribution problem. When people break history, by introducing new file formats, or updating old file formats, that impedes archival and replication. And once you've got a handful of file formats, now you've got the classic n+1 problem where no single format is optimal in all ways so people are always inventing new formats to see what sticks. And now an archivist needs to maintain tooling with an ever-increasing overhead. Here we see a clash of wisdom versus intellect, and if you're trying to foster a healthy field of research, wisdom wins the long game.
 - dwattttt143 days ago
 > Instead, you want to use the format that can be read and written by a rank novice with a single programming course under their belt, because that's what makes the field approachable by dewey-eyed undergrads eager to get their feet wet. Giving those folks an easy on-ramp is how you grow the field.This is good, but you're implicitly saying there's a tradeoff made to achieve this, because otherwise we wouldn't be talking about alternatives. And an on-ramp for novices is good, but I don't see a step where the standards of the field move on from what novices can handle.
- semiinfinitely145 days ago
 other file formats that rival fasta in stupidity include fastq pdb bed sam cram vcf. further reading [1]> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"1. <a href="https://madhadron.com/science/farewell_to_bioinformatics.html" rel="nofollow">https://madhadron.com/science/farewell_to_bioinformatics.htm...</a>
 - jakobnissen145 days ago
 SAM is not a bad file format. What's bad about SAM?
 - optionalsquid145 days ago
 I don't dislike the format, and it is much, much better than what it replaced, but SAM, and its binary sister-format BAM, does have some flaws:- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
 - jakobnissen145 days ago
 True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.
- fwip145 days ago
 It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.And, despite its simplicity, it has worked for forty years.
 - michaelhoffman145 days ago
 When FASTA was invented in 1985, generally sequencing reads would be about half that.The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files.A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.
 - attractivechaos145 days ago
 FASTA was invented in late 1980s. At that time, unix tools often limited line length. Even in early 2000s, some unix tools (on AIX as I remember) still had this limit.
 - melagonster144 days ago
 Yes, If someone want, they can do many analyses by grep!
- totalperspectiv145 days ago
 > a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
- Fraterkes145 days ago
 I’ll do you the immense favor of taking the bait. What’s so bad about it?
 - jszymborski145 days ago
 It's a fine format for what it is.A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy.If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db.People forget FASTA was from 1985, and it sticks around because (1) it's easy to parse and write (2) we have mountains of sequences in that format going back 4 decades.[O] <a href="https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b582" rel="nofollow">https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...</a>[1] <a href="https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/IndexedFastaIO.html" rel="nofollow">https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...</a>
- StillBored145 days ago
 I think the prevalence of the format vs something more widely used should be part of that metric.On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet killing level of poor choices.
leobuskin145 days ago
What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?
- bede145 days ago
 Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.
 - tecleandor145 days ago
 From what I've read (although I haven't tested and I can't find my source from when I read it), dictionaries aren't very useful when dataset is big, and just by using '--long' you can cover that improvement.Have any of you tested it?
 - leobuskin145 days ago
 I don’t think the size of content matters, it’s all about patterns (and their repetitiveness) within, and FASTA is a great target, if I understand the format correctly
 - kevincox145 days ago
 Size of content does matter. Because the start of your content effectively builds up a dictionary. Once you have built a custom dictionary from the content you are compressing the initial dictionary is no longer relevant. So a dictionary effectively only helps at the start of the content. Exactly what "start" means will depend on your data and the algorithm but as the size of the data you compress grows the relative benefit of the dictionary drops.Or another way to look at this is that the total bytes saved of the dictionary will plateau. So your dictionary may save 50% of the first MB, 10% of the next MB and 5% of the rest of the first 10MB. It matters a lot if you are compressing 2MB of data (7% savings!) but not so much if you are compressing 1GB (<1%).
 - privatelypublic145 days ago
 I tried building a zstd dictionary for something(compressing many instances of the same Mono(.net) binary serialized class, mostly identical), and in this case it provided no real advantage. Honestly, I didn't dig into it too much, but will give --long a try shortly.PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in imhex- especially if I can find an existing Pattern File.
keketi145 days ago
When you know you're going to be compressing files of particular structure, it's often very beneficial to tweak compression algorithm parameters. In one case when dealing with CSV data, I was able to find a LZMA2 compression level, dictionary size and compression mode that yielded a massive speedup, uses 1/100th the memory and surprisingly even yields better compression ratios, probably from the smaller dictionary size. That's in comparison to the library's default settings.
- ciupicri144 days ago
  Could you please provide more details, perhaps give an example?
IshKebab145 days ago
Damn surely you stop using ASCII formats before your dataset gets to 2 TB??
- rurban145 days ago
 Ha. it gets worse. Search engines or blacklist processors often use gigantic url lists, which are stored as plain ASCII, which is then fed into a perfect hash generator, which accesses those url's unordered. I.e. they need to create a second ordering index to access the urllist. The perfect hashing guys are mathematicians and so they don't care because their definition of a mphf (minimal perfect hash function) is just a random ordering of unique indices, but they don't care to store the ordering also. So we have ASCII and no index.
- bede145 days ago
 BAM format is widely used but assemblies still tend to be generated and exchanged in FASTA text. BAM is quite a big spec and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards etc.)e.g. <a href="https://github.com/ArcInstitute/binseq" rel="nofollow">https://github.com/ArcInstitute/binseq</a>
- hhh145 days ago
 no, I power thru indefinitely with no recourse
- amelius145 days ago
 People rely on compression for that ;)
totalperspectiv145 days ago
Removing the wrapping newline from the FASTA/FASTQ convention also dramatically improves parsing perf when you don't have to do as much lookahead to find record ends.
- Gethsemane145 days ago
 Unfortunately, when you write a program that doesn't wrap output FASTAs, you have a bunch of people telling you off because SOME programs (cough bioperl cough) have hard limits on line length :)
 - sharedptr144 days ago
 Is BioPerl still standard, did people move to BioPython?When I was shown BioPerl I was tempted to write a better, C++ version, but was overwhelmed by other university stuff and let it go.
 - o11c145 days ago
 You can use content-defined chunking to wrap at a predictable place so that compression still works.
- bede145 days ago
 Thanks for reminding me to benchmark this!
 - totalperspectiv145 days ago
 I've only tested this when writing my own parser where I could skip the record end checks, so idk if this improves perf on a existing parser. Excited to see what you find!
lutusp145 days ago
> I speculated that this poor performance might be caused by the newline bytes (0x0A) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching.If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
pkilgore145 days ago
To me the most interesting thing here isn't that you can compress something better by removing randomly-distributed semantically-meaningless information. It's why zstd --long does so much better than gzip when you do and the default does worse than gzip.What lessons can we take from this?
- cogman10145 days ago
 Why it does worse than gzip isn't something that I know. Why --long is so efficient is likely a result of evolution of all things :). A lot of things have common ancestors which means shared genetic patterns across species. --long allows zstd to see a 2gb window of data which means it's likely finding all those genetic similarities across species.Endogenous retroviruses [1] are interesting bits of genetics that helps link together related species. A virus will inject a bit of it's genetics into the host which can effectively permanently scar the host's DNA and all their offspring's DNA.[1] <a href="https://en.wikipedia.org/wiki/Endogenous_retrovirus" rel="nofollow">https://en.wikipedia.org/wiki/Endogenous_retrovirus</a>
diimdeep145 days ago
What's current way to accessibly process my 23andme raw data ? It's been synthesized decade ago and SNPedia and Promethease seems abandoned, so what's alternative if there is, and if there is none how we arrived to this?
- iijj145 days ago
  I was no longer on the scene when it happened, but I’ve been told it became very difficult to get ongoing funding from the National Science Foundation for bioinformatics software around 10 years ago. You could get an initial grant to develop something, but ongoing support was difficult. So websites and ‘databases’ (curated datasets) that made it easy to run the tools faded away.
- vintermann144 days ago
  What format is the 23andMe data in, by the way?
  - diimdeep144 days ago
    tab delimited file (usually compressed for distribution), containing the fields rsid, chromosome, position, genotype (e.g. rs3094315 1 742429 AG).
a_bonobo144 days ago
There's some discussion here about DNA-specific compression algorithms.I thought I'd raise yesterday's HN discussion on 'The unreasonable effectiveness of modern sort algorithms' <a href="https://news.ycombinator.com/item?id=45208828">https://news.ycombinator.com/item?id=45208828</a>That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 numbers. I guess DNA has 5 - A,T,G,C,N the unknown base - but there's a huge space of DNA-specific compression research that outperforms ZSTD.
rini17148 days ago
This might in general be a good preprocessing step to check for punctuation repeating in fixed intervals and remove it, and restore after decompression.
- vintermann145 days ago
 That turns in into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.
 - rini17144 days ago
 There are plenty of data formats where data is interspersed with fixed delimiters in fixed intervals.
 - vintermann143 days ago
 In fixed intervals? I'm not so sure about that. Generally if the intervals are fixed, you shouldn't need delimiters, you know where one thing begins and another ends anyway.Anyway, BWT-based compressors like Bzip2 do a good job on "repetition, but with random differences". Better than LZ-based compressors. However, they are not competitive on speed, and it's gotten relatively worse as computers got faster since the Burrows-Wheeler transform can't be parallelized very well and is inherently cache-unfriendly.
 - rini17142 days ago
 All kinds of data - block justified text, database files with fixed row size tables, even HTTP chunked encoding tends to have blocks of same size with same delimiters,...I really don't see how better supporting this "second order repetition" feature in the encoding would cause such a big problem. LZ variants already track repeating strings.
- bede147 days ago
 Yes, it sounds like 7-Zip/LZMA can do this using custom filters, among other more exotic (and slow) statistical compression approaches.
FL33TW00D145 days ago
Looking forward to the relegation of FASTQ and FASTA to the depths of hell where they belong. Incredibly inefficient and poorly designed formats.
- jefftk145 days ago
 How so? As long as you remove the hard wrapping and use compression aren't they in the same range as other options?(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
 - fwip145 days ago
 There's a few options out there that have noticeably better compression, with the downside of being less widely-compatible with tools. zstd also has the benefit of being very fast (depending on your settings, of course).CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.There are also other fasta/fastq specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.
 - FL33TW00D145 days ago
 <a href="https://www.biorxiv.org/content/10.1101/2025.04.08.647863v1.full.pdf" rel="nofollow">https://www.biorxiv.org/content/10.1101/2025.04.08.647863v1....</a>
 - optionalsquid145 days ago
 The fact that these formats are unable to represent degenerate bases (Ns in particular, but also the remaining IUPAC bases), in my experience renders them unusable for many, if not most, use-cases, including for the storage of FASTQ data
 - dwattttt145 days ago
 The question of how to represent things not specified in the original format is a tough one.At the loosest end a format can leave lots of space for new symbols, and you can just use those to represent something new. But then not everyone agrees on what the new symbol means, and worse multiple groups can use symbols to mean different things.On the other end of the spectrum, you can be strict about the format, and not leave space for new symbols. Then to represent new things you need a new standard, and people to agree on it.It's mostly a question of how well code can be updated and agreed upon, how strict you can require your tooling to be w.r.t. formats.
 optionalsquid144 days ago
 The original FASTA/Pearson format and fasta/tfasta tools have supported 'N' for ambiguous nucleotides since at least 1996 [1], and the FASTQ format has to my knowledge always supported 'N' bases (i.e. since around 2000). IUPAC codes themselves date back to 1970 [2]. You can probably get away with not supporting the full range of IUPAC nucleotide codes, but not supporting 'N' makes your tool unusable to represent what is probably the majority of available FASTA/FASTQ data[1] See 'release.v16' in the fasta2 release at <a href="https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta_versions.html" rel="nofollow">https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta_versi...</a>[2] <a href="https://iupac.qmul.ac.uk/misc/naabb.html" rel="nofollow">https://iupac.qmul.ac.uk/misc/naabb.html</a>
 melagonster144 days ago
 The problem is IUPAC just exists.
dekhn145 days ago
I've explored alternatives to FASTA and FASTQ but in most cases I found that simply not storing sequence data is the best option of all, but if I have to do it, columnar formats with compression are usually the best alternative when considering all of (my) the constraints.
im3w1l145 days ago
As someone with an idle interest in data compression, ss it possible to download the original dataset somewhere to play around with? Or rather a like 20gb subset of it.
- fwip145 days ago
 The article links to the dataset here: <a href="https://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/" rel="nofollow">https://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/</a>
 - im3w1l145 days ago
 Ahh, I didn't notice that there were two adjacent links. I must have clicked the first one and gotten the article rather than the ftp.
meel-hd145 days ago
<a href="https://github.com/meel-hd/DNA" rel="nofollow">https://github.com/meel-hd/DNA</a>
nickdothutton145 days ago
How can we represent data or algos such that such optimisations before more obvious?
Kim_Bruning145 days ago
Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (it might also depend on genome?)
- dwattttt145 days ago
 The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
 - hyghjiyhu145 days ago
 I think one important factor you missed to account for is frameshifting. Compression algorithms work on bytes - 8 bits. Imagine that you have the exact same sequence but they occur at different offsets mod 4. Then your encoding will give completely different results, and the compression algorithm will be unable to make use of the repetition.
 - dwattttt145 days ago
 I was actually under the impression compression algorithms tend to work over a bitstream, but I can't entirely confirm that.
 - Sesse__145 days ago
 Some are bit-by-bit (e.g. the PPM family of compressors[1]), but the normal input granularity for most compressors is a byte. (There are even specialized ones that work on e.g. 32 bits at a time.)[1] Many of the context models in a typical PPM compressor will be byte-by-byte, so even that isn't fully clear-cut.
 - bede144 days ago
 A Zstd maintainer clarified this: <a href="https://news.ycombinator.com/item?id=45251544">https://news.ycombinator.com/item?id=45251544</a>> Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses
 - vintermann145 days ago
 They output a bitstream, yeah but I don't know of anything general purpose which effectively consumes anything smaller than bytes (unless you count various specialized handlers in general-purpose compression algorithms, e.g. to deal with long lists of floats)
- vintermann145 days ago
 This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.
 - bede145 days ago
 Exactly. The line breaks break the runs of otherwise identical bits in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long range matching are different for otherwise identical subsequences.
 - amelius145 days ago
 And the compressor does not think: "how can I make these two sequences align better without wasting a lot of space?"
 - ebolyen145 days ago
 No, because alignment, in the general case, is O(n^2). It is ironically one of the more tractable and well solved problems in bioinformatics.
 - tiagod145 days ago
 The compressor doesn't think about anything. Also, Zstd doesn't have the goal of reaching the highest possible compression ratio. It's more geared toward lowest overhead, high bandwidth compress/decompress.