<p><pre><code> 74910,74912c187768,187779
< [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
< corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954
\251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
< std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
---
>
> [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
> corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
>
> § D.27.2
> 1954
>
> © ISO/IEC
> N4950
>
> wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
> std::string mbstring = myconv.to_bytes(L"Hello\n");
</code></pre>
It is indeed faster, but the output is messier, and unlike mutool it doesn't handle Unicode. (That probably also explains the big speed boost.)
I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads.<p>~41K pages/sec peak throughput.<p>Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.<p>~5,000 lines, no dependencies, compiles in <2s.<p>Why it's fast:<p><pre><code> - Memory-mapped file I/O (no read syscalls)
- Zero-copy parsing where possible
- SIMD-accelerated string search for finding PDF structures
- Parallel extraction across pages using Zig's thread pool
- Streaming output (no intermediate allocations for extracted text)
</code></pre>
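To make the "parallel extraction across pages" point concrete, here is a minimal Python sketch of the general pattern, not the project's Zig code. `extract_page` is a hypothetical stand-in for a real per-page extractor, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_bytes: bytes) -> str:
    # Hypothetical stand-in for a real per-page extractor:
    # it only decodes the bytes and trims surrounding whitespace.
    return page_bytes.decode("latin-1").strip()


def extract_all(pages: list[bytes], workers: int = 4) -> list[str]:
    # Executor.map yields results in input order, so page order is
    # preserved even when individual pages finish out of order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page, pages))
```

In CPython with the GIL, threads like these would not speed up CPU-bound pure-Python work; the sketch only shows how per-page tasks can be fanned out while keeping the output in page order.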
What it handles:<p><pre><code> - XRef tables and streams (PDF 1.5+)
- Incremental PDF updates (/Prev chain)
- FlateDecode, ASCII85, LZW, RunLength decompression
- Font encodings: WinAnsi, MacRoman, ToUnicode CMap
- CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)</code></pre>
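As an illustration of the "UTF-16BE with surrogate pairs" item, the sketch below shows how a hex destination string, of the kind that appears in a ToUnicode CMap, can be decoded. The function names are hypothetical and this is Python rather than the project's Zig, so treat it as a sketch of the arithmetic only.

```python
def decode_tounicode_hex(hex_string: str) -> str:
    """Decode a hex string (e.g. a ToUnicode CMap destination) as UTF-16BE."""
    raw = bytes.fromhex(hex_string)
    # The utf-16-be codec combines a high/low surrogate pair into one code point.
    return raw.decode("utf-16-be")


def combine_surrogates(high: int, low: int) -> int:
    # Manual version of the same arithmetic, for illustration.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

An extractor that treats each 16-bit unit as its own character would emit two invalid code points for anything outside the Basic Multilingual Plane, which is one way Unicode output can go wrong.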
FWIW - mupdf is simply not fast. I've done lots of pdf indexing apps, and mupdf is by far the slowest and least able to open valid pdfs when it came to text extraction. It also takes <i>tons</i> of memory.<p>A better speed comparison would be multi-process pdfium (since pdfium was forked from foxit before multi-thread support, you can't thread it), multi-threaded foxit, or something like syncfusion (which is quite fast and supports multiple threads). Or even single-thread pdfium vs. single-thread your-code.<p>These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.<p>The other thing you don't appear to mention is whether you put the words in reading order (i.e. how they appear on the page), or only stream order (which varies in its relation to appearance order).<p>If it's only stream order, sure, that's really fast to do. But it's also not anywhere near as helpful as reading order, which is what other text-extraction engines do.<p>Looking at the code, it looks like the code to do reading order exists, but is not what is being benchmarked or used by default?<p>If so, this is really comparing apples and oranges.
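To illustrate the stream-order versus reading-order distinction raised above, here is a deliberately simple Python sketch. The `Span` type and the `line_tolerance` value are assumptions made for the example; real engines also have to deal with columns, rotation, and tables, which this ignores.

```python
from typing import NamedTuple


class Span(NamedTuple):
    x: float    # left edge, in page units
    y: float    # baseline; PDF user space has y growing upward
    text: str


def reading_order(spans: list[Span], line_tolerance: float = 2.0) -> list[str]:
    # Visit spans top to bottom (descending y) and group those whose
    # baselines are within line_tolerance of the line's first span.
    lines: list[list[Span]] = []
    for span in sorted(spans, key=lambda s: -s.y):
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, order spans left to right.
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x))
            for line in lines]
```

Stream order is simply the order in which spans were emitted in the content stream, so it needs no sorting or grouping at all, which is why it is so much cheaper to produce.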
What kind of performance are you seeing with/without SIMD enabled?<p>From <a href="https://github.com/Lulzx/zpdf/blob/main/src/main.zig" rel="nofollow">https://github.com/Lulzx/zpdf/blob/main/src/main.zig</a> it looks like the help text cites an unimplemented "-j" option to enable multiple threads.<p>There is a "--parallel" option, but that is only implemented for the "bench" command.
You've released quite a few projects lately, very impressive.<p>Are you using LLMs for parts of the coding?<p>What's your work flow when approaching a new project like this?
Claude Code.
> Are you using LLMs for parts of the coding?<p>I can't talk about the code, but the readme and commit messages are most likely LLM-generated.<p>And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.
> I built<p>You didn't. Claude did. Like it did write this comment.<p>And you didn't even bother testing it before submitting, which is insulting to everyone.
What's fast about mmap?
Two big advantages:<p>You avoid an unnecessary copy. Normal read system call gets the data from disk hardware into the kernel page cache and then copies it into the buffer you provide in your process memory. With mmap, the page cache is mapped directly into your process memory, no copy.<p>All running processes share the mapped copy of the file.<p>There are a lot of downsides to mmap: you lose explicit error handling and fine-grained control of when exactly I/O happens. Consult the classic article on why sophisticated systems like DBMSs do not use mmap: <a href="https://db.cs.cmu.edu/mmap-cidr2022/" rel="nofollow">https://db.cs.cmu.edu/mmap-cidr2022/</a>
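A small Python sketch of the difference described above, assuming a throwaway temporary file whose contents are illustrative and not a valid PDF. With `read()`, the file is copied into a buffer owned by the process; with `mmap`, the search runs over the mapped pages directly.

```python
import mmap
import os
import tempfile


def find_with_read(path: str, needle: bytes) -> int:
    # read() copies the whole file into a bytes object in this process.
    with open(path, "rb") as f:
        return f.read().rfind(needle)


def find_with_mmap(path: str, needle: bytes) -> int:
    # mmap maps the file's pages into the address space; rfind scans them
    # without first copying the file into a Python bytes object.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.rfind(needle)


def demo() -> tuple[int, int]:
    # Build a tiny stand-in file; the contents are not a valid PDF.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"%PDF-1.7\n...body...\nstartxref\n123\n%%EOF\n")
        path = tmp.name
    try:
        return (find_with_read(path, b"startxref"),
                find_with_mmap(path, b"startxref"))
    finally:
        os.remove(path)
```

Both functions return the same offset; the difference is in where the bytes live while the search runs, and in the fact that an I/O failure in the mapped version surfaces as a fault rather than as a return value from a read call.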
<i>you lose explicit error handling</i><p>I've never had to use mmap, but this has always been the issue in my head. If you're treating I/O as memory pages, what happens when you read a page and it needs to "fault" by reading the backing storage, but the storage fails to deliver? What can be said at that point, or does the program crash?
This is a very interesting link. I didn't expect mmap to be less performant than read() calls.<p>I now wonder which use cases would mmap suit better - if any...<p>> All running processes share the mapped copy of the file.<p>So something like building linkers that deal with read only shared libraries "plugins" etc ..?
It allows the program to reference memory without having to manage it in heap space. It would make the program faster in a memory-managed language; otherwise it would reduce the program's memory footprint.
What’s the fidelity like compared to tika?
Test it on major PDF corpora[1]<p>[1] <a href="https://github.com/pdf-association/pdf-corpora" rel="nofollow">https://github.com/pdf-association/pdf-corpora</a>
Is there the possibility to hook in OCR for text blocks flattened into an image, maybe with some callback? That’s my biggest gripe with dealing with PDFs.
very nice, it'd be good to see a feature comparison as when I use mupdf it's not really just about speed, but about the level of support of all kinds of obscure pdf features, and good level of accuracy of the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.<p>the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT<p>python bindings would be good too
added a comparison, will improve further. <a href="https://github.com/Lulzx/zpdf?tab=readme-ov-file#comparison-with-mupdf" rel="nofollow">https://github.com/Lulzx/zpdf?tab=readme-ov-file#comparison-...</a><p>also, added python bindings.
thanks, claude, I guess haha<p>as others have commented, I think while this is a nice portfolio piece, I would worry about its longevity as a vibe coded project
Impressive performance gains! 5x faster than MuPDF is significant, especially for applications processing large volumes of PDFs. Zig's memory safety without garbage collection overhead makes it ideal for this kind of performance-critical work.<p>I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.<p>Would be interesting to see benchmarks on different PDF types - academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.
These vibe coded tests are terrible:<p><a href="https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zpdf.py" rel="nofollow">https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...</a>
Excellent stuff. What makes Zig so fast?
It makes your development workflow smooth enough that you have the time and energy to do stuff like all the bullet points listed in <a href="https://news.ycombinator.com/item?id=46437289">https://news.ycombinator.com/item?id=46437289</a>
>you have the time and energy to do stuff like all the bullet points listed<p>I don't disagree, but in this specific case, per the author, the project was made via Claude Code. Although it could just as well be that Zig is a better LLM target. I've noticed many new vibe projects choose Zig as the target.
Not being slow - they compile straight to bytecode, they aren't interpreted, and have aggressive, opinionated optimizations baked in by default, so it's even faster than compiled c (under default conditions.)<p>Contrasted with python, which is interpreted, has a clunky runtime, minimal optimizations, and all sorts of choices that result in slow, redundant, and also slow, performance.<p>The price for performance is safety checks, redundancy, how badly wrong things can go, and so on.<p>A good compromise is luajit - you get some of the same aggressive optimizations, but in an interpreted language, with better-than-c performance but interpreted language convenience, access to low level things that can explode just as spectacularly as with zig or c, but also a beautiful language.
Zig is safer than C under default conditions, not faster. By default it does a lot of illegal-behavior safety checking, such as array and slice bounds checking, numeric overflow checking, and invalid union access checking. These checks are disabled by certain (non-default) build modes, or can be explicitly disabled at a per-scope level.<p>It may be easier to write code that runs faster in Zig than in C under similar build optimization levels, because writing high-performance C code looks a lot like writing idiomatic Zig code. The Zig standard library offers a lot of structures like hash maps, SIMD primitives, and allocators with different performance characteristics to better fit a given use case. C application code often skips these things simply because they involve a lot more friction in C than in Zig.
> they compile straight to bytecode<p>machine code, not <a href="https://en.wikipedia.org/wiki/Bytecode" rel="nofollow">https://en.wikipedia.org/wiki/Bytecode</a><p>> The price for performance is safety checks<p>In Zig, non-ReleaseFast build modes have significant safety checks.<p>> luajit ... with better-than-c performance<p>No.
will add this to the list, now learning new languages is less of a barrier with LLMs
Now we just need Python bindings so I can use it in my trash language of choice.
added python bindings!
Were you working on it already, or did it take you less than 17 minutes to commit <a href="https://github.com/Lulzx/zpdf/commit/9f5a7b70eb4b53672c0e4d80b7438e7504066e43" rel="nofollow">https://github.com/Lulzx/zpdf/commit/9f5a7b70eb4b53672c0e4d8...</a> ?
What’s the format that’s perhaps free, easy to parse and render? Build one please.
- First commit 3 hours ago.<p>- commit message: LLM-generated.<p>- README: LLM-generated.<p>I'm not convinced that projects vibe coded over the evening deserve the HN front page…<p>Edit: and of course the author's blog is also full of AI slop…<p>2026 hasn't even started and I already hate it.
Using AI isn't lazier than your regurgitated dismissal, to be fair.
...and it does not work. I tried it on ~10 random pdfs, including very simple ones (e.g. a hello world from typst), it segfaults on every single one.
Wait, but why?<p>If it's really better than what we had before, what does it matter how it was made? It's literally hacked together with the tools of the day (LLMs) isn't that the very hacker ethos? Patching stuff together that works in a new and useful way.<p>5x speed improvements on pdf text extraction might be great for some applications I'm not aware of, I wouldn't just dismiss it out of hand because the author used $robot to write the code.<p>Presumably the thought to make the thing in the first place and decide what features to add and not add was more important than how the code is generated?
> If it's really better than what we had before<p>That's a very big if. The whole point is that what we had before was made slowly. This was made quickly. Slow is not better in itself, but what it typically means is hours and hours of testing, going through painful problems that highlight idiosyncrasies of the problem space: things that are really weird and specific to whatever the tool is trying to address.<p>In such cases we can expect that, with very little time, very few things were tested, let alone tested properly (one comment even mentioned that the tests were also generated). "We", the audience of potentially interested users, then have to do that work (as plenty did commenting on that post).<p>IMHO what you bring forward is precisely that:<p>- can the new "solution" actually pass ALL the tests the previous one did? More?<p>This should be brought to the top so that the actual compromises can be understood; "we" can then decide if it's "better" for our context. In some cases faster with lossy output is actually better, in others absolutely not. The difference between the new and the old solutions isn't binary, and having no visibility on that is what makes such a process nothing more than yet another showcase that LLMs can indeed produce "something" that is absolutely boring while consuming a TON of resources, including our own attention.<p>TL;DR: there should be a test "harness" made by 3rd parties (or taken from the well-known software it is closest to) that an LLM-generated piece of code should pass before actually being compared.
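One possible shape for such a harness, as a hedged Python sketch: score an extractor's output against reference text from a trusted tool and require it to clear a threshold. The function names and the 0.98 threshold are assumptions made for illustration, not an established standard.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Collapse runs of whitespace so layout differences do not dominate the score.
    return " ".join(text.split())


def similarity(candidate: str, reference: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical after whitespace normalization.
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()


def passes(candidate: str, reference: str, threshold: float = 0.98) -> bool:
    # The threshold is an arbitrary assumption; a real harness would set it per corpus.
    return similarity(candidate, reference) >= threshold
```

Run over a shared corpus, a score like this would at least make the speed-versus-fidelity trade-off visible before two tools are compared on throughput alone.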
Tomorrow's headlines<p>fpdf<p>jpdf<p>cpdf<p>cpppdf<p>bfpdf<p>ppdf<p>...<p>opdf