Show HN: wxpath – Declarative web crawling in XPath

(github.com)

64 points by rodricios 23 days ago

3 comments

rodricios 17 days ago
Hey, wxpath author here. It's pretty cool seeing this project reach the front page a week after posting it.

Just wanted to mention a few things.

wxpath is a result of a decade of working on and thinking about web crawling and scraping. I created two somewhat popular Python web-extraction projects a decade ago (eatiht and libextract), and even helped publish a meta-analysis on scrapers, all heavily relying on lxml/XPath.

After finding some time on my hands, and after a hiatus from actually writing web scrapers, I decided to return to this little problem domain.

Obviously, LLMs have proven to be quite formidable at web content extraction, but they run into the now-familiar issues of token limits and cost.

Besides LLMs, there have been some great projects making great progress on the problem of web data extraction, like the Scrapy and Crawlee frameworks, and projects like Ferret (https://www.montferret.dev/docs/introduction/) - another declarative web crawling framework - and others (Xidel, https://github.com/benibela/xidel).

The common abstraction shared by most web-scraping frameworks and tools is "node selectors" - the syntax and engine for extracting nodes and their data.

XPath has proven resilient and continues to be a popular node-selection and processing language. However, what it lacks, and what other frameworks provide, is crawling.

wxpath is an attempt to fill that gap.

Hope people find it useful!

https://github.com/rodricios/eatiht
https://github.com/datalib/libextract
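
To make that gap concrete: a node selector only answers "which nodes on this page?", while a crawler also has to follow links and repeat the selection on each new page. Below is a minimal Python sketch of that loop. It is not wxpath's API or implementation. It assumes a hypothetical in-memory `PAGES` dict in place of real HTTP fetches, well-formed markup, and the limited XPath subset in Python's standard library; the `crawl` function and its parameters are made up for illustration.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup (stands in for HTTP fetches).
PAGES = {
    "/index": '<html><body><h1>Home</h1><a href="/a">A</a><a href="/b">B</a></body></html>',
    "/a": '<html><body><h1>Page A</h1><a href="/b">B</a></body></html>',
    "/b": '<html><body><h1>Page B</h1><a href="/index">Home</a></body></html>',
}

def crawl(start, link_xpath=".//a", data_xpath=".//h1", max_pages=10):
    """Breadth-first crawl: extract data with one path, follow links found by another."""
    seen = {start}
    queue = deque([start])
    results = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        root = ET.fromstring(PAGES[url])
        # Extraction step: plain node selection on the current page.
        for node in root.findall(data_xpath):
            results.append((url, node.text))
        # Crawling step: the part a node selector alone has no notion of.
        for anchor in root.findall(link_xpath):
            href = anchor.get("href")
            if href in PAGES and href not in seen:
                seen.add(href)
                queue.append(href)
    return results

titles = crawl("/index")
print(titles)
```

The `seen` set keeps the loop from revisiting pages that link back to each other, and `max_pages` bounds the crawl; a declarative tool would presumably express the same two steps inside a single expression.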
neilv 17 days ago
It's impressive that wxpath does the DSL as an extension of XPath syntax. I hadn't quite thought of it that way.

I routinely used a mix of XPath and arbitrary code heavily for Web scraping (as implied in the intro for "https://docs.racket-lang.org/html-parsing/").

Then I made some DSLs for doing some of the common scraping coding patterns more concisely and declaratively, but the DSLs ended up in a Lisp-y syntax, not looking like XPath.
- rodricios 17 days ago
 Making wxpath an extension of the XPath DSL was a key goal of mine.

 The hard part was ensuring the syntax looked and felt as XPath-y as possible.

 Open to any feedback wrt the syntax and semantics!
css_apologist 17 days ago
xpath is so fucking cool

i can understand why it failed for general use, but shit like this revives my excitement

q: i'm not an expert, this looks like it extends xpath syntax? haven't seen stuff like the /map

is this referring to the html map element? or a fp-style map?
- rodricios 17 days ago
 I think xpath is cool too! If wxpath can help revive some of that excitement, then I consider my project a success.

 As for your question, while wxpath does extend the xpath syntax, `/map` is not one of its additions, nor is it an HTML map element.

 XPath 3.1 introduced first-class maps (and arrays) (https://www.w3.org/TR/xpath-31/#id-maps), and `/map` is the syntax to create said structure. It's an awesome feature that's especially useful for quickly delivering JSON-like objects.
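
 For a rough feel of what such a map delivers, here is a minimal Python sketch. It is not wxpath code and not a real XPath 3.1 engine, since Python's standard library only supports a small XPath subset. It emulates an expression along the lines of `//item/map{ 'title': string(title), 'href': string(link/@href) }` by selecting nodes with ElementTree and building one dict per node. The sample document, the key names, and the `items_as_maps` helper are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
DOC = """
<feed>
  <item><title>First</title><link href="/a"/></item>
  <item><title>Second</title><link href="/b"/></item>
</feed>
"""

def items_as_maps(xml_text):
    """Return one JSON-like dict per <item>, mimicking one map per selected node."""
    root = ET.fromstring(xml_text)
    maps = []
    for item in root.findall(".//item"):
        link = item.find("link")
        maps.append({
            # Rough analogue of string(title): text of the child, or "" if absent.
            "title": item.findtext("title", default=""),
            # Rough analogue of string(link/@href): attribute value, or "" if absent.
            "href": link.get("href", "") if link is not None else "",
        })
    return maps

records = items_as_maps(DOC)
print(json.dumps(records, indent=2))
```

 In a real XPath 3.1 engine the map constructor does this inside the expression itself, so no host-language loop is needed.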
 - css_apologist 17 days ago
 sick, ty
- jerf 17 days ago
 XPath may have "failed" for general use but it's generally well-enough supported that I can find a library in the common languages I've used when I went looking for it. In some ways the hard part is just knowing it exists so you can use it if you need it.
 - rodricios 17 days ago
 Couldn't agree more.

 I should also add that most (Python-based) web crawling and scraping frameworks support XPath engines OOTB: Scrapy, Crawlee, etc. In that sense, XPath is very much alive.
- rhdunn 17 days ago
 Maps were added in XPath 3.1 -- https://www.w3.org/TR/xpath-31/#id-maps.

 There's currently work on XPath 4.0 -- https://qt4cg.org/specifications/xquery-40/xpath-40.html.