Hey, wxpath author here. It's pretty cool seeing this project reach the front page a week after posting it.

Just wanted to mention a few things.

wxpath is the result of a decade of working on and thinking about web crawling and scraping. I created two somewhat popular Python web-extraction projects a decade ago (eatiht and libextract), and even helped publish a meta-analysis on scrapers, all heavily relying on lxml/XPath.

After finding some time on my hands, and after a hiatus from actually writing web scrapers, I decided to return to this little problem domain.

Obviously, LLMs have proven to be quite formidable at web content extraction, but they run into the now-familiar issues of token limits and cost.

Besides LLMs, several projects have made great progress on web data extraction: the Scrapy and Crawlee frameworks, Ferret (https://www.montferret.dev/docs/introduction/), another declarative web crawling framework, Xidel (https://github.com/benibela/xidel), and others.

The common abstraction shared by most web-scraping frameworks and tools is "node selectors": the syntax and engine for selecting nodes and extracting their data.

XPath has proven resilient and remains a popular node-selection and processing language. What it lacks, and what the frameworks above provide, is crawling.

wxpath is an attempt to fill that gap. (There's a rough sketch of the usual hand-rolled alternative at the end of this comment.)

Hope people find it useful!

https://github.com/rodricios/eatiht
https://github.com/datalib/libextract
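
To make the "node selection vs. crawling" point concrete, here's roughly what you end up hand-rolling today with plain lxml/XPath plus an HTTP client. This is a minimal sketch, not wxpath's actual syntax; the function name, XPath expressions, and example URL are just illustrative.

    import requests
    from lxml import html

    def crawl(start_url, link_xpath, data_xpath, max_pages=10):
        """Follow links matching link_xpath, yield nodes matching data_xpath."""
        seen, queue = set(), [start_url]
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            doc = html.fromstring(requests.get(url, timeout=10).text)
            doc.make_links_absolute(url)           # resolve relative hrefs
            yield from doc.xpath(data_xpath)       # node selection: the solved part
            queue.extend(doc.xpath(link_xpath))    # crawling: the hand-rolled part

    # e.g. follow pagination links and pull article titles
    for title in crawl("https://example.com/blog",
                       link_xpath='//a[@rel="next"]/@href',
                       data_xpath='//article//h2/text()'):
        print(title)

Every scraping project ends up rewriting some version of that loop around its selectors; that loop is the part wxpath tries to move into the query language itself.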