3 comments

  • neilv 1 hour ago
    It's impressive that wxpath implements its DSL as an extension of XPath syntax. I hadn't quite thought of it that way.

    I routinely relied on a mix of XPath and arbitrary code for Web scraping (as implied in the intro for "https://docs.racket-lang.org/html-parsing/").

    Then I made some DSLs for expressing some of the common scraping coding patterns more concisely and declaratively, but the DSLs ended up in a Lisp-y syntax, not looking like XPath.
    • rodricios 1 hour ago
      Making wxpath an extension of the XPath DSL was a key goal of mine.

      The hard part was ensuring the syntax looked and felt as XPath-y as possible.

      Open to any feedback wrt the syntax and semantics!
  • rodricios 2 hours ago
    Hey, wxpath author here. It's pretty cool seeing this project reach the front page a week after posting it.

    Just wanted to mention a few things.

    wxpath is the result of a decade of working on and thinking about web crawling and scraping. I created two somewhat popular Python web-extraction projects a decade ago (eatiht and libextract), and even helped publish a meta-analysis on scrapers, all heavily relying on lxml/XPath.

    After finding some time on my hands, and after a hiatus from actually writing web scrapers, I decided to return to this little problem domain.

    Obviously, LLMs have proven to be quite formidable at web content extraction, but they run into the now-familiar issues of token limits and cost.

    Besides LLMs, there have been some great projects making real progress on the problem of web data extraction, like the Scrapy and Crawlee frameworks, and projects like Ferret (https://www.montferret.dev/docs/introduction/) - another declarative web crawling framework - and others (Xidel, https://github.com/benibela/xidel).

    The common abstraction shared by most web-scraping frameworks and tools is "node selectors" - the syntax and engine for extracting nodes and their data.

    XPath has proven resilient and continues to be a popular node-selection and processing language. However, what it lacks, and what other frameworks provide, is crawling.

    wxpath is an attempt to fill that gap.

    Hope people find it useful!

    https://github.com/rodricios/eatiht
    https://github.com/datalib/libextract
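
    To make the "XPath selects, but doesn't crawl" gap concrete, here is a minimal sketch using only Python's standard library. It is not wxpath's API or implementation: the in-memory `PAGES` dict (standing in for HTTP fetches), the `crawl` function, and the depth limit are all assumptions made up for illustration. XPath-style selectors pick out the nodes; the link-following loop has to be written by hand.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory "site": URL -> well-formed XHTML.
# A real crawler would fetch these over HTTP and parse messy HTML.
PAGES = {
    "/": "<html><body><h1>Home</h1><a href='/a'>A</a><a href='/b'>B</a></body></html>",
    "/a": "<html><body><h1>Page A</h1><a href='/b'>B</a></body></html>",
    "/b": "<html><body><h1>Page B</h1><a href='/'>Home</a></body></html>",
}


def crawl(start, max_depth=2):
    """Breadth-first crawl: selectors extract nodes, plain Python follows links."""
    seen = {start}
    queue = deque([(start, 0)])
    titles = {}
    while queue:
        url, depth = queue.popleft()
        root = ET.fromstring(PAGES[url])
        # Node selection: ElementTree supports a small XPath subset.
        titles[url] = root.findtext(".//h1")
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        # Crawling: this part is outside what XPath itself expresses.
        for a in root.findall(".//a[@href]"):
            href = a.get("href")
            if href in PAGES and href not in seen:
                seen.add(href)
                queue.append((href, depth + 1))
    return titles


result = crawl("/")
```

    Here `result` maps each visited URL to its `h1` text. The queue, the visited set, and the depth limit are the bookkeeping that a crawl-aware selector language would have to take over.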
  • css_apologist 5 hours ago
    xpath is so fucking cool

    i can understand why it failed for general use, but shit like this revives my excitement

    q: i'm not an expert, but this looks like it extends xpath syntax? i haven't seen stuff like the /map. is this referring to the html map element, or an fp-style map?
    • rodricios 5 hours ago
      I think xpath is cool too!

      If wxpath can help revive some of that excitement, then I consider my project a success.

      As for your question, while wxpath does extend the xpath syntax, `/map` is not one of its additions, nor is it an HTML map element.

      XPath 3.1 introduced first-class maps (and arrays) (https://www.w3.org/TR/xpath-31/#id-maps), and `/map` is the syntax to create said structure. It's an awesome feature that's especially useful for quickly delivering JSON-like objects.
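
      For anyone who hasn't seen XPath 3.1 maps: a constructor along the lines of `map { 'title': string(//h1), 'links': array { //a/@href/string() } }` evaluates to a JSON-like value. Python's standard library only covers a subset of XPath 1.0, so the sketch below emulates that result with a plain dict rather than evaluating a real map. The `page_to_map` helper and the sample document are hypothetical, and this is not how wxpath does it.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample page, kept well-formed so ElementTree can parse it.
DOC = "<html><body><h1>Hello</h1><a href='/x'>x</a><a href='/y'>y</a></body></html>"


def page_to_map(xhtml):
    """Emulate the shape of an XPath 3.1 map: keys bound to selected values."""
    root = ET.fromstring(xhtml)
    return {
        # roughly: 'title': string(//h1)
        "title": root.findtext(".//h1"),
        # roughly: 'links': array { //a/@href/string() }
        "links": [a.get("href") for a in root.findall(".//a[@href]")],
    }


record = page_to_map(DOC)
as_json = json.dumps(record)  # the map serializes straight to JSON
```

      The point is only the shape of the output: one expression yields a keyed record per page, which is why maps are handy for emitting JSON-like objects.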
    • jerf 4 hours ago
      XPath may have "failed" for general use, but it's well-enough supported that I've been able to find a library in the common languages I've used whenever I went looking for one. In some ways the hard part is just knowing it exists so you can use it if you need it.
      • rodricios 3 hours ago
        Couldn't agree more.

        I should also add that most (Python-based) web crawling and scraping frameworks support XPath engines OOTB: Scrapy, Crawlee, etc. In that sense, XPath is very much alive.
    • rhdunn 5 hours ago
      Maps were added in XPath 3.1 -- https://www.w3.org/TR/xpath-31/#id-maps.

      There's currently work on XPath 4.0 -- https://qt4cg.org/specifications/xquery-40/xpath-40.html.