10 comments

  • sheept 1 hour ago
    > LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

    This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing tag helps the model keep track of where it is during inference, so it is less error-prone.
    • andrew_zhong 55 minutes ago
      Yeah, that's a good observation. XML's closing tags give the model structural anchors during generation, so it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting, the more likely the model is to lose track of brackets.

      We see this especially with arrays of objects where each object has optional nested fields. The model will get 18 items right and then drop a closing bracket on item 19, or emit a field of the wrong type. That's why we put effort into the repair/recovery/sanitization layer: validate field by field and keep what's valid rather than throwing everything out.
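A minimal sketch of the "validate field by field and keep what's valid" idea described above. It is not the library's implementation: the schema, the field names, and the `salvage_item` / `salvage_items` helpers are hypothetical, and it assumes the JSON has already been parsed (repairing a missing bracket would be a separate step).

```python
import json
from typing import Any, Optional

# Hypothetical schema for illustration: field name -> (expected type, required?)
SCHEMA = {
    "name": (str, True),
    "price": (float, True),
    "tags": (list, False),
}


def salvage_item(item: Any) -> Optional[dict]:
    """Keep the valid fields of one item; return None if a required field is unusable."""
    if not isinstance(item, dict):
        return None
    clean = {}
    for field, (expected, required) in SCHEMA.items():
        value = item.get(field)
        # Accept ints where floats are expected, but never bools.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if isinstance(value, expected):
            clean[field] = value
        elif required:
            return None  # a required field is missing or mistyped: drop this item
        # an invalid optional field is simply left out
    return clean


def salvage_items(raw: str) -> list:
    """Parse a JSON array and keep every item that survives validation."""
    items = json.loads(raw)
    if not isinstance(items, list):
        return []
    salvaged = (salvage_item(item) for item in items)
    return [item for item in salvaged if item is not None]


raw = '[{"name": "A", "price": 1}, {"name": "B", "price": "oops"}, {"name": "C", "price": 2.5, "tags": "x"}]'
print(salvage_items(raw))
# [{'name': 'A', 'price': 1.0}, {'name': 'C', 'price': 2.5}]
```

One bad item (B) is dropped and one bad optional field (the `tags` on C) is omitted, while the rest of the array is kept instead of failing the whole response.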
  • Flux159 1 hour ago
    This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to Markdown first and then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, e.g. tables or other information being lost?

    Then it uses langchain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

    Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.5-flash-models (I keep getting emails from Google about it), so you might want to update that to Gemini 3 Flash in the examples.
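To make the preprocessing question concrete, here is a rough sketch of tag stripping that keeps table cells visible. It is a stand-in written with Python's standard `html.parser`, not turndown and not the library's code; the `TextExtractor` class and `html_to_text` helper are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and marking table cells."""

    SKIP = {"script", "style"}
    BLOCK = {"tr", "p", "br", "div", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in ("td", "th"):
            self.parts.append(" | ")  # keep cell boundaries so rows stay readable
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return compact text: one line per block or table row, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    collapsed = (" ".join(line.split()) for line in lines)
    return "\n".join(line for line in collapsed if line)


page = '<div><style>.a{}</style><table><tr><td class="x">Widget</td><td>9.99</td></tr></table></div>'
text = html_to_text(page)
print(text)
print(len(page), "->", len(text), "characters")
```

Comparing the character counts before and after on real pages, and checking that table rows still appear in the output, is one way to measure both the token savings and the information loss asked about above.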
  • dmos62 1 hour ago
    What's your experience with not getting blocked by anti-bot systems? I see you have custom patches for that.
    • andrew_zhong 38 minutes ago
      The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated: fixing CDP leaks, removing automation flags, etc. For sites behind Cloudflare or Datadome, that alone usually isn't enough; you'll need residential proxies and proper browser fingerprints on top. The library supports connecting to remote scraping browsers via WebSocket and proxy configuration for those cases.
  • AirMax98 20 minutes ago
    This feels like slop to me.

    It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.

    I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.
  • plastic041 2 hours ago
    > Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

    And it doesn't care about robots.txt.
    • andrew_zhong 57 minutes ago
      Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated, for example CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

      Our main use case is retail price monitoring: comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
      • messe 22 minutes ago
        > It's not about bypassing access restrictions.

        Yes. It is. You've just made an arbitrary choice not to define it as such.
  • zx8080 1 hour ago
    Robots.txt anyone?
    • andrew_zhong 57 minutes ago
      Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated, for example CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

      Our main use case is retail price monitoring: comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
      • reyqn 35 minutes ago
        https://news.ycombinator.com/item?id=47340079