Asking AI to build scrapers should be easy right?

(skyvern.com)

150 points by suchintan 112 days ago

21 comments

nithril 112 days ago
The same day, a post on Reddit was about: "We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source" [1]. Not fully equivalent to what Skyvern is doing, but still an interesting approach. [1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_built_3b_and_8b_models_that_rival_gpt5_at_html/
- suchintan 112 days ago
 This is really cool. We might integrate this into Skyvern actually - we've been looking for a faster HTML extraction engine. Thanks for sharing!
_pdp_ 112 days ago
This is exactly the direction I am seeing agents go. They should be able to write their own tools, and we are soon launching something about that. That being said... LLMs are amazing for some coding tasks and fail miserably at others. My hypothesis is that there is some sort of practical limit to how many concepts an LLM can take into account, no matter the context window, given the current model architectures. For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance. I wrote more about this here if you are interested: https://chatbotkit.com/reflections/where-ai-coding-agents-go-to-die
- JimDabell 112 days ago
 > For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance.
 Plan for solving this problem: (1) Build a comprehensive design system with AI models. (2) Catalogue the components it fails on (like yours). (3) These components are the perfect test cases for hiring challenges (immune to “cheating” with AI). (4) The answers to these hiring challenges can be used as training data for models. (5) Newer models can now solve these problems. (6) You can vary this by framework (web component / React / Vue / Svelte / etc.) or by version (React v18 vs React v19, etc.).
 What you’re doing with this is finding the exact contours of the edge of AI capability, then building a focused training dataset to push past those boundaries. Also a Rosetta Stone for translating between different frameworks. I put a brain dump about the bigger picture this fits into here: https://jim.dabell.name/articles/2025/08/08/autonomous-software/
- Groxx 112 days ago
 also training data quality. they are horrifyingly bad at concurrent code in general in my experience, and looking at most concurrent code in existence.... yeah I can see why.
 - disgruntledphd2 112 days ago
 The really depressing part about LLMs (and the limitations of ML more generally) is that humans are really bad at formal logic (which is what programming basically is), and instead of continuing the path of making machines that made it harder for us to get it wrong, we instead decided to toss every open piece of code/text in existence into a big machine that then reproduces those patterns non-deterministically and use that to build more programs. One can see the results in a place where most code is terrible (data science is the place I see this most, as it's what I do mostly) but most people don't realise this. I assume this also happens for stuff like frontend, where I don't see the badness because I'm not an expert.
 - CaptainOfCoit 112 days ago
 > is that humans are really bad at formal logic (which is what programming basically is),
 The tricky part is that I don't think all programming is formal logic at all, just a small part. And the fact that different code serves different purposes really screws up an LLM's reasoning process unless you make it really clear what code is for what.
 - bpt3 112 days ago
 > The tricky part is that I don't think all programming is formal logic at all, just a small part.
 Why do you say this? The foundation of all of computer science is formal logic and symbolic logic.
 CaptainOfCoit 112 days ago
 Lots of parts are more creative or more "for humans" I might say, like building the right abstractions considering the current context and potentially future contexts. There are no "right/wrong" abstractions, just abstractions with different tradeoffs, and lots of things in programming are like this: not a binary "this is correct, this is wrong", but somewhere along a spectrum of "This is what I subjectively prefer considering these tradeoffs". There is a reason a lot of programmers see programming as having lots of similarities with painting and other creative activities.
 bpt3 110 days ago
 Any programmer who doesn't understand the basis of their craft and the environment they're working in isn't a very good one imo.
 CaptainOfCoit 110 days ago
 Problem is that everyone probably agrees with that, but where the line of "the basis" is drawn isn't so widely agreed on. Is "the basis" the physical composition of the hardware components? Understanding assembly? Knowing how a CPU works? How electrons move around inside the whole thing? How all the pieces fit together, including the OS? The space is just so large that everyone has their own "basis" that sometimes even moves with time. They can still be good programmers imo.
 bpt3 110 days ago
 At least part of the basis for every programmer is formal logic. A systems programmer should understand how a CPU works, and a frontend JS developer should know how their target browser works, but they all need to know formal logic.
 wredcoll 112 days ago
 > Why do you say this? The foundation of all of computer science is formal logic and symbolic logic.
 Yes, but it also has to deal with "the real world", which is only logical if you can encode a near infinite number of variables. Instead, we create leaky abstractions in order to actually get work done.
 bpt3 110 days ago
 And those abstractions need to be encoded using symbolic and formal logic.
 - ahazred8ta 111 days ago
 We basically throw rigor out the window and hope it doesn't hit anybody on the way down.
 - Grimblewald 112 days ago
 Or when code is fully vectorizable they default to using loops even if explicitly told not to use loops. Code I got an LLM to write for a fairly straightforward problem took 18 minutes to run. My own solution? 1.56 seconds. I consider myself to be at an intermediate skill level, and while LLMs are useful, they likely won't replace any but the least talented programmers. Even then I'd value a human with critical thinking paired with an LLM over an even more competent LLM.
 - CaptainOfCoit 112 days ago
 Codex (GPT-5) + Rust (with or without Tokio) seems to work out well for me, asking it to run the program and validate everything as it iterates on a solution. I've used the same workflow with Python programs too and it seems to work OK, but not as well as with Rust. Just for curiosity's sake, what language have you been trying to use?
 - Groxx 111 days ago
 mostly Go, because that's at work. for a variety of reasons, I have helped troubleshoot 100+ teams' projects, many of which have had concurrency issues either obvious in nearby code, or causing issues (which is why I was helping troubleshoot). same with several dozen "help us find a way to speed up [this activity]" teams' work. this is not at all a sample of high-quality, well-educated-about-concurrency code, but it does roughly match a lot of Business™ code and also most less-mature open source code I encounter (which is most open source code). it's just not something most people are fluent with. these same people using LLMs have generally produced much worse concurrent code, regardless of the model or their prompting-sophistication or thinking-time, unless it's extremely trivial (then it's slightly better, because it's at least tutorial-level correct) (and yes, they should have just used one of many pre-existing libraries in these cases). doing anything except like "5 workers on this queue plz" consistently ends up with major correctness flaws - often it works well enough while everything is running smoothly, but under pressure or in error cases it falls apart extremely badly... which is true for most "how to write x concurrently" blog posts I run across too - they're over-simplified to the point of being unusable in practice (e.g. by ignoring error handling) and far too inflexible to safely change for slightly different needs. honestly I think it's mostly due to two things: a lack of quality training material (some obviously exists, but it's overwhelmed by flawed stuff), and an extreme sensitivity to subtle flaws (much more so than normal code). so it's both bad at generalizing (not enough transitional examples between targets), and its general lack of ability to actually think introduces flaws that look like normal code, which humans are less likely to notice (due to their own lack of experience). this is not to claim it's not possible to use them to write good concurrent code, there are many counter-examples that show it is. but it's a particularly error-prone area in practice, especially in languages without much safer patterns or built-in verification.
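For illustration only, here is a minimal sketch of the "5 workers on this queue" case with the error path handled explicitly instead of ignored. It is written in Python rather than the Go discussed above, and `process` and `run_pool` are hypothetical names, not anything from the thread. It assumes the input items are hashable and unique.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(item: int) -> int:
    """Hypothetical unit of work; rejects negative input to exercise the error path."""
    if item < 0:
        raise ValueError(f"cannot process {item}")
    return item * 2


def run_pool(items, workers=5):
    """Run process() over items on a fixed-size thread pool.

    Returns (results, errors), both keyed by the input item, so a failing
    task is reported rather than silently dropped, and the remaining
    tasks still run to completion.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the item it was created for.
        futures = {pool.submit(process, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Record the failure and keep draining the other futures.
                errors[item] = exc
    return results, errors


if __name__ == "__main__":
    ok, failed = run_pool([1, 2, -3, 4])
    print("ok:", ok)
    print("failed:", failed)
```

The point of returning errors next to results is that the caller has to decide what a partial failure means, which is the part the over-simplified tutorials tend to skip.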
 - erichocean 112 days ago
 In my experience, because the Clojure concurrency model is just incredibly sane and easy to get right, LLMs have no difficulty with it.
- meowface 112 days ago
 With the upcoming release of Gemini 3.0 Pro, we might see a breakthrough for that particular issue. (Those are the rumors, at least.) I'm sure not fully solved, but possibly greatly improved.
whinvik 112 days ago
I feel like this is how normal work is. When I have to figure out how to use a new app/API etc., I go through an initial period where I am just clicking around, shouting into the ether etc. until I get the hang of it. And then the third or fourth time it's automatic. It's weird, but sometimes I feel like the best way to make agents work is to metathink about how I myself work.
- suchintan 112 days ago
 I have a 2yo and it's been surreal watching her learn the world. It deeply resembles how LLMs learn and think. Crazy
 - Retric 112 days ago
 Odd, I've been struck by how differently LLMs and kids learn the world. You don’t get that whole uncanny valley disconnect, do you?
 - goatlover 112 days ago
 How so? Your kid has a body that interacts with the physical world. An LLM is trained on terabytes of text, then modified by human feedback and rules to be a useful chatbot for all sorts of tasks. I don't see the similarity.
 - crazygringo 112 days ago
 If you watch how agents attempt a task, fail, try to figure out what went wrong, try again, repeat a couple more times, then finally succeed -- you don't see the similarity?
 - dingnuts 112 days ago
 no I see something resembling gradient descent which is fine but it's hardly a child
 - haskellshill 112 days ago
 > try to figure out what went wrong
 LLMs don't do this. They can't think. If you just use one for like five minutes it's obvious that just because the text on the screen says "Sorry, I made a mistake, there are actually 5 r's in strawberry", it doesn't mean there's any thought behind it.
 crazygringo 111 days ago
 I mean, you can literally watch their thought process. They try to figure out reasons why something went wrong, and then identify solutions. Often in ways that require real deduction and creativity. And have quite a high success rate. If that's not thinking, then I don't know what is.
 haskellshill 107 days ago
 You're arguing in circles. They don't "try to figure out reasons" because there is no concept of "trying", "figuring out" or "reasons".
 > If that's not thinking, then I don't know what is.
 How about actual thinking, you know, what humans and to a lesser extent animals do?
 - balder1991 112 days ago
 No, because an agent doesn’t learn, it’s just continuing a story. A kid will learn from the experience and at the end will be a different person.
 CaptainOfCoit 112 days ago
 You just haven't added the right tool together with the right system/developer prompt. Add `add_memory` and `list_memory` tools (or automatically inject the right memories for the right prompts/LLM responses) and you have something that can learn. You can also take it a step further and add automatic fine-tuning once you start gathering a ton of data, which will rewire the model somewhat.
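As a rough sketch of what such a pair of tools could look like (an assumption for illustration, not any particular agent framework's API): the tool names `add_memory` and `list_memory` come from the comment above, while `MemoryStore`, `build_prompt`, and the JSON file layout are hypothetical.

```python
import json
from pathlib import Path


class MemoryStore:
    """Hypothetical note store backing the add_memory / list_memory tools.

    Notes are persisted to a JSON file so they survive between agent runs.
    """

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier notes if the file exists, otherwise start empty.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else []

    def add_memory(self, note: str) -> None:
        """Tool the model calls after a correction, e.g. a lesson learned."""
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes))

    def list_memory(self) -> list:
        """Tool the model (or the harness) calls to read back stored notes."""
        return list(self.notes)


def build_prompt(store: MemoryStore, task: str) -> str:
    """Inject stored notes ahead of the task, the 'automatic' variant."""
    lessons = "\n".join(f"- {note}" for note in store.list_memory())
    header = f"Lessons from earlier runs:\n{lessons}\n\n" if lessons else ""
    return f"{header}Task: {task}"
```

Nothing in the model changes here; the behaviour on a later run differs only because the prompt does, which is the distinction the replies below argue about.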
 haskellshill 112 days ago
 Perhaps it can improve but it can't learn because that requires thought. Would you say that a PID regulator can "learn"?
 CaptainOfCoit 111 days ago
 I guess it depends on what you understand "learn" to mean. But in my mind, if I tell the LLM to do something and it does it wrong, then I ask it to fix it, and in the future I ask the same thing and it avoids the mistake it made the first time, then I'd say it has learned to avoid that pitfall. I know very well it hasn't "learned" like a human would, I just added it to the right place, but for all intents and purposes it "learned" how to avoid the same mistake.
 haskellshill 107 days ago
 This is a silly definition of learning, and anyway, LLMs can't even do what you describe.
 - deadbabe 112 days ago
 A person is not their body. The person is the data that they have ingested and trained on through the senses that are exposed by their body. Body is just an interface to reality.
 - haskellshill 112 days ago
 That is a very weird and fringe definition of what a person is.
 deadbabe 111 days ago
 If you have a different life experience than what you had so far, wouldn’t you be a different person?
 haskellshill 107 days ago
 If you had one fewer leg, wouldn't you be a different person? Implication is not equivalence.
 - haskellshill 112 days ago
 > It deeply resembles how LLMs learn and think
 What? LLMs don't think or learn in the sense humans do. They have absolutely no resemblance to a human being. This must be the most ridiculous statement I've read this year.
 - melagonster 112 days ago
 I am sorry, but you are scoffing at the humanity of your kid; you know that, right?
pennaMan 112 days ago
Yes, it is easy. LLMs have reduced my maintenance work on scraping tasks I manage (lots of specialized high-traffic adfield sites) by 99%. What used to be a constant, almost daily chore, with them breaking all the time at random intervals, is now a self-healing system that rarely ever fails.
- silver_sun 112 days ago
 Interesting. Could you elaborate? Is there a specific reason that it doesn't do 100% of the work already?
- ACCount37 112 days ago
 One of the uses for AI I'm excited about - maintaining systems, keeping up with the moving targets.
- TheTaytay 112 days ago
 Could you elaborate on your setup please?
- suchintan 112 days ago
 That's the dream
hamasho 112 days ago
Off topic, but because the article mentioned improper usage of DOM, I'm linking the UK government's design system/accessibility guidance [1]. It's well documented, and I hope all governments have the same standard. I guess they paid a huge amount of money to consultants and vendors. [1] https://design-system.service.gov.uk/components/radios/
philipbjorge 112 days ago
We had a similar realization here at Thoughtful and pivoted towards code generation approaches as well. I know the authors of Skyvern are around here sometimes -- How do you think about code generation with vision based approaches to agentic browser use like OpenAI's Operator, Claude Computer Use and Magnitude? From my POV, I think the vision based approaches are superior, but they are less amenable to codegen IMO.
- suchintan 112 days ago
 Unrelated, but Thoughtful gave us some very very helpful feedback early in our journey. We are big fans!
- suchintan 112 days ago
 I think they're complementary, and that's the direction we're headed. We can ask the vision based models to output why they are doing what they are doing, and fall back to code-based approaches for subsequent runs.
Ldorigo 111 days ago
I wonder why the focus on replaying UI interactions, rather than just skipping one step ahead to the underlying network/API calls? I've been playing around with similar ideas a lot recently, and I indeed started out with an approach similar to what is described in the article - but then I realized that you can get much more robust (and faster-executing) automation scripts by having the agents figure out the exact network calls to replay, rather than clicking around in a headless browser.
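A minimal sketch of what "replaying the network call" can mean in practice, using only the Python standard library. The endpoint URL, the `q` and `page` parameters, and the function names are all hypothetical stand-ins for whatever an agent would actually observe in the browser's network tab.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, as it might be observed in the browser's network tab.
API_URL = "https://example.com/api/search"


def build_search_request(query: str, page: int = 1) -> urllib.request.Request:
    """Recreate the request the page's own JavaScript would send."""
    params = urllib.parse.urlencode({"q": query, "page": page})
    return urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"Accept": "application/json"},
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON body; no browser involved."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the request; call fetch_json() against a real endpoint.
    req = build_search_request("blue widgets", page=2)
    print(req.get_method(), req.full_url)
```

Splitting request construction from sending keeps the replayed call easy to inspect and diff when the site changes its parameters.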
franze 112 days ago
In AI First workshops, by now I tell them for the last exercise: "no scrapers". The lesson is to separate reasoning (AI) from data (which you have to bring). AI-coded scrapers seem logical, but always fail. Scraping is a scaling issue, not a reasoning challenge. Also, the most interesting websites are not keen on new scrapers.
showerst 112 days ago
A point orthogonal to this: consider whether you need browser automation at all. If a website isn't using Cloudflare or a JS-only design, it's generally better to skip Playwright. All the major AIs understand Beautiful Soup pretty well, and they're likely to write you a faster, less brittle scraper.
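To make the "no browser" option concrete, here is a small sketch that pulls table rows out of static HTML. Note the swap: the comment above recommends Beautiful Soup, but this uses the standard library's `html.parser` so it runs without extra packages. The `TableRows` class, `parse_rows`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def parse_rows(html: str):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = (
    "<table><tr><th>Agency</th><th>Budget</th></tr>"
    "<tr><td>Parks</td><td>12</td></tr></table>"
)

if __name__ == "__main__":
    print(parse_rows(SAMPLE))
```

In a real scraper the HTML string would come from a plain HTTP fetch, which is where the speed and robustness gain over a headless browser comes from.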
- Etheryte 112 days ago
 The vast majority of the modern internet falls into one of those two buckets though, no?
 - showerst 112 days ago
 I mostly scrape government data so the sites are a little 'behind' on that trend, but no. Even JS heavy sites are almost always pulling from a JSON or GraphQL source under the hood. At scale, dropping the heavier dependencies and network traffic of a browser is meaningful.
 - suchintan 112 days ago
 Yeah, reverse engineering APIs is another fantastic approach. They aren't enough if you are dealing with wizards (eg typeform), but they can work really well
- suchintan 112 days ago
 IF you can use crawlers, definitely do. They aren't enough for anything that's login-protected, or requires interacting with wizards (eg JS, downloading files, etc)
- pavel_lishin 112 days ago
 If.
moomoo11 112 days ago
I tried Skyvern like 6 mo ago and it didn’t work for scraping a site that sounds like welp. Ended up doing it myself. Was trying to scrape data across the Bay Area. That said, I’d try it again but I don’t want to spend money again.
- randunel 112 days ago
 That 'welp' probably has a tonne of bot detection going on, given its popularity and the sheer amount of data it makes available without an account.
fsckboy 109 days ago
>Asking AI to build scrapers should be easy right?
AI, build me a scraper
what do you want to scrape
[lists sites to scrape]
oh, I've already scraped those relentlessly, here ya go
pu_pu 112 days ago
Not at all in my opinion. It's a zero sum game against anti bot technologies also employing AI to block scrapers.
pcblues 112 days ago
You gain experience getting interactions with other agencies optimised by dealing with them yourself. If the AI you rely on fails, you are dead in the water. And I'm speaking as a fairly resilient 50 year old with plenty of hands-on experience, but concerned for the next generation. I know generational concern has existed since the invention of writing, and the world hasn't fallen apart, so what do I know? :)
pyuser583 112 days ago
Over the past few days I've spent a lot of time dealing with terribly designed UIs. Some legitimate and desired use cases are impossible because poor logic excludes them. Is AI capable of saying, "This website sucks, and doesn't work - file a complaint with the webmaster?" I once had similar problems with the CIA's World Factbook. I shudder to think what an AI would do there.
- suchintan 112 days ago
 It's funny, one time we had a customer that wanted to use us to test their website for bugs. Skyvern kept suggesting improvements unrelated to the issue they were testing for.
 - pyuser583 112 days ago
 So how do clients process this sort of feedback? As a dev, “negative user feedback” gives me scares that “failed behavior testing” does not. The AI isn’t mad, and won’t refuse to renew. Unless it’s being run by the client of course. Are clients using your platform to assess vendors?
 - suchintan 112 days ago
 No, we don't have a lot of usage in that direction. People mainly use us to log into websites and either fill out forms or download files!
guluarte 112 days ago
the hardest part of scraping is bypassing Cloudflare/captchas/fingerprinting etc
  - fragmede 112 days ago
  The hardest part is not telling anyone how you're bypassing it!
  - ThatPlayer 112 days ago
    I can talk about this bypass because they've fixed it: a site I was scraping rolled their own custom captcha that was just multiple choice. But they didn't have a nonce, so I would just attempt all the choices, and one of them would let me in.
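Seen from the site's side, the missing piece was a single-use token tied to each challenge. A hedged sketch of that fix, with a hypothetical `ChallengeStore` kept in memory (a real deployment would also need expiry and shared storage):

```python
import secrets


class ChallengeStore:
    """Hypothetical server-side store: one answer attempt per issued challenge."""

    def __init__(self):
        self._pending = {}  # nonce -> correct choice

    def issue(self, correct_choice: str) -> str:
        """Create a challenge and return the nonce to embed in the form."""
        nonce = secrets.token_urlsafe(16)
        self._pending[nonce] = correct_choice
        return nonce

    def verify(self, nonce: str, choice: str) -> bool:
        """Check an answer; the nonce is consumed whether or not it is right."""
        # pop() removes the nonce, so a wrong guess cannot be retried
        # against the same challenge with the next option.
        expected = self._pending.pop(nonce, None)
        return expected is not None and expected == choice
```

With this in place, trying every option in turn fails after the first wrong guess, because each further attempt needs a fresh challenge.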
    - jimrandomh 112 days ago
      The captcha put you on notice that your scraping wasn't authorized. Depending on the details and circumstances, bypassing it and scraping anyways may have been a crime.
- suchintan 112 days ago
  Definitely. What are your thoughts on the Cloudflare agent identity?
herpdyderp 112 days ago
I'd be all over Skyvern if only they had enterprise compliance agreements available.
- suchintan 112 days ago
  We do have them! We are HIPAA compliant, have SOC 2 Type 2, and offer self-hosted deployments
  - herpdyderp 112 days ago
    Thank you for responding! Where is your compliance information? How do I sign a BAA?
    - suchintan 112 days ago
      Send me an email suchintan@skyvern.com - we can get you started
ahstilde 112 days ago
this matches our personal experience, too
bigiain 111 days ago
So I wonder how to fingerprint and block this? While I can see _some_ good uses for it, there are clearly abusive uses for it, including in their examples. I mean jesus fuck, who wants cheap/free automation out there that does this: "Skyvern can be instructed to navigate to job application websites like Lever.co and automatically generate answers, fill out and submit the job application."? I already have to deal with enough totally unsuitable scattergun job applications every time we advertise an open position. This is just asking to be used for abuse.
jimrandomh 112 days ago
Your example use case is automatically filling out an IRS form, operated by the sort of IRS department that makes a webform that's only up during business hours? Do you realize how legally risky that is to create, and how legally risky that will be to operate?
- suchintan 111 days ago
  What are some of the risks? This is a public web form available on the IRS website
  - jimrandomh 111 days ago
    If you're automating filling out the form, you aren't reading the instructions and you aren't checking what you're putting into it as much as you should be. And if you put in incorrect information, it tends to be considered fraud, even if it's downstream of a sloppy LLM rather than downstream of a particular fraudulent scheme.
    - suchintan 111 days ago
      You're right, but this is where the LLMs are especially useful. Our customers all prompt it to terminate if it doesn't have the right information / the pre submission confirmation doesn't match
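One way to picture that "terminate if it doesn't match" rule outside the prompt is a plain check run before the final click. This is only an illustrative sketch, not how Skyvern implements it; `SubmissionAborted`, `check_before_submit`, and the field names in the usage example are hypothetical.

```python
class SubmissionAborted(Exception):
    """Raised instead of submitting when the form state cannot be trusted."""


def check_before_submit(expected: dict, confirmation: dict, required: tuple) -> None:
    """Abort unless every required field is supplied and echoed back unchanged.

    expected:     values the caller intended to enter
    confirmation: values read back from the pre-submission confirmation page
    required:     field names that must be present and must match
    """
    missing = [field for field in required if not expected.get(field)]
    if missing:
        raise SubmissionAborted(f"missing required input: {missing}")

    mismatched = [
        field for field in required if confirmation.get(field) != expected[field]
    ]
    if mismatched:
        raise SubmissionAborted(f"confirmation page differs on: {mismatched}")


if __name__ == "__main__":
    wanted = {"legal_name": "Example LLC", "state": "CA"}
    shown = {"legal_name": "Example LLC", "state": "CO"}
    try:
        check_before_submit(wanted, shown, required=("legal_name", "state"))
    except SubmissionAborted as exc:
        print("not submitting:", exc)
```

A deterministic guard like this does not remove the risk raised above, but it means a wrong value stops the run instead of reaching the submit button.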
claysmithr 112 days ago
I misread this as 'sky scrapers'