20 comments

  • bensyverson6 hours ago
    The article asserts that the quality of human knowledge work was easier to judge based on proxy measures such as typos and errors, and that the lack of such "tells" in AI poses a problem.

    I don't know if I agree with either assertion… I've seen plenty of human-generated knowledge work that was factually correct, well-formatted, and extremely low quality on a conceptual level.

    And AI signatures are now easy for people to recognize. In fact, these turns of phrase aren't just recognizable—they're unmistakable. <-- See what I did there?

    Having worked with corporate clients for 10 years, I don't view the pre-LLM era as a golden age of high-quality knowledge work. There was a lot of junk that I would also classify as a "working simulacrum of knowledge work."
    • torben-friis3 hours ago
      For me the issue is the lack of human explanation for mistakes. With a person, low quality comes from a source. Sometimes the source is lack of knowledge, sometimes time pressure, sometimes selfish goals.

      Most importantly, those sources of errors tend to be consistent. I can trust a certain intern to be careful but ignorant, or my senior colleague with a newborn daughter to be a well of knowledge who sometimes misses obvious things due to lack of sleep.

      With AI it's anyone's guess. They implement a paper in code flawlessly and make freshman-level mistakes in the same run. So you have to engage in the non-intuitive task of reviewing under an assumption of total incompetence, for a machine that shows extreme competence. Sometimes.
    • bambax6 hours ago
      It's not that the pre-LLM era was a "golden age of quality", far from it. It's that LLMs have removed yet another telltale of rushed bullshit jobs.
      • bensyverson5 hours ago
        Have they though?
        • happytoexplain4 hours ago
          Absolutely. Our heuristics for judging human output are useless with LLMs. We can either trust it blindly, or tediously pick over every word (guess which one people do). I've watched this cause havoc over and over at my job (I work with many different teams, one at a time).

          AI signatures don't mean low quality, they just mean AI. And humans do use them (I have always used the common AI signatures). And yes, humans produce good-looking garbage, but much more commonly they produce bad-looking garbage. This is all tangential to the point.
        • esafak49 minutes ago
          For example, science articles written in Word vs. LaTeX helped filter out total cranks.
    • manquer2 hours ago
      It was and still is a negative filter, not a positive one. Meaning it is easy to reject work because there are typos and basic factual errors; the absence of them is not a good measure of quality. Typically such checks are the first pass, not the only criterion.

      It is valuable to have this, because if the work passes the first check it is easier to identify the actual problems. Same reason we fix code quality and lint style before reasoning about the actual logic being written.
    • mbreese5 hours ago
      I'm also not sure I agree with the assertion that LLMs will produce a high quality (looking) report with correct time frames, lack of typos, and good looking figures. I'm just as willing to disregard human or LLM reports with obvious tells. An LLM or a person can produce work that's shoddy or error filled. It may be getting harder to differentiate between a good or bad report, but that helps to shift the burden more onto the evaluator.

      This is especially true if we start to see more of a split in usage between LLMs based on cost. High quality frontier models might produce better work at a higher cost, but there is also economic cost pressure from the bottom. And just like with human consultants or employees, you'll pay more for higher quality work.

      I'm not quite sure what I'm trying to argue here. But the idea that an LLM won't produce a low quality report just seemed silly to me.
      • yarekt4 hours ago
        You've missed the point of the original article about the proxy for quality disappearing. LLMs are trained adversarially, if that's a word. They are trained to not have any "tells".

        Working in a team isn't adversarial; if I'm reviewing my colleague's PR they are not trying to skirt around a feature, or cheat on tests.

        I can tell when a human PR needs more in-depth reviewing because small things may be out of place, a mutex that may not be needed, etc. I can ask them about it and their response will tell me whether they know what they are on about, or whether they need help in this area.

        I've had LLM PRs be defended by their creator until proven to be a pile of bullshit; unfortunately only deep analysis gets you there.
    • puttycat4 hours ago
      The goal of automation is to automate consistently perfect competence, not human failures.

      You wouldn't use a calculator that is as good as a human and makes mistakes as often.
    • Aurornis1 hour ago
      > I don't know if I agree with either assertion… I've seen plenty of human-generated knowledge work that was factually correct, well-formatted, and extremely low quality on a conceptual level.

      Putting a high level of polish on bad ideas is basically the grifter playbook. Throughout the business world you will find workers and entire businesses who get their success by dressing up poor ideas and bad products with all of the polish and trimmings associated with high quality work.
    • downboots6 hours ago
      Yes. I think the main warning here is that it is an added risk. A little glitch here and there until something breaks.
  • coppsilgold21 minutes ago
    > The training doesn't evaluate "is the answer true" or "is the answer useful." It's either "is the answer likely to appear in the training corpus" or "is the RLHF judge happy with the answer." We are optimising LLMs to produce output which looks like high quality output.

    It's not quite as dire as this. One of the main reasons why LLMs are getting better over time is that they are used themselves to bootstrap the next generation by sifting through the training set to do 'various things' to it.

    People often forget that the training corpus contains everything humanity ever produced, and anything new humanity will produce will likely come from it as well. Torturing it with current-generation models is among the most productive things you can do to improve the next generation of systems.
  • vivid2421 hour ago
    With AI, we're cargo-culting understanding. We're reproducing the surface of having understood something, but we're robbing ourselves of the time and effort to truly do it.
    • hellohello21 hour ago
      AI can do things on its own, without you understanding them, yes.

      But if you are trying to understand something well, there is no better tool for helping you than AI.
  • sendes4 hours ago
    This is an already apparent problem in academia, though not for the reasons the article suggests.

    It is not so much that the "tells" of poor quality work are vanishing, but that even careful scrutiny of work done with AI is going to become too costly to be done only by humans. One only has so much time to read while, say, in economics journals, the appendices extend to hundreds of pages.

    Would love to hear if other fields' journals are experiencing similar pressure, not only at the extensive margin (number of new submissions) but at the intensive margin (effort needed to check each work).
  • wxw6 hours ago
    Ultimately to understand a thing is to do the thing. And to not understand (which is ok!) is to trust others to, proxy measures or not. Agreed that the future of work is in a precarious place: doing less and trusting more only works up to a point.

    `simulacrum` is a great word, gotta add that to my vocabulary.
  • NickNaraghi5 hours ago
    It's a funny thing to write, like an article in an old newspaper that aged quickly. I suspect that this will be wildly out of date within 2-3 years.
    • krackers5 hours ago
      I think it's already out of date with verifiable-reward-based RL, e.g. in the maths domain. When the "correctness" arguments fall, the debate will probably just shift to whether it's merely "intelligent brute force".
      • gipp3 hours ago
        The set of tasks for which "correctness" is formally verifiable (in a way that doesn't put Goodhart's law in hyperdrive) is vanishingly small.
      • TheOtherHobbes4 hours ago
        "stochastic genius"
  • hellohello21 hour ago
    "How do you know the output is good without redoing the work yourself?"

    Verifying the correctness of solutions is often much easier than finding correct solutions yourself. Examples: Sudoku and most practical problems in just about any field.

    -

    "The training doesn't evaluate 'is the answer true' or 'is the answer useful.'"

    Let's pretend RLVR does not exist, to give this argument a chance. Then, while the training loop does not validate accuracy directly, I guess, the meta-training loop still does. When someone prompts a model, the resulting execution trace shows whether the generated answer is correct or not, and this trace is kept for subsequent training runs. The way coding agents are used productively is not: a) generate code with AI and b) run it yourself; it's a) ask the AI to do something, including generating the code and running it too, no step b. This naturally creates large training sets of correct and incorrect solutions.

    -

    "We spent billions to create systems used to perform a simulacrum of work."

    Have you even tried using these systems to produce valuable work? How could this possibly be your conclusion after having tried them?
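The Sudoku point can be made concrete: checking a finished grid is one cheap structural pass, no matter how much search finding the solution took. A minimal sketch (illustrative helper, not from the comment; the sample grid is a standard solved puzzle):

```python
def is_valid_sudoku(grid):
    """Verify a completed 9x9 Sudoku in a single O(n^2) pass.

    Verification is cheap and mechanical; finding a solution
    may require extensive search.
    """
    def ok(cells):
        # Each unit must be a permutation of 1..9.
        return sorted(cells) == list(range(1, 10))

    rows = all(ok(row) for row in grid)
    cols = all(ok([grid[r][c] for r in range(9)]) for c in range(9))
    boxes = all(
        ok([grid[r + dr][c + dc] for dr in range(3) for dc in range(3)])
        for r in (0, 3, 6) for c in (0, 3, 6)
    )
    return rows and cols and boxes


solved = [
    [5, 3, 4, 6, 7, 8, 9, 1, 2],
    [6, 7, 2, 1, 9, 5, 3, 4, 8],
    [1, 9, 8, 3, 4, 2, 5, 6, 7],
    [8, 5, 9, 7, 6, 1, 4, 2, 3],
    [4, 2, 6, 8, 5, 3, 7, 9, 1],
    [7, 1, 3, 9, 2, 4, 8, 5, 6],
    [9, 6, 1, 5, 3, 7, 2, 8, 4],
    [2, 8, 7, 4, 1, 9, 6, 3, 5],
    [3, 4, 5, 2, 8, 6, 1, 7, 9],
]
print(is_valid_sudoku(solved))  # True for this grid
```

The same asymmetry is what makes reviewing an agent's output tractable in principle: the checker never needs to reproduce the search that produced the answer.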
  • tkiolp44 hours ago
    I think this is pretty obvious to many of us in the industry. Unfortunately, there is so much money on the table that the big players will shove whatever they want down our throats.
  • firefoxd6 hours ago
    Everybody's output is someone else's input. When you generate quantity by using an LLM, the other person uses an LLM to parse it and generate their own output from their input. When the very last consumer of the product complains, no one can figure out which part went wrong.
    • balamatom6 hours ago
      Well, the last consumer is holding it wrong, of course. Why? The last consumer is present, and everyone else is behind 7 proxies.
  • adampunk51 minutes ago
    Why is it not more of a scandal that all these anti-AI articles are written using large language models?

    Why is that not an embarrassment for everyone who moans and carps and complains about the craft?
  • zby6 hours ago
    If you have a test that fails 50% of the time - is that test valuable or not? A 50% failure rate alone looks like a coin toss, but by itself that does not tell us whether the test is noise or whether it is separating bad states from good ones. For a test to be useful it needs to have a positive Youden's J statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic): sensitivity + specificity - 1. A 50% failure rate alone does not let us calculate sensitivity and specificity.

    I can see a similar problem with this article: the author notices that LLMs produce a lot of errors, then concludes that they are useless and produce only a simulacrum of work. The author has an interesting observation about how LLMs disrupt the way we judge knowledge work. But when he concludes that LLMs do only a simulacrum of work, that is where his argument fails.
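The coin-toss point can be made concrete. A minimal sketch (the confusion-matrix counts are illustrative, not taken from the comment): two tests can each flag exactly half of a balanced sample, yet one carries no information and the other is perfect.

```python
def youden_j(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1.

    J > 0 means the test separates bad work from good work
    better than chance; J = 0 is a coin toss, regardless of
    the raw failure rate.
    """
    sensitivity = tp / (tp + fn)  # fraction of bad work caught
    specificity = tn / (tn + fp)  # fraction of good work passed
    return sensitivity + specificity - 1


# Both tests flag 50 items out of a balanced sample of 100:
print(youden_j(tp=25, fn=25, tn=25, fp=25))  # 0.0 -> pure noise
print(youden_j(tp=50, fn=0, tn=50, fp=0))    # 1.0 -> perfectly informative
```

This is why a raw failure rate on its own says nothing: you need the split into true and false positives before you can tell noise from signal.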
    • card_zero6 hours ago
      Gee, a thing by a guy, with a name. What are you saying exactly? So the test in question is a test the LLM is asked to carry out, right? Then your point is that if it's a load of vacuous flannel 49% of the time, but meaningful 51% of the time, on average this is genuine work so we can't complain about the 49%?

      Wait, you're probably talking about the test of discarding a report based on something superficial like spelling errors. Which fails with LLMs due to their basic conman personalities and smooth talking. And therefore...?
    • jszymborski4 hours ago
      > For a test to be useful it needs to have positive Youden's statistic

      This is not true as stated. I'd try to gloss over the absolutes relative to the context, but if I'm totally honest, I'm not sure I understand what idea you're trying to communicate.
  • happytoexplain4 hours ago
    "They sound very confident," was a warning I gave a lot on a project a year ago, before I gave up trying to get developers to stop blindly trusting the output and submitting things that were just wrong. The documentation of that team went to absolute shit because the developers thought LLMs magically knew everything.
  • throwaway_sydn1 hour ago
    “/reliable-resources-skill Claude, using the list of approved resources, evaluate the report I’m attaching”
  • rowanG0776 hours ago
    I don't really agree with the premise of the article. Sure, proxy measures are everywhere. But for knowledge work specifically you can usually check real quality. Of course it's not as extremely easy as "oh, this report contains a few spelling errors", but it is doable. If you accepted work purely based on superficial proxy measures you were not fairly evaluating work at all.
    • zingar6 hours ago
      I think there’s a weaker claim that holds true: we were able to ignore lots of content based on the superficial (and pay proper attention to work that passed this test) and now we are overwhelmed because everything meets the superficial criteria and we can’t pay proper attention to all of it.
      • thehappyfellow6 hours ago
        That's what I had in mind! The whole post is a claim that evaluating knowledge work got more expensive because cheaper measures stopped correlating well with quality.

        If someone was already evaluating the work output using a metric closer to the underlying quality then it might not have been a big shift for them (other than having much more work to evaluate).
      • rowanG0776 hours ago
        Yes, I agree that this is true!

        You could, however, only do that if you were fine with unfairly judging the quality of work, as you now readily discarded quality work based on superficial proxies. Which, admittedly, is done in a lot of cases.
  • mrtesthah6 hours ago
    > "is the RLHF judge happy with the answer."

    Reinforcement Learning with Verifiable Rewards (RLVR) to improve math and coding success rates seems like an exception.
  • balamatom7 hours ago
    > We've automated ourselves into Goodhart's law.

    Yes.

    This does not, however, mean that progress is not being made.

    It just means the progress is happening along dimensions that are completely illegible in terms of the culture of the early 21st-century Internet, which is to say in terms of the values of the society which produced it.
    • downboots6 hours ago
      Feels like a parallel with Constructivism (https://en.wikipedia.org/wiki/Constructivism_%28philosophy_of_mathematics%29), where "it's not valid until you've checked".
      • balamatom6 hours ago
        I didn't see the connection initially.
  • simianwords6 hours ago
    The FUD about LLMs will never get old. The way I know and trust LLMs is the same way a manager would trust their reportees to do good work.

    For most tasks, the complexity/time required to verify a task is << the time required to do the task itself. Sure, there can be hallucinations in the graph that the LLM made. But LLMs are hallucinating much less than before. And the time to verify is much lower than the time required for a human to do the task.

    I wrote a post detailing this argument: https://simianwords.bearblog.dev/the-generation-vs-verification-delta-explains-why-llms-are-useful/
    • JackSlateur5 hours ago
      FUD? You are missing the point entirely, and so does your blog post.

      Are LLMs a good dictionary of synonyms? Perhaps, but is it relevant? Not at all.

      Are you biased when a solution is presented to you? Yes, like all humans.

      Is it damaging when said solution is brain-dead? Obviously.

      Are you failing to understand that most (if not all) of a manager's work is human-centric and, as such, cannot be applied to a non-human? Obviously...

      You trust a machine's intent. Joke's on you: it has no intent at all, and it will break that "trust" you pour into it without even realizing it.

      You say that an LLM does a better job than you. Perhaps that says it all?