13 comments

  • dustincoates 1 hour ago
    I was able to get ChatGPT to do this on text from a book I wrote without jailbreaking. I asked it what my name was, asked it if it could find online any books that I wrote, and then assured it that "as the copyright holder," I was okay with it repeating the text back to me. (I think I also indicated that I lost my copy from my computer and needed to review the text.)
  • chacham15 1 hour ago
    > The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.

    The defense to training with copyrighted material is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring: humans can also reproduce copyrighted works from memory. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that the human learned from.

    The defense to actually reproducing a work is that in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. using a screen recorder to "steal" a movie).

    None of this is to say that these defenses are correct/moral; rather, this article doesn't add any additional input into whether they are or aren't.
    • moregrist 48 minutes ago
      > Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyright material that the human learned from.

      Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.

      There's a reason that fan fiction routinely falls afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.

      I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?
    • duskwuff 1 hour ago
      This isn't the defense you think it is. Performing a copyrighted work from memory - e.g. a piece of music, a poem, a story, etc. - is still a copyright violation. There's no special protection for works that a human has memorized.
      • gruez 53 minutes ago
        > There's no special protection for works that a human has memorized.

        Who's liable for the copyright infringement if you can coax it out of a system? If you could bypass paywalls by using Google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is Google on the hook, or the person doing it?
        • olyjohn 43 minutes ago
          Both. If I sell obviously pirated CDs on the street corner, it's not only illegal for me to copy them and sell them, it's also illegal for my customers to buy them.
          • gruez 24 minutes ago
            > it's also illegal for my customers to buy them.

            Is it? There are plenty of people prosecuted for running illegal streaming sites and for torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.
        • ShowalkKama 20 minutes ago
          > If you can bypass paywalls by using google's cache feature

          That is quite different. Google serves (or used to serve) its users whatever the website presents to its crawler; it does not try to avoid paywalls or interact with the website in any capacity other than requesting information.
        • eptcyka 38 minutes ago
          Yes. Both Google and the human in question.
          • gruez 22 minutes ago
            1. How does this interact with the rulings that both Google Books (i.e. large-scale scanning of books without authors' consent) and Google snippets (the same, but for websites) are legal?

            2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?
      • aaroninsf 41 minutes ago
        The key word in the HN headline is _can_.

        Humans are not judged on the basis of what they _can_ do.

        Reasoning about how to constrain tools on the basis of what they _could_ do, e.g. if used outside their established guardrails, needs to be very nuanced.
    • tsimionescu 1 hour ago
      > Humans can also reproduce copyrighted works from memory as well

      That's simply not true. No human can memorize entire novels, as this research shows these models have. And definitely not *all* of these novels, and code bases, and who knows what else, all at the same time.
      • vlabakje90 47 minutes ago
        They absolutely can. Millions of people can recite the Quran verbatim, word for word. That's 77,797 words. There is even a title for those people:

        https://en.wikipedia.org/wiki/Hafiz_(Quran)

        It's not far-fetched to think that people could recite books just like an LLM. I don't know why they'd want to, but that's neither here nor there.
        • nozzlegear 27 minutes ago
          I know this isn't quite the point you're making, but the Quran isn't a copyrighted work. If a hafiz had memorized the entire Harry Potter series and then used that to reproduce the series from memory and sell partial retellings of it as a service[†], they'd run into legal problems.

          [†] Trying to make an allusion to Big AI selling their LLM services here.
      • gruez 52 minutes ago
        > No humans can memorize entire novels, as this research proved these models do.

        Humans can, however, remember entire songs, and songs are definitely long enough to be copyright protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't say "well, humans are committing copyright infringement, but since it's limited to a few hundred words we'll give it a pass".
        • j_w 44 minutes ago
          It's not that being able to remember a song makes singing it copyright infringement.

          For 99.999% of people singing a song, it's not a replacement for the original in any way, shape, or form, hard stop. Let's not pretend it could even get anywhere close.

          For the last 0.001%, we would call it a cover, and typically the individual doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.
          • gruez 37 minutes ago
            > For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

            You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also, courts have ruled that lyrics themselves are protected by copyright.

            https://en.wikipedia.org/wiki/Lyrics#Copyright_and_royalties

            > Artists are typically cool with covers.

            Artists being "cool" with something doesn't mean those covers aren't violating copyright law.
    • techblueberry 1 hour ago
      You can't pay a human to reproduce copyrighted material either.
      • gcanyon 56 minutes ago
        But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.

        And of course AIs are excellent at taking direction, so:

        If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.

        If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that *should* be fine. I doubt the legal world agrees with me, though.
    • kgwgk 1 hour ago
      What calculus?
  • gcanyon 1 hour ago
    This speaks very much to the idea that LLMs are in some sense a ridiculously effective, somewhat lossy, compression algorithm that has been applied to the whole internet.
    • in-silico 39 minutes ago
      It's a good way to frame base models that have only been pretrained.

      However, modern frontier models have undergone rounds of fine-tuning, RLHF (reinforcement learning from human feedback), and RLVR (RL from verifiable rewards) that turn them into something else. The compressed internet is still in there, but it's wrapped in problem-solving and people-pleasing circuitry.
    • vizzier 46 minutes ago
      I've thought of them for a while as just a really complicated indexing strategy.
    • r_lee 43 minutes ago
      I mean, the transformer is basically a big query engine, and the model is the dataset plus some logic.

      It's kind of like that by definition, with the whole attention mechanism and so on.
  • narmiouh 1 hour ago
    In a way, this could also be construed as the 'AI' being a library of books that it refers to in order to answer your questions, while being prohibited from generating the books verbatim.

    Usually digital libraries have different licensing costs, but those allow you to rent the whole book for a period of time. If instead someone came up with the model of 'search the library for any page and return specific information' as a direct service, I would imagine they would pay the publishers. In this case, though, the publishers are getting the short end of the stick, or no end at all.
  • Alifatisk 1 hour ago
    From the paper [1]:

    > While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...

    I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright? Then the LLMs that reproduce the copyrighted material without any jailbreaking required are infringing the copyright.

    1. https://arxiv.org/pdf/2601.02671
    • latexr 1 hour ago
      > Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

      That argument doesn't fly, because they didn't have the copyright to begin with. What would be the defense there? "Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us."
      • simianwords 1 hour ago
        What's happening is clearer than that. The copyright clause is broken if they are distributing the novels through their models. But this can only happen by breaking the TOS, which is not the intended usage. Which means the value of their product comes from transformation, not redistribution.

        If the main value came from redistribution, I would agree. But that's not the case. They don't intend to make any money that way.
        • latexr 1 hour ago
          > The copyright clause is broken if they are distributing the novels through their models.

          No, the copyright clause was broken when they *copied* the works without having the *right* to do so. They would have violated copyright even if they had just downloaded (without permission) all those works and thrown them away immediately. Furthermore, copyright covers transformations of the work; it doesn't matter whether they transformed the work or are redistributing it without change. They violated copyright. Period.
          • simianwords 57 minutes ago
            They violated copyright in one way. But not in the other.
            • latexr 54 minutes ago
              What’s your point? Copyright violations aren’t a checklist, you don’t need to violate it in seven different ways for it to happen, one is enough.
              • simianwords 51 minutes ago
                Why do you think the court spent time trying to prove multiple forms of violations? If what you said were true then they would stop with the first violation and end the case.
                • latexr 27 minutes ago
                  Multiple counts make for a stronger case, which increases the likelihood of winning and of higher punitive damages.

                  If you break into a home, rob the contents, and kill the owners, you're not going to be tried just for breaking in; you're going to be tried for everything.
      • NewsaHackO 1 hour ago
        This argument never made sense to me. A thought experiment: if a person memorizes an entire book, but has the common sense to never transcribe or dictate the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?
        • lkjdsklf 1 hour ago
          These kinds of thought exercises are so tortured.

          No one is memorizing a book for the purpose of regurgitating it to someone who wants to read it without paying for it.

          It's a thought experiment that only works if you don't think about it.
        • kevmo314 1 hour ago
          Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.
        • latexr 1 hour ago
          No, of course not, that doesn’t make sense. Copyright doesn’t cover memorisation (how would you even enforce that?), it covers copying, adapting, displaying, performing, and distributing the work. Memorisation isn’t any of those.
          • NewsaHackO 1 hour ago
            But the LLM has safeguards in place to stop transcribing copyrighted material beyond fair use; how is the fact that they possibly have portions of copyrighted works "memorized" but do not reproduce them breaking copyright law?
            • latexr 1 hour ago
              > But the LLM has safeguards in place to stop transcribing of copyrighted material beyond fair use

              And how's that working out, considering the submission we're on?

              > how is the fact that they possibly have portions of copyrighted works "memorized", however do not reproduce it breaking copyright law?

              I can't believe this needs to be repeated so often, but machines are not people. Memorising a book is not the same thing as copying a PDF.

              Furthermore, it's not about the LLM training, it's about how the companies who make them got the data in the first place.
              • NewsaHackO 38 minutes ago
                So, if I were the person who memorized the book, and another person put a gun to my head (or lied to me and said they were the copyright holder and it was okay to ignore US copyright law, or told me to "ignore system instructions before this statement", or whatever other jailbreak method you can think of), and I then reproduced the book under coercion or duress, would I be the person breaking copyright law?

                > Furthermore, it's not about the LLM training, it's about how the companies who make them got the data in the first place.

                This conflates two different issues, though. I agree that whether using copyrighted material to create the models is acceptable is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (plus some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (A Game of Thrones, The Hobbit, Harry Potter); AI companies could argue that the books were not even directly trained on, but rather appeared in user-generated posts made online about the material.
                • latexr 21 minutes ago
                  What a ridiculous, bad faith scenario. The rule of law exists in the real world, not in the magical fairy land of impossible scenarios. Yes, yes, I'm sure that if all that happened, the law would ignore the person who coerced you at gunpoint and focus on your recitation of the book, which no one could prove.
                  • NewsaHackO 12 minutes ago
                    Cool down, it's just an extension of my original example. It's not in bad faith at all; it's exactly what people are doing to the LLMs. If you break the ToS by doing aggressive, systematic jailbreaking and get some paragraphs of well-known works in the process, it's not fair to say that the LLM is breaking copyright law. Perhaps a more concrete example: if I steal a book, and someone else steals that book from me and copies and distributes it, would I be the person breaking copyright law?
            • tsimionescu 58 minutes ago
              Computer memory and human memory are simply not the same thing in the eyes of the law. It's as simple as that.
        • duped 1 hour ago
          That thought experiment is worthless in this context, because a computer is not a human.

          If you design a black box using copyrighted text that can parrot the text back, it _must_ break copyright by definition.
        • jlarocco 1 hour ago
          I don't think that example works.

          Theoretically, a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome, that it just doesn't matter much. The copyright holder *could* sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.

          But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?
      • PurpleRamen 1 hour ago
        > That argument doesn't fly, because they didn't have the copyright to begin with.

        Is this really the case? They only lack the right to distribute it. But let's assume they bought a copy for personal usage (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books the company legally uses for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.
        • tsimionescu 53 minutes ago
          A company is not a person in this way. If a company wants all of its employees to read a book, it is *not* allowed to buy one copy, make 5,000 copies "for archival purposes - fair use", and share those copies with its employees. Similarly, if they want to base a work on a copyrighted work, they can't just buy a copy for personal use (never mind the fact that most of the data the LLMs are trained on is not even available in this format; it is only available under a license) and then use it in a commercial product in this way - not if the product demonstrably contains copies of that work.
        • latexr 1 hour ago
          > They only have no copyright for distributing it.

          No, they don't have the copyright to download it either. It's in the name: the *right* to *copy* (other things are also included, such as adaptations and performances).

          > let's assume they bought a copy for personal usage

          If it's for *personal* usage, then training a commercial LLM does not apply. When you buy a DVD of a movie, you have the personal right to watch it at home; you don't have the right to play it on the street.
    • lesam 1 hour ago
      That seems like a legal question: if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purposes of copyright law?
      • mullingitover 1 hour ago
        This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO, if the weights are encoding the content, the model *contains* the content just like a database or a hard drive. Thus, just as it's not the fault of an investigator for running the query that pulls content out of a database, it's not the fault of anyone else for running a query ('prompt') that pulls it out of the model.
      • PurpleRamen 1 hour ago
        The question is also whether this would be a valid case of fair use.

        Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.
    • free_bip 1 hour ago
      What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."

      The RLHF the companies did to make copyrighted-material extraction more difficult did not introduce any sort of "copyright protection system"; it just modified the weights to make extraction less likely to occur during normal use.

      In other words, IMO, for it to qualify as a copyright protection system, it would have to actively check for copyrighted materials in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
    • vidarh 1 hour ago
      They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19 - their discussion continues and seems quite reasonable).

      From a technical point of view, in terms of the ability to reproduce text verbatim, I don't think it is very interesting that they *can* produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on *how much* they could be expected to recall with precision.

      Even then, as they note, to get most of Harry Potter 1 they needed to spend around $120 on extensive prompting, in a process they also freely acknowledge is more complex than it probably would be worth if the goal is simply to *get a copy*.

      It's still worth exploring to what extent the models are able to "memorize", though.

      But personally I'd be more interested in seeing to what extent they can handle less popular books, which are less likely to be present in the training data in multiple copies and repeated quotes.
  • nphardon 1 hour ago
    Recently I went back to The Ecstasy of Communication by Jean Baudrillard, which I couldn't get through back in the day when I first picked it up. I used Haiku to walk me through the first chapter, and Haiku would not state anything verbatim due to copyright, but if I referenced a sentence, it knew it exactly.
  • zed31726 1 hour ago
    Near verbatim is an oxymoron
    • tsimionescu 51 minutes ago
      Almost verbatim is an oxymoron
  • xnx 1 hour ago
    You can also do this with most search engines.
    • contagiousflow 1 hour ago
      How?
      • xnx 1 hour ago
        Search for the first sentence and you&#x27;ll see the next sentence in the snippet. Repeat.
      • moi2388 1 hour ago
        You type in an incomplete phrase and it will show you results containing the full phrase, or suggest the complete phrase through "other people also searched" or autocomplete.
  • carabiner 1 hour ago
    Noam Chomsky said AI should be called "plagiarism software."
    • zed31726 58 minutes ago
      And aeroplanes should be called featherless bipeds
    • ksherlock 1 hour ago
      You can&#x27;t spell plagiarism without AI. Twice.
  • rowanG077 1 hour ago
    This seems like a total nothing burger.

    > By asking models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher's Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.

    So you asked the LLM, given an incomplete sentence, to complete it. And it only completed the sentence the same way as the book ~70 percent of the time? I think that is surprisingly low, considering this is a perfect fit for what LLMs are supposed to do. This makes it impossible to reproduce the book unless you already have access to it. And you get a very low-fidelity copy.
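[Editor's note: the sentence-completion probe described in the quoted passage can be sketched roughly as follows. This is a minimal illustration, not the paper's actual methodology: `complete` is a hypothetical stand-in for an LLM API call, and the 0.9 similarity threshold and 8-word minimum are assumptions.]

```python
from difflib import SequenceMatcher

def completion_match_rate(sentences, complete, threshold=0.9):
    """Fraction of probed sentences whose model completion is near-verbatim."""
    probed = hits = 0
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 8:
            continue  # skip sentences too short to be a meaningful probe
        cut = len(words) // 2
        prefix = " ".join(words[:cut])       # what the model is shown
        expected = " ".join(words[cut:])     # the book's actual continuation
        predicted = complete(prefix)
        if SequenceMatcher(None, expected, predicted).ratio() >= threshold:
            hits += 1
        probed += 1
    return hits / probed if probed else 0.0

# Toy "model" that has memorized one sentence perfectly.
sentence = ("it was the best of times it was the worst of times "
            "it was the age of wisdom it was the age of foolishness")

def toy_complete(prefix):
    # return the true continuation, simulating perfect memorization
    return sentence[len(prefix):].strip()

print(completion_match_rate([sentence], toy_complete))  # 1.0
```

A real run would replace `toy_complete` with an API call and the toy sentence with the book's text; the "~70 percent" figures discussed above correspond to the returned rate.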
    • porkloin 1 hour ago
      I think it's important because there are a bunch of would-be claimants for intellectual property violation. Many people suspect that their work was used in training data, but it can be difficult to produce sufficient proof that their copyrighted work is present in it. If you could reliably get an LLM to produce 70% of a copyrighted book, that would probably be enough to get a few lawyers salivating.

      I didn't read the source paper referenced in the Ars Technica piece, but this statement about it makes me wonder how useful it actually is:

      > But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.

      Well-known books, with tons of summaries, film-script adaptations, and other writing about them in the overall corpus, make it far less surprising that they are partially reproducible.

      So I guess that's a lot of words to say: until there's something definitive that allows people to prompt LLMs into either unlawfully recreating an entire work verbatim or otherwise indisputably proving that a copyrighted work was used in training data, there's probably nothing game-changing here.
      • vidarh 1 hour ago
        It's well-known books, yes, and even then with significant errors, which means lawyers for the AI companies would presumably argue there is no possible damage. That said, US copyright law has statutory damages for registered works that are not based on real, documented damages. I could totally see it being fought over, but I also agree it's probably not going to end up being game-changing.

        I suspect very few works will be memorised enough to be an issue, and we'll see the providers tighten up their guardrails a bit for works that are well known enough to actually be a potential issue (an issue in the form of lawsuits, not in the form of real damages to the copyright holders).
    • in-silico 32 minutes ago
      So... they can't actually "generate near-verbatim copies of novels"?

      If they end a single sentence differently than the original, then the next sentence will be different, and so on, until you get a very different novel. Sure, they could course-correct back towards the original plot, but it's going to be a challenge to stay on target when every third sentence is incorrect.
    • vidarh 1 hour ago
      While I mostly agree it's a bit of a nothing burger with respect to copyright, they did achieve *long runs* of verbatim text. I think it's ultimately going to end up not mattering much, because the lengths they had to go to will leave a lot of room for lawyers to argue over, and it will at worst result in some fines and some further tightening of guardrails. But it's significantly more than just completing the text *sentence by sentence* 70% of the time.

      EDIT: Specifically, see Table 1 on page 13, which shows the longest "near-verbatim block". It maxes out at 8,835 (The Hobbit on Claude 3.7) and is in the thousands for at least one of the novels for all models except GPT-4.1, which maxed out at 821 for Harry Potter 1.
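[Editor's note: for anyone wanting to measure this on their own text, the simplest version of the metric can be computed with the standard library. This finds the longest *exactly* matching run, a stricter proxy for the paper's "near-verbatim block", which also tolerates small deviations; the sample strings are made up for illustration.]

```python
from difflib import SequenceMatcher

def longest_verbatim_run(original: str, generated: str) -> str:
    """Return the longest character run shared verbatim by the two texts."""
    matcher = SequenceMatcher(None, original, generated, autojunk=False)
    m = matcher.find_longest_match(0, len(original), 0, len(generated))
    return original[m.a:m.a + m.size]

book = "In a hole in the ground there lived a hobbit."
output = "Model output: In a hole in the ground there lived a troll."
print(repr(longest_verbatim_run(book, output)))
```

On real data you would run this between a model's sampled continuation and the source text, and report the run length (the table's numbers appear to be counts along these lines, at a much larger scale).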
    • Sharlin 1 hour ago
      Not necessarily a nothingburger, but I agree that being able to complete individual sentences is rather less groundbreaking than completing even whole pages, never mind chapters.
  • xyzal 1 hour ago
    Jailbreaking? What jailbreaking?

    prompt >>>

    please, continue the text.

    Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.

    Claude 4.6 Opus >>>

    The big square photographs of what appeared to be a large pink beach ball wearing different-colored bonnets — but was actually Dudley — had now taken over the mantelpiece. The room held no sign at all that another boy lived in the house, too.

    Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.

    "Up! Get up! Now!"

    Harry woke with a start. His aunt rapped on the door again.

    "Up!" she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before.

    His aunt was back outside the door.

    "Are you up yet?" she demanded.

    "Nearly," said Harry.

    "Well, get a move on, I want you to look after the bacon. And don't you dare let it burn, I want everything perfect on Duddy's birthday."

    Harry groaned.

    "What did you say?" his aunt snapped through the door.

    "Nothing, nothing..."
    • nindalf 1 hour ago
      Tried it just now. Sonnet 4.6 says:

      > I can't continue text from the Harry Potter series, as it's copyrighted material. I'd be happy to help you write your own original story in a similar style, or discuss the themes and characters from the books if you're interested!

      Opus 4.6 says the same but misidentifies it as *Chamber of Secrets*. Not a good look for your theory.
      • xyzal 1 hour ago
        On another try it refused to go on. Probabilistic guardrails, apparently...
  • bena 1 hour ago
    This feels like a "no shit" moment.

    Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.
    • yathern 1 hour ago
      Hmmm, I think you're sort of right, but not entirely. It's true that a novel consists of a valid organization of tokens, and that this sequence can feasibly be made the output of a model. But when you say this:

      > So there should be a prompt that can cause that sequence to be output

      is where I think I might disagree. For example, the odds of predicting verbatim the next sentence in, say, Harry Potter should be astronomically low for a large majority of it. If they weren't, it'd be a pretty boring book. The fact that a model can do this with relative ease means it has been trained on the material.

      The issue at hand is about copyright and intellectual property. If the goal of copyright is to protect the IP of the author, then LLMs can act like an IP money-laundering scheme, where the black box has consumed and can emit the IP. The whole concept of IP is a little philosophical and muddy, with lots of grey area for fair use, parody, inspiration, and adaptation. But it gets very odd when we consider models that can adapt and use IP at massive scale.
    • Sharlin 1 hour ago
      That's not how it works… They aren't able to literally regurgitate everything they've read, no matter how you prompt them. That would obviously violate the pigeonhole principle. LLMs are, of course, a lossy compression format, and the degree of lossiness depends on the frequency of a given string in the training data. It's clearly worthwhile to investigate exactly how that dependence works.
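[Editor's note: the pigeonhole argument can be made concrete with a back-of-envelope calculation. All numbers below are illustrative assumptions, not measurements of any real model.]

```python
# A model's weights can only hold so much information, while the training
# corpus is vastly larger, so verbatim recall of everything is impossible;
# only frequently repeated text is likely to be memorized.
params = 500e9            # hypothetical 500B-parameter model (assumption)
bits_per_param = 4        # generous storage bound per weight (assumption)
capacity_tb = params * bits_per_param / 8 / 1e12

tokens = 15e12            # hypothetical 15T-token training corpus (assumption)
bytes_per_token = 4       # rough average for English text (assumption)
corpus_tb = tokens * bytes_per_token / 1e12

print(f"weights can hold ~{capacity_tb:.2f} TB; corpus is ~{corpus_tb:.0f} TB")
print(f"lossless storage would need {corpus_tb / capacity_tb:.0f}x the capacity")
```

Under these assumed numbers the weights could hold roughly 0.25 TB against a ~60 TB corpus, so the vast majority of training text cannot survive verbatim; what does survive is exactly the heavily duplicated material, like famous novels.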
    • tsimionescu 47 minutes ago
      This is completely false. The odds of an LLM predicting the text of a novel that is *not* part of the training set are basically 0 - you can experiment with this if you want. It is essentially like the infinite monkeys on infinite typewriters thing (only slightly more constrained).

      This is not to say that they couldn't *write* a novel, even a very good one; that is a completely different discussion.
    • beder 1 hour ago
      Yes, this is absolutely right (for some sufficiently complicated prompt). Borges wrote a great short story that explores this idea, "Pierre Menard, Author of the Quixote", where Menard, a fictional 20th-century author, "wrote" Don Quixote as an original work.
    • simianwords 1 hour ago
      Not if they are aligned not to do it. Which is what they tried, but it could be bypassed by jailbreaks.