36 comments

  • NoSalt 1 hour ago
    > "You can also clone the voice from any audio sample by using our repo."

    Ok, who knows where I can get those high-quality recordings of Majel Barrett's voice that she made before she died?
    • freedomben 14 minutes ago
      TOS computer voice must be my computer's voice. And after every command I run, I need a "Working."
  • pain_perdu 9 hours ago
    I'm psyched to see so much interest in my post about Kyutai's latest model! I'm working on part of a related team in Paris that's building off Kyutai's research to provide enterprise-grade voice solutions. If anyone is building in this space, I'd love to chat and share some of our upcoming models and capabilities, which I am told are SOTA. Please don't hesitate to ping me via the address in my profile.
    • rsolva 3 hours ago
      Woah, I'm impressed! The voice cloning also worked much better than expected! Will there be separate models for other languages? I know the National Library in Norway has done a good job curating speech datasets with many different dialects [1][2].

      [1] https://data.norge.no/en/datasets/220ef03e-70e1-3465-a4af-edd6b8390233/nb-tale-speech-database-for-norwegian

      [2] https://ai.nb.no/datasets/
    • armcat 8 hours ago
      Just want to say amazing work. It's really pushing the envelope of what is possible to run locally on everyday devices.
  • derHackerman 11 hours ago
    I read this, then realized I needed a browser extension to read my long case study aloud, so I made a browser interface for it and put this together:

    https://github.com/lukasmwerner/pocket-reader
    • laszbalo 6 hours ago
      You can do the same thing with Firefox's Reader Mode. On Linux you have to set up speech-dispatcher to use your favorite TTS as a backend. Once it is set up, there will be an option to listen to the page.
      • mentalgear 6 hours ago
        Firefox should integrate that in their Reader Mode (the default system voices are often very un-listenable). It would seem like an easy win, and it's a non-AI feature so not polarising.
        • laszbalo 4 hours ago
          Not sure about macOS or Windows, but on Linux Firefox uses speech-dispatcher, which is a server, and Firefox is the client. Speech-dispatcher then delegates the text to the correct TTS backend. It basically runs a shell command, either sending the text to a TTS HTTP server using curl, or piping it to the standard input of a TTS binary.

          Speech-dispatcher commonly uses espeak-ng, which sounds robotic but is reportedly better for visually impaired users, because at higher speeds it is still intelligible. This allows visually impaired users to hear UI labels more quickly. Non-visually-impaired users generally want natural-sounding voices and to use TTS the same way they would listen to podcasts or a bedtime story.

          With this system, users are in full control and can swap TTS models easily. If Firefox shipped a model and, two weeks later, a smaller, newer, or better one appeared, that work would become obsolete very quickly.
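          For the curious, a minimal sketch of the client side of that server/client split, assuming the python3-speechd bindings are installed (module names depend on your local config):

              import speechd

              client = speechd.SSIPClient("demo")    # connects to the speech-dispatcher server
              client.set_output_module("espeak-ng")  # pick whichever backend your config defines
              client.set_rate(20)                    # -100..100, higher is faster
              client.speak("Hello from speech-dispatcher")
              client.close()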
  • lukebechtel 16 hours ago
    Nice!

    Just made it an MCP server so claude can tell me when it's done with something :)

    https://github.com/Marviel/speak_when_done
    • tarcon 8 hours ago
      macOS already has some great intrinsic TTS capability, as the OS seems to include a natural-sounding voice. I recently built a similar tool to just run the "say" command as a background process. Had to wrap it in a Deno server. It works, but with Tahoe it's difficult to consistently configure that one natural voice rather than the subpar voices downloadable in the settings. The good voice seems to be hidden somehow.
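      A minimal sketch of the same idea without the server wrapper, assuming macOS's built-in `say` is on PATH (`say -v '?'` lists the installed voices):

          import subprocess

          def speak_async(text: str, voice: str | None = None) -> subprocess.Popen:
              """Run macOS's `say` as a background process; returns immediately."""
              cmd = ["say"]
              if voice:
                  cmd += ["-v", voice]  # e.g. "Samantha"
              cmd.append(text)
              return subprocess.Popen(cmd)

          speak_async("Build finished.")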
      • supriyo-biswas 7 hours ago
        > The good voice seems to be hidden somehow.

        How am I supposed to enable this?
        • tarcon 4 hours ago
          My mistake, seems like I was referring to the Siri voice, which seems to be the default. It sounds good. It is selectable and, to my surprise, even configurable in speed, pitch and volume, in the OS Accessibility settings -> System Voice -> click on the (i) symbol. (macOS Tahoe)
      • tylerdavis 13 hours ago
        Funny! I made one recently too using piper-tts! https://github.com/tylerdavis/speak-mcp
      • codepoet80 13 hours ago
        I just set up Pushover to send a message to my phone for this exact reason! Trying out your server next!
  • armcat 17 hours ago
    Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and even set up my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!

    [1] https://github.com/acatovic/ova
    • gropo 16 hours ago
      Kokoro is better for TTS by far.

      For voice cloning, Pocket TTS is walled so I can't tell.
      • seunosewa 14 hours ago
        Chatterbox-turbo is really good too. Has a version that uses Apple's GPU.
      • echelon 15 hours ago
        What are the advantages of PocketTTS over Kokoro?

        It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine-tunable. More scripts and extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.

        I couldn't tell an audio quality difference.
        • hexaga 10 hours ago
          Kokoro is fine-tunable? Speaking as someone who went down the rabbit hole... it's really not. There's no training code available (as of last time I checked), so you need to reverse engineer everything. Beyond that, the model is not good at doing voices outside the existing voicepacks: simply put, it isn't a foundation model trained on internet-scale data. It is made from a relatively small set of focused, synthetic voice data. So, a very narrow distribution to work with. Going OOD immediately tanks perceptual quality.

          There's a bunch of inference stuff though, which is cool I guess. And it really is a quite nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of training code, sharp boundary effects, etc.
        • jamilton 15 hours ago
          Being able to voice clone with PocketTTS seems major; it doesn't look like there's any support for that with Kokoro.
          • echelon 14 hours ago
            Zero-shot voice clones have never been very good. Fine-tuned models hit natural speaker similarity and prosody in a way zero-shot models can't emulate.

            If it were a big model trained on a diverse set of speakers that could remember how to replicate them all, then zero-shot would be a potentially bigger deal. But this is a tiny model.

            I'll try out the zero-shot functionality of Pocket TTS and report back.
        • jhatemyjob 13 hours ago
          Less licensing headache, it seems. Kokoro *says* it's Apache-licensed. But it has eSpeak-NG as a dependency, which is GPL, which brings into question whether or not Kokoro is actually GPL. PocketTTS doesn't have eSpeak-NG as a dependency, so you don't need to worry about all that BS.

          Btw, I would *love* to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.
          • jcelerier 2 hours ago
            If it depends on eSpeak-NG code, the complete product is 100% GPL. That said, if you are able to change the code to remove the eSpeak dependency, then the rest would revert to non-GPL (or even if it's a build-time option that you can disable, like FFmpeg with --enable-gpl).
          • miki123211 11 hours ago
            Kokoro only uses eSpeak for text-to-phoneme (AKA G2P) conversion.

            If you could find another compatible converter, you could probably replace eSpeak with it. The data could be a bit OOD, so you may need to fiddle with it, but it should work.

            Because the GPL is outdated and doesn't really consider modern gen AI, what you could also do is generate a bunch of text-to-phoneme pairs with eSpeak and train your own transformer on them. This would free you from the GPL license completely, and the task is easy enough that even a very small model should be able to do it.
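            A minimal sketch of generating such pairs, assuming espeak-ng is installed (-q suppresses audio output, --ipa prints the phonemes):

                import subprocess

                def g2p(text: str) -> str:
                    """Convert text to IPA phonemes with espeak-ng. The GPL binary is
                    only run offline to build training data; it never ships."""
                    out = subprocess.run(
                        ["espeak-ng", "-q", "--ipa", text],
                        capture_output=True, text=True, check=True,
                    )
                    return out.stdout.strip()

                corpus = ["the quick brown fox", "hello world"]
                pairs = [(line, g2p(line)) for line in corpus]  # text -> phoneme training pairs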
    • amrrs 17 hours ago
      Thanks for sharing your repo, looks super cool. I'm planning to try it out. Is it based on MLX or just HF transformers?
      • armcat 16 hours ago
        Thank you, just transformers.
  • mgaudet 15 hours ago
    Eep.

    So, on my M1 Mac, did `uvx pocket-tts serve`. Plugged in:

    > It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

    (beginning of A Tale of Two Cities)

    But the problem is Javert skips over parts of sentences! E.g., it starts:

    > "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

    Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,".

    Which... doesn't exactly inspire faith in a TTS system.

    (Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)
    • Paul_S 7 hours ago
      All the models I tried have similar problems. When trying to batch a whole audiobook, the only way is to run it, then run a model to transcribe and check you get the same text.
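      A minimal sketch of that round-trip check, assuming the openai-whisper package and some TTS step that has already written each chunk to a wav file:

          import re
          import whisper  # pip install openai-whisper

          model = whisper.load_model("base")

          def normalize(s: str) -> str:
              """Lowercase and strip punctuation so trivial differences don't flag."""
              return re.sub(r"[^a-z0-9 ]+", "", s.lower()).strip()

          def chunk_ok(source_text: str, wav_path: str) -> bool:
              """Transcribe the generated audio and compare it to the source text."""
              heard = model.transcribe(wav_path)["text"]
              return normalize(heard) == normalize(source_text)

          # Re-generate any chunk where chunk_ok(...) returns False.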
    • vvolhejn 7 hours ago
      Václav from Kyutai here. Thanks for the bug report! A workaround for now is to chunk the text into smaller parts where the model is more reliable. We already do some chunking in the Python package. There is also a fancier way to do this chunking that ensures the stitched-together parts continue well (teacher-forcing), but we haven't implemented that yet.
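      For illustration, a rough sketch of naive sentence-level chunking (the package's actual chunking logic may differ):

          import re

          def chunk_sentences(text: str, max_chars: int = 300) -> list[str]:
              """Split on sentence boundaries, then pack sentences into chunks
              of at most max_chars so each TTS call stays short."""
              sentences = re.split(r"(?<=[.!?])\s+", text)
              chunks, current = [], ""
              for s in sentences:
                  if current and len(current) + len(s) + 1 > max_chars:
                      chunks.append(current)
                      current = s
                  else:
                      current = f"{current} {s}".strip()
              if current:
                  chunks.append(current)
              return chunks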
    • sbarre 13 hours ago
      Yeah, Javert mangled up those sentences for me as well; it skipped whole parts and then also moved words around:

      - "its noisiest superlative insisted on its being received"

      Win10, RTX 5070 Ti
    • small_scombrus 12 hours ago
      Using your first text block, 'Eponine' skips "we had nothing before us" and doesn't speak the final "that some of its noisiest".

      I wonder what's going wrong in there.
    • memming 8 hours ago
      Interesting; it skipped "we had everything before us," in my test. Yeah, not a good sign.
  • singpolyma3 16 hours ago
    Love this.

    It says MIT license, but then the readme has a separate section on prohibited use that maybe adds restrictions that make it nonfree? Not sure of the legal implications here.
    • CGamesPlay 15 hours ago
      For reference, the MIT license contains this text: "Permission is hereby granted... to deal in the Software without restriction, including without limitation the rights to use". So the README containing a "Prohibited Use" section definitely creates a conflicting statement.
    • jandrese 15 hours ago
      The "prohibited uses" section seems to be basically "not to be used for crime", which probably doesn't have much legal weight one way or another.
      • mips_avatar 11 hours ago
        I think the only restriction that seems problematic is not being able to clone someone's voice without permission. I think there's probably a valid case for using it for satire.
      • WhyNotHugo 12 hours ago
        You might use it for something illegal in one country, and then leave for another country with no extradition… but you've lost the license to use the software and can be sued for copyright infringement.
    • syockit 8 hours ago
      From my understanding, the code is MIT, but the model isn't? What constitutes "Software" anyway? Aren't resources like images, sounds and the like exempt from it (hence covered by usual copyright unless separately licensed)? If so, in the same vein, an ML model is not part of the "Software". By the way, the same prohibition is repeated on the Hugging Face model card.
    • Buttons840 15 hours ago
      Good question.

      If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?
      • ethin 14 hours ago
        If memory serves, the license is the ultimate source of truth on what is allowed or not. You cannot add some section that isn't in the text of the license (at least in the US and other countries that use similar legal systems) on some website and expect it to hold up in court, because the license doesn't include that text. I know of a few other bigger-name projects that try to pull these kinds of stunts because they don't believe anyone is going to actually read the text of the license.
        • HenrikB 11 hours ago
          The copyright holder can set whatever license they want, including writing their own.

          In this case, I'd interpret it as them making up a new license based on MIT, where the addendum makes it non-MIT, something else. I agree with what others said; this "new" license has internal conflicts.
          • kaliqt 9 hours ago
            The license is clearly defined. It would be misleading, possibly fraudulent, for them to then override the license elsewhere.

            Simply, it's MIT licensed. If they want to change that, they have to remove that license file OR clearly update it to be a modified version of MIT.
      • IshKebab 7 hours ago
        I think if they took you to court for cloning someone's voice without permission, they would probably lose, because this conflict makes the terms unclear.
    • iamrobertismo 15 hours ago
      Yeah, I don't understand the point of the prohibited use section at all, seems like unnecessary fluff.
  • butz 1 hour ago
    How large is the model, and is it possible to train it to read other languages, not only English?
    • butz 28 minutes ago
      After pip install pocket-tts, all dependencies are 7.4 GB. And it generates at 2x speed on CPU. Neat!
  • Evidlo 10 hours ago
    How feasible would it be to build this project into a small static binary that could be distributed? The dependencies are pretty big.
    • homarp 9 hours ago
      You can track this issue: https://github.com/mmwillet/TTS.cpp/issues/127
  • Imustaskforhelp 14 hours ago
    Perhaps I haven't been talking to voice models that much, but the ChatGPT voice always felt weird and off because I was thinking it goes to a cloud server and everything. From Pocket TTS I discovered unmute.sh, which is open source, is I think from the same company as Pocket TTS, and can I think use Pocket TTS as well.

    I saw some agentic models at 4B or similar which can punch above their weight, or even some basic models. I can definitely see them in the context of a home lab without costing too much money.

    I think at least unmute.sh is similar to / competes with ChatGPT's voice model. It's crazy how good and (effective) open source models are from top to bottom. There's basically just about anything for almost everyone.

    I feel like the only true moat might exist in coding models. Some are pretty good, but it's the only industry where people might pay 10x-20x more for the best (minimax/z.ai subscription fees vs Claude Code).

    It will be interesting to see if we get another DeepSeek moment in AI which might beat Claude Sonnet or similar. I think DeepSeek has DeepSeek 4, so it will be interesting to see how/if it can beat Sonnet.

    (Sorry for going offtopic)
  • akx 8 hours ago
    It's pretty good. And for once, a software-engineering-ly high-quality codebase, too!

    All too often, new models' codebases are just a dump of code that installs half the universe in dependencies for no reason, etc.
  • dust42 16 hours ago
    Good quality, but unfortunately it is single-language, English only.
    • jiehong 6 hours ago
      Agreed.

      I think they should have added the fact that it's English-only in the title, at the very least.
      • dust42 6 hours ago
        Yes, apart from voice cloning nothing really new. Kokoro has been out for a long time and supports at least a few languages other than English. Also there is Supertonic TTS and there is Soprano TTS. The latter is developed by a single guy, while Kyutai is funded with 150M€.

            https://github.com/supertone-inc/supertonic
            https://github.com/ekwek1/soprano

        No affiliation with either.
    • phoronixrly 16 hours ago
      I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual *and* dynamically switch between languages pretty much per word.

      Cool tech demo though!
      • bingaweek 13 hours ago
        This is a great illustration that nothing you ever do will be good enough without people whining.
        • phoronixrly 3 hours ago
          Excuse me for pointing out that yet another LLM tech demo is presented to our attention.
      • kamranjon 16 hours ago
        That's a pretty crazy requirement for something to be "useful", especially something that runs so efficiently on CPU. Many content creators from non-English-speaking countries can benefit from this type of release by translating transcripts of their content to English and then running it through a model like this to dub their videos in a language that can reach many more people.
        • phoronixrly 15 hours ago
          You mean youtubers? Who would then have to (manually) synchronise the text to their video, especially when YouTube apparently offers voice-to-voice translation out of the box, to my and many others' annoyance?
          • littlestymaar 8 hours ago
            YouTube's voice-to-voice is absolutely horrible though. Having the ability for youtubers to clone their own voice would make it much, much more appealing.
        • ethin 14 hours ago
          Uh, no? This is not at all an absurd requirement? Screen readers literally do this all the time, with voices made the classic way speech synthesizers are built, no AI required. eSpeak is an example, or MS OneCore. The NVDA screen reader has an option for automatic language switching, as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.
          • kube-system 11 hours ago
            They didn't say it was a crazy requirement. They said it was crazy to consider it useless without meeting that requirement.
            • ethin 10 hours ago
              That doesn't really change what I said, though. It isn't crazy to call it useless without some form of automatic language switching either, given that old-school synthesis has been able to do it for 20 years or so.
              • echoangle 6 hours ago
                How does state of the art matter when talking about usefulness? Is old-school synthesis useless?
        • Levitz 15 hours ago
          But it wouldn't be for those who "speak exclusively English"; rather, for those who speak English. Not only that, but it's also common to have the system language set to English, even if one's language is different.

          There are about 1.5B English speakers on the planet.
          • phoronixrly 15 hours ago
            Let's indeed limit the use case to the system language, say of a mobile phone.

            You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names to the English alphabet does not make them understandable when spoken by TTS. Not to mention localised foreign names, which then are completely mangled by transliterating them to English.

            You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model first before passing the data to the English-only TTS software.

            You're driving, and one of your friends Signals you. Your phone UI is in English, so you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

            But let's say you have a TTS model that supports your local language natively. Well, because those '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...

            And mind you, these are just very simple use cases for TTS. If you delve into use cases for people with limited sight, who experience the entire Internet and all mobile and desktop applications (often having poor localisation) via TTS, you see how mono-lingual TTS is mostly useless and would be swapped for a robotic old-school TTS in a flash...

            > only that but it's also common to have system language set to English

            Ask a German whether their system language is English. Ask a French person. I can go on.
            • VMG 19 minutes ago
              > Ask a German whether their system language is English. Ask a French person. I can go on.

              I'm German but my system language is English.

              Because translations often suck, are incomplete or inconsistent.
            • numpad0 9 hours ago
              If you don't speak the local language, you can't decode spoken local-language names anyway. Your speech subsystems can't lock and sync to an audio track in a language you don't speak, let alone transliterate or pronounce it.

              Multilingual doesn't mean language-agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like you can make; make install docker, after which you can attach/detach into/out of alternate environments while on the terminal to do things or take notes in and out.

              People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting this, especially at a young age, can lead a person into a "semi-lingual" or "double-limited" state where they are not so fluent or intelligent in any particular language.

              And so, criticizing someone for not devoting significant resources to an omnilingual TTS doesn't make much sense.
              • phoronixrly 7 hours ago
                > If you don't speak the local language, you can't decode spoken local-language names anyway

                This is plainly not true.

                > Multilingual doesn't mean language-agnostic. We humans are always monolingual, just multi-language hot-swappable if trained

                This and the analogy make no sense to me. Mind you, I am trilingual.

                I also did not imply that the *model* itself needs to be multilingual. I implied that the software that uses the model to generate speech must be multilingual and support language change detection and switching mid-sentence.
      • numpad0 10 hours ago
        > it must be multilingual and dynamically switch between languages pretty much per word

        Not obviously satire, so interjecting: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.
        • koakuma-chan 9 hours ago
          You can speak one language, switch to another language for one word, and continue speaking in the previous language.
          • numpad0 7 hours ago
            But that's my point. You'll stop, switch, speak, stop, switch, resume. You're not going to say "I was in 東京 yesterday" as a single continuous sentence. It'll have to be broken up into three separate sentences spoken back to back, even for humans.
            • jiehong 6 hours ago
              > "I was in 東京 yesterday"

              I think it's the wrong example, because this is actually very common if you're a Chinese speaker.

              Actually, people tend to say the names of the cities in their own countries in their native language.

              > I went to Nantes [0], to eat some kouign-amann [1].

              As a French speaker, both [0] and [1] will be spoken the French way on the fly in the sentence, while the other words are in English. Switching happens without any pause whatsoever (because there is really only one single way to pronounce those names in my mind, no thinking required).

              Note that with speech recognition, it is fairly common to have models understanding language switches within a sentence, as with Parakeet.
            • polshaw 6 hours ago
              I think this is totally wrong. When both parties speak multiple languages, this happens all the time. You see this more with English being the loaner more often than the borrower, due to the reach the language has. Listen to an Indian or Filipino speak for a while; it's interspersed with English words ALL the time. It happens less in English, as there is no universally known specific other language, but it does happen sometimes when searching for a certain, je ne sais pas.
              • akshitgaur2005 6 hours ago
                Not really, most multilinguals switch between languages so seamlessly that you wouldn't even notice it! It has even given birth to new "languages", take for example Hinglish!!
        • knowitnone3 14 hours ago
          I'm Martian so everything you create better support my language on day 1
        • echelon 15 hours ago
          English has more users than all but a few products.
  • Paul_S 7 hours ago
    The speed of improvement of TTS models reminds me of the early days of Stable Diffusion. Can't wait until I can generate audiobooks without infinite pain. If I were an investor I'd short Audible.
    • asystole 6 hours ago
      An all-TTS audiobook offering is just about as appealing as an all-Stable-Diffusion picture gallery (that is, not at all).
      • echoangle 6 hours ago
        Isn't it more like an art gallery of prints of paintings? The primary art is the text of the book (like the painting in the gallery); TTS (and printing a copy) are just methods of making the art available.
        • 306bobby 4 hours ago
          I think it can be argued that audiobooks add to the art through the tone and inflection of the reader.

          To me, what you're saying is the same as saying the art of a movie is in the script, and the video is just the method of making it available. And I don't think that's a valid take.
          • fluoridation 2 hours ago
            No, that's an incorrect analogy. The script of a movie is an intermediate step in the production process of a movie. It's generally not meant to be seen by any audience. The script, for example, doesn't contain any cinematography, soundtrack, or performances by actors. Meanwhile, a written work is a complete expressive work ready for consumption. It doesn't contain a voice, but that's because the intention is for the reader to interpret the voice into it. A voice actor can do that, but that's just an interpretation of the work. It's not one-to-one, but it's not unlike someone sitting next to you in the theater and telling you what they think a scene means.

            So yes, I mostly agree with GP. An audiobook is a different rendering of the same subject. The content is in the text, regardless of whether it's delivered in written or oral form.
    • everyday7732 5 hours ago
      It's not perfect, but I already have a setup for doing this on my phone. Add SherpaTTS and Librera Reader to your phone (both available free on F-Droid).

      Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice option, but there are several to choose from). Add an ebook to Librera Reader and open it. There's an icon with a little person wearing headphones, which lets you send the text continuously to your phone's TTS, using just local processing on the phone. I don't have the latest phone, but mine is able to process it faster than the audio is read, so the audio doesn't stop and start.

      The voice isn't totally human-sounding, but it's a lot better than the Microsoft Sam days, and once you get used to it the roboticness fades into the background and I can just listen to the story. You may get better results with Kokoro (I couldn't get it running on my phone) or similar TTS engines and a more powerful phone.

      One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it generates the audio, and you can pause it, read in silence for a while yourself, and later set it going from a new point.
    • gempir 6 hours ago
      I feel like TTS is one of the areas that has evolved the least. Small TTS models have been around for 5+ years and they've only gotten incrementally better. Giants like ElevenLabs make good-sounding TTS, but it's not quite human yet, and the improvements get smaller with each iteration.
    • rowanG077 7 hours ago
      Wouldn't Audible be perfectly positioned to take advantage of this? They have the perfect setup to integrate this into their offering.
      • Manfred 6 hours ago
        It seems more likely that people will buy a digital copy of the book for a few bucks and then run the TTS themselves on devices they already own.
        • howdareme9 6 hours ago
          Not likely at all; people pay for convenience. They don't want to do that.
        • pantalaimon 4 hours ago
          eBooks are much more expensive than an Audible subscription though.
          • potatoman22 3 hours ago
            I wouldn't say so. Audible gives you 1 book a month for $15. Most e-books I see are around $10.
  • smallerfish 3 hours ago
    Hopefully the browsers will improve their built-in TTS soon. It's still pretty unusable unless you really need it.
  • g947o 2 hours ago
    I wonder if this could be adapted into an app that can run completely offline?
    • dhruvdh 2 hours ago
      Try `uvx pocket-tts serve`
  • aki237 5 hours ago
    This is impressive.

    I just tried some sample verses; it sounds natural.

    But there seems to be a bug, maybe? Just for fun, I asked it to read the Real Slim Shady lyrics. It always seems to add one extra "please stand up" in the chorus. Anyone see that?
    • gabrieldemarm 1 hour ago
      Hello, Gabriel from Kyutai here. Maybe it's related to the way we chunk the text? Can you post an issue on GitHub with the exact text and voice? I'll take a look.
  • britannio 6 hours ago
    This is impressive, but in a sample I tried, it switched language on the second paragraph. I'm on an M4 Pro MacBook.

    https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f2af8c8
  • exceptione 3 hours ago
    Question: can anyone recommend a TTS that automatically recognizes emotion from the text itself?
    • fluoridation 3 hours ago
      Chatterbox does something like that. For example, if the input is

      "so and so," he <verb>

      and the verb is not just "said", but "chuckled", or "whispered", or "said shakily", the output is modified accordingly; or if there's an indication that it's a woman speaking, it may pitch up during the quotation. It also tries to guess emotive content from textual content: if a passage reads angry, it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case is, imagine someone trying to psych themselves up who says internally "come on, Steve, stand up and keep going"; it'll read it in a deeper voice, like it was being spoken by a WW2 sergeant to a soldier.
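      For reference, a rough sketch of driving it from Python, based on the resemble-ai/chatterbox README (the exaggeration knob scales how hard it leans into the implied emotion):

          import torchaudio as ta
          from chatterbox.tts import ChatterboxTTS  # pip install chatterbox-tts

          model = ChatterboxTTS.from_pretrained(device="cuda")

          text = '"Come on, Steve, stand up and keep going," he whispered.'
          # Higher exaggeration pushes the inferred emotion harder; a lower
          # cfg_weight tends to slow and steady the delivery.
          wav = model.generate(text, exaggeration=0.7, cfg_weight=0.3)
          ta.save("out.wav", wav, model.sr)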
  • agentifysh 9 hours ago
    Just added it to my codex plugin that reads a summary of what it finished after each turn, and I am spooked! Runs well on my MacBook, much better than Samantha!

    https://github.com/agentify-sh/speak/
  • donpdonp 9 hours ago
    It'd be nice to get some idea of what kind of hardware a laptop needs to be able to run this voice model.
  • anonymous344 3 hours ago
    Doesn't seem to know the Thai language. Can anybody suggest a Thai TTS?
  • lykahb 10 hours ago
    It'd be great if it supported stdin and stdout for text and wav. Then it could get piped right into afplay.
    • gabrieldemarm 5 hours ago
      Gabriel from Kyutai here. We do support outputting wav to stdout. We don't support reading text from stdin, but that should be easy enough. Feel free to drop a pull request!
  • OfflineSergio 12 hours ago
    This is amazing. The audio feels very natural and it's fairly good at handling complex text-to-speech tasks. I've been working on WithAudio (https://with.audio). Currently it only uses Kokoros. I need to test this a bit more, but I might actually add it to the app. It's too good to be ignored.
  • syntaxing 16 hours ago
    Is there something similar for STT? I'm using Whisper distill models and they work ok. Sometimes it gets what I say completely wrong.
    • daemonologist 16 hours ago
      Parakeet is not really more accurate than Whisper, but it's much faster - faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use NeMo though, or mess around with third-party conversions. (Also has a big brother, Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
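      For reference, a minimal sketch of running it through NeMo, following the usage shown on the model card (assumes nemo_toolkit[asr] is installed):

          import nemo.collections.asr as nemo_asr

          # Downloads the checkpoint from Hugging Face on first use.
          asr_model = nemo_asr.models.ASRModel.from_pretrained(
              model_name="nvidia/parakeet-tdt-0.6b-v3"
          )

          # Transcribe one or more 16 kHz mono wav files.
          output = asr_model.transcribe(["audio.wav"])
          print(output[0].text)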
      • satvikpendem 16 hours ago
        Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.
      • jokethrowaway 6 hours ago
        Parakeet feels much more accurate in practice than Whisper; it was a real "a-ha" moment for me.

        Of course, English only.
    • phoronixrly 16 hours ago
      From the other day: https://github.com/cjpais/Handy
  • tschellenbach 16 hours ago
    It's cool how lightweight it is. Recently added support to Vision Agents for Pocket. https://github.com/GetStream/Vision-Agents/tree/main/plugins/pocket
  • GaggiX 17 hours ago
    I love that everyone is making their own TTS model, as they are not as expensive as many other models to train. Also, there are plenty of different architectures.

    Another recent example: https://github.com/supertone-inc/supertonic
    • andai 16 hours ago
      In-browser demo of Supertonic with WASM:

      https://huggingface.co/spaces/Supertone/supertonic-2
    • coder543 16 hours ago
      Another one is Soprano-1.1.

      It seems like it is being trained by one person, and it is surprisingly natural for such a small model.

      I remember when TTS always meant the most robotic, barely comprehensible voices.

      https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano_1180m_released_95_fewer_hallucinations/

      https://huggingface.co/ekwek/Soprano-1.1-80M
    • nowittyusername 10 hours ago
      Thanks for the heads up, this looks really interesting and the claimed speed is nuts.
    • nunobrito 17 hours ago
      Thank you. Very good suggestion, with code available and bindings for so many languages.
  • aidenn0 11 hours ago
    I'm sure I'm being stupid, but every voice except "alba" I recognize from Les Miserables; is there a character I'm forgetting?
    • vvolhejn 7 hours ago
      Václav from Kyutai here. Yes, the original naming scheme was from Les Miserables, glad you noticed! We just stuck with Alba because that's the real name of the voice actor who provided the voice sample to us (see https://huggingface.co/kyutai/tts-voices); the other ones are either from pre-existing datasets or given anonymously.
  • _ache_ 12 hours ago
    It's very impressive! I mean, it's better than the other <200M TTS models I've encountered.

    In English it's perfect, and it's so funny in other languages. It sounds exactly like someone who doesn't actually speak the language but gets by anyway.

    I don't know why *Fantine* is just better than the others in other languages. Javert seems to be the worst.

    Try Jean in Spanish, « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » ("It's small enough to fit in your pocket!"); it sounds a lot like someone who doesn't understand the language.

    Azelma in French, « C'est suffisament petit pour tenir dans ta poche. », is very good. I mean, half of the words have a Québécois accent, half a French one, but hey, it's correct French.

    But it doesn't understand Italian. (Però non capisce l'italiano.)
  • indigodaddy 14 hours ago
    Perfect timing, this is exactly what I'm looking for for a fun little thing I'm working on. The voices sound good!
  • maxglute 9 hours ago
    Would be nice if the preview supported variable speed.
  • grahamrr 12 hours ago
    Voices sound great! I see the sample rate can be adjusted; is there any way to adjust the actual speed of the voice?
  • Zardoz84 7 hours ago
    I'm missing the old days, when connecting an SP0256 to the Spectrum and making it speak looked like magic.
  • fuzzer37 19 hours ago
    Haven't we had TTS for like 20+ years? Why does AI need to be shoved into it all of a sudden? Total waste of electricity.
    • rhdunn 6 hours ago
      Using neural nets (machine learning) to train TTS voices has been around a long time.

      [1] (2016, https://arxiv.org/abs/1609.03499) WaveNet: A Generative Model for Raw Audio

      [2] (2017, https://arxiv.org/abs/1711.10433) Parallel WaveNet: Fast High-Fidelity Speech Synthesis

      [3] (2021, https://arxiv.org/abs/2106.07889) UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

      [4] (2022, https://arxiv.org/abs/2203.14941) Neural Vocoder is All You Need for Speech Super-resolution
  • oybng 16 hours ago
    > If you want access to the model with voice cloning, go to https://huggingface.co/kyutai/pocket-tts and accept the terms, then make sure you're logged in locally with `uvx hf auth login`

    lol
    • andhuman 9 hours ago
      I've tried the voice cloning and it works great. I added a 9s clip and it captured the speaker pretty well.

      But don't make the mistake I did and use an HF token that doesn't have access to read from repos! The error message said that I had to request access to the repo, but I had already done that, so I couldn't figure out what was wrong. Turns out my HF token only had access to inference.
  • snvzz 17 hours ago
    Relative to AmigaOS translator.device + narrator.device, this sure seems bloated.