4 comments

  • ks2048 71 days ago
    > Our work was heavily inspired by KyutaiTTS and Sesame

    I wish they’d describe the technical details of the differences between this and the other TTS systems they were “inspired by”.

    With so many projects like this, I will just have to assume they are vibe-coded clones made to get some publicity unless there are more technical details.
    • echelon 70 days ago
      Sesame is an impressive real-time conversational audio-to-audio model you can talk to on their website [1]. But it's closed source. They released some components, but nothing you could use to duplicate their work.

      Sesame is what this team (and lots of teams) want to build. I know another team trying to build a real-time local NSFW girlfriend you can talk to. They're convinced they can reach $100M ARR quickly if they crack it and make it customizable.

      KyutaiTTS provides a lot of the ingredients for this work, but afaik it isn't conditioned for audio-to-audio or for any of the streaming components.

      [1] https://app.sesame.com/
    • popalchemist 70 days ago
      This is a streaming version of their previously released Dia TTS, which was an original work. You may want to recalibrate your assumption.
  • Neywiny 70 days ago
    Not sure if it's an artifact of their streaming approach, but their intro demo has exclamation marks and question marks, and the intonation through the sentence just doesn't fit. The sentence is vocalized normally, with only the last word having that exclamation or question sound. Maybe we need that Spanish upside-down question mark at the start to help it.
  • woodson 71 days ago
    Looks very similar to Kyutai’s models, given that it uses the same neural audio codec (Mimi), Depformer module, etc.