13 comments

  • wzdd 3 hours ago
    It’s a nice engineering approach, but I’m interested in the motivation. Um and ah is distracting in a transcript, where you can naturally pause to take in information; in speech however it can serve as a focusing point to indicate the next part is important. See https://medium.com/better-humans/dont-worry-about-saying-um-effective-public-speaking-includes-filler-words-e56304416a90 for example. The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

    Disfluencies aren’t necessarily bad even if the word starts with “dis”!
    • toast0 2 hours ago
      Having heard radio interviews with and without 'internal editing' to remove ums and ahs, most of the time I'd rather have the edited version. It's more concise and focused, and I find it easier to comprehend. Too many ums and ahs and my mind wanders, and if it's radio, I can't easily go back to try again. When I've listened to podcasts or audiobooks, I could never easily go back a little to try again either, and I gave up on them (even though I have some content I really want to listen to, it's too frustrating, so it's not happening). But I'm sure other people have different preferences.

      I also don't care for writing that could have been made a lot more concise. It's a lot of work to make things shorter, but I think it's worthwhile.
    • NooneAtAll3 21 minutes ago
      > in speech however it can serve as a focusing point to indicate the next part is important

      it's... the exact opposite?

      the main (attempted) use for ummms is to keep the speech going despite the pause. And the main complaint is exactly that it ruins the focus and doesn't give respite.
    • siriaan 3 hours ago
      Occasional ums and ahs are fine but when every other phrase starts with a long aaaaah it can be pretty unpleasant to listen to.
      • sans_souse 2 hours ago
        So, if this project's source audio were Beavis and Butt-Head, you would be enthused?
    • mrob 1 hour ago
      > The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

      If you speak with disfluencies, you probably didn't sufficiently rehearse your speech. If you didn't rehearse enough, you probably didn't put much effort into writing it either, so why should I put much effort into listening? It's the same principle as AI slop.
  • supernes 3 hours ago
    This approach seems kind of backwards to me. Why try to detect everything *except* the thing you're trying to remove instead of either sampling a few uhs and ums and treating them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow) or fine-tuning a model to detect them specifically for full automation?
  • heroprotagonist 4 hours ago
    Not to promote something, but Wispr Flow does that for me automatically if I trigger a setting for it.

    While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.

    And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.

    Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donors to the open source project.

    The extra features smooth out the subtitle editing process very substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the CLI parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.
    • dotancohen 55 minutes ago
      I'd love to hear more about subtitle generation. Specifically, can you label different speakers? I'd be using this for meeting transcription. Thank you.
  • lavaman131 2 hours ago
    This is great. I've tried out automated podcast editing tools before and they cut too aggressively in my experience. Now that you've gotten the alignment snapping working cleanly for 'um' and 'ah', what are you thinking about doing next with this? Are you thinking of expanding the tool?
  • rindalir 5 hours ago
    This is fascinating! I'm going to try this on a certain clip from Jurassic Park.
  • alok-g 3 hours ago
    I would love to see support for videos and removal of custom filler words (I say 'basically' and 'like' a lot and have so far failed to improve myself on this).
  • cadamsdotcom 5 hours ago
    What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.

    Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
  • sciencesama 5 hours ago
    There is an Ah-Counter in Toastmasters!! This is the software that helps!!
  • npodbielski 3 hours ago
    I think it is harder to remove those from your own speech. I have been doing that for a few months now and I still slip back into it when I am in a hurry or stressed.
  • cryptoz 5 hours ago
    Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into *adding* ums and erms into their text-to-speech model a while back. AI puts it in, AI helps take it out.
  • sublinear 5 hours ago
    Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.

    A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!
  • dougcalobrisi 6 hours ago
    This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find-and-replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.

    https://github.com/dougcalobrisi/erm
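    For anyone curious what "snapping the cuts to silence" could look like, here is a minimal Python sketch. It is not taken from the erm repository: the function names (`frame_rms`, `snap_to_quietest`), the 10 ms frame size, and the 300 ms search window are all assumptions chosen for illustration. It takes a rough timestamp (such as one from Whisper) and moves it to the quietest nearby frame, measured by RMS energy.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def snap_to_quietest(samples, sample_rate, t_guess, window_s=0.3, frame_s=0.01):
    """Move a rough cut time to the centre of the quietest frame nearby.

    Hypothetical helper: searches +/- window_s around t_guess (seconds).
    Ties are broken in favour of the frame closest to the original guess.
    """
    frame_len = max(1, round(sample_rate * frame_s))
    rms = frame_rms(samples, frame_len)
    if not rms:
        return t_guess  # not enough audio to analyse; keep the guess

    guess_idx = round(t_guess * sample_rate) // frame_len
    radius = max(1, round(window_s * sample_rate / frame_len))
    lo = max(0, guess_idx - radius)
    hi = min(len(rms), guess_idx + radius + 1)
    if lo >= hi:
        return t_guess  # guess lies outside the audio; keep it unchanged

    best = min(range(lo, hi), key=lambda i: (rms[i], abs(i - guess_idx)))
    return (best + 0.5) * frame_len / sample_rate


if __name__ == "__main__":
    # Synthetic audio at 1 kHz: loud, then a 100 ms gap of silence, then loud.
    audio = [1.0] * 500 + [0.0] * 100 + [1.0] * 400
    # A guess of 0.45 s falls inside the loud part; the snapped time moves into the gap.
    print(snap_to_quietest(audio, 1000, 0.45))
```

    A real pipeline would presumably snap both edges of each filler word and then pass the resulting keep-ranges to ffmpeg; that part is left out here.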
  • bagvader 5 hours ago
    [flagged]