Is there a way for a Speech-to-Text model to differentiate between speakers?

For instance, if I'm recording an interview between two people and using something like Whisper to transcribe the discussion, can it attribute each line of dialogue to the right speaker? This seems like it would be a fairly common feature, but I'm not sure if it exists.

Doesn't have to be Whisper per se, but is there a known S2T model or solution for this?
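(For context: this task is usually called speaker diarization. Whisper itself only transcribes and doesn't label speakers, but a common approach is to run a separate diarization model, e.g. pyannote.audio, and merge its speaker turns with Whisper's timestamped segments. A minimal sketch of that merging step, using hardcoded hypothetical data in place of real model output:)

```python
# Merge transcript segments with diarization speaker turns.
# The two lists below are hypothetical stand-ins: Whisper's
# result["segments"] provides (start, end, text) tuples, and a
# diarization model such as pyannote.audio provides
# (start, end, speaker) turns.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for start, end, text in segments:
        best = max(turns, key=lambda t: overlap(start, end, t[0], t[1]))
        labeled.append((best[2], text))
    return labeled

# Hypothetical example data (not from a real recording):
segments = [(0.0, 3.2, "So, how did you get started?"),
            (3.5, 9.8, "Well, it began back in 2015...")]
turns = [(0.0, 3.4, "SPEAKER_00"), (3.4, 10.0, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```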

submitted by /u/jrstelle