I used OpenAI's Whisper (https://github.com/openai/whisper, "Robust Speech Recognition via Large-Scale Weak Supervision") to transcribe the Common Voice dataset (https://commonvoice.mozilla.org) for one language. I noticed that the 'tiny' model hallucinates a lot, the larger 'small' model almost does not hallucinate at all, and the 'base' model, which sits between 'tiny' and 'small' in size (Whisper's sizes run tiny < base < small < medium < large), hallucinates more than 'small'. The overall performance of the 'small' model is also better than that of both 'tiny' and 'base'. As a side note, the instances in this dataset are single sentences of roughly 5-10 seconds of audio each.
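For reference, my setup is essentially the loop below. The clip paths and reference sentences are placeholders, and using jiwer for WER is just my choice of metric, not anything prescribed by Whisper or Common Voice:

```python
import whisper
from jiwer import wer  # one convenient way to score the transcripts

# Placeholder clips and reference sentences; in practice these come
# from the Common Voice release TSVs (clip path + validated sentence).
clips = ["clip_0001.mp3", "clip_0002.mp3"]
references = ["reference sentence one", "reference sentence two"]

for size in ["tiny", "base", "small"]:
    model = whisper.load_model(size)
    # All decoding settings left at their defaults, as in my experiments;
    # the language is therefore auto-detected per clip.
    hypotheses = [model.transcribe(clip)["text"].strip() for clip in clips]
    print(size, wer(references, hypotheses))
```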
I am mostly interested in your thoughts on why hallucination varies so much with model size, and whether a larger model is actually guaranteed to perform better and hallucinate less. I did not change the temperature or any other decoding settings when transcribing. I can imagine that a larger model might overfit, which could cause something like this, but I would like to know what you think might explain the differences in performance and hallucination across model sizes.
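In case it helps: "not changing any settings" means I called transcribe() with its defaults. As I understand the library, the default temperature there is a fallback schedule (0.0 rising to 1.0) that kicks in when the built-in compression-ratio or log-probability checks fail, so some clips may effectively get decoded by sampling. The pinned-settings call below is a follow-up I could try, not something I have results for:

```python
import whisper

model = whisper.load_model("base")

# What I actually ran: all defaults. The default temperature is the
# fallback tuple (0.0, 0.2, 0.4, 0.6, 0.8, 1.0): if greedy decoding
# fails the quality checks, transcribe() retries at higher
# temperatures, where sampling can produce hallucinated text.
result_default = model.transcribe("clip_0001.mp3")

# A possible follow-up: pin greedy decoding and drop the conditioning
# on previously decoded text (mostly relevant for long audio, but
# cheap to rule out even for 5-10 second clips).
result_pinned = model.transcribe(
    "clip_0001.mp3",
    temperature=0.0,
    condition_on_previous_text=False,
)
print(result_default["text"])
print(result_pinned["text"])
```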
As context: I am doing this research for my master's thesis, so any ideas are welcome!