Any methods for achieving close to ElevenLabs quality and inference speed with a local model?

My upcoming master's thesis (on the use of AI models for believable NPCs in video games) will involve text-to-speech. So far ElevenLabs has been my go-to for TTS, but the pricing model is quite inconvenient since it's a monthly subscription instead of pay-per-use. I'm sure some of you here are knowledgeable about TTS, so it would be great if anyone could point me in a good direction. I want to use TTS to generate natural, human-sounding conversational voices. Ideally, I could fine-tune the model on different voice profiles to get a wide range of voices. I have only started to look into options outside ElevenLabs, but I figured I should ask here before diving deep, so I can avoid wasting time!

If this is not achievable locally, it would be great to know whether there are any methods for hosting a model on a cloud compute platform, so I can at least pay per use instead of dealing with the monthly subscription that comes with ElevenLabs.

Thank you!

submitted by /u/timidavid350