I'm not sure if this is the best place to ask, but I've been experimenting with RVC to make voice models. I have pretty high-quality .wav files and have created .pth files/indexes with several different datasets. Oddly, a few of my models turned out really well with completely different amounts of data: one did decently with only 2 minutes of audio at 500 epochs, and another with a much larger dataset of around 20-25 minutes did well at the same epoch count. I then tried to refine one of them by giving it a lot more data, around 40 minutes, and at 250 epochs the result sounds nothing like the target voice (I'd read about overtraining and tried to avoid it since the dataset was much larger). I've also tried both splitting the dataset into 10-12 second chunks and using one larger .wav file containing the same voice clips, and I personally noticed no difference between the two.
I'm very confused about how many epochs to use and whether there is such a thing as too much data. I'd also like to know whether, in anyone else's experience, splitting the data into segments is better or worse than using one block file of 20+ minutes. Oh, and does it make any difference if the .wav is saved as stereo vs. mono? Does stereo perhaps introduce more "noise" for the model to pick up instead of it focusing on how the voice sounds?
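In case it helps clarify what I mean by splitting into chunks and converting to mono, here's a minimal sketch of the kind of preprocessing I'm describing, using pydub. The file paths and the 10-second chunk length are just placeholders for illustration, not anything RVC itself requires:

```python
import os
from pydub import AudioSegment  # assumes pydub is installed (pip install pydub)

CHUNK_MS = 10 * 1000  # 10-second chunks (pydub works in milliseconds)

# Load the source recording and downmix stereo -> mono
audio = AudioSegment.from_wav("dataset/source.wav")
audio = audio.set_channels(1)

# Slice into fixed-length chunks and write them out for the training dataset folder
os.makedirs("dataset/chunks", exist_ok=True)
for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk = audio[start:start + CHUNK_MS]
    if len(chunk) < 2000:  # skip very short tail fragments
        continue
    chunk.export(f"dataset/chunks/chunk_{i:03d}.wav", format="wav")
```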
Also, I've been using Kits AI to test whether a voice model sounds anything like what I want, since they have an easy way to upload for free and generate short audio clips, from talking to singing. My first test, which used only 50 epochs and less data, turned out much better than my 100+ epoch runs with more data.