Tutorial: Benchmarking Bark text-to-speech on 26 Nvidia GPUs – Reading out 144K recipes

In this project, we benchmarked Bark text-to-speech across 26 different consumer GPUs.

The goal: To get Bark to read 144K food recipes from Food.com's recipe dataset.

You can read the full tutorial here: https://blog.salad.com/bark-benchmark-text-to-speech/

Included: architecture diagram, data preparation, inference server setup, queue worker, container group setup, and compiling the results

Code blocks are included in the tutorial.
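The queue-worker step can be sketched roughly as below. This is a minimal illustration under assumptions, not the tutorial's actual code: the `synthesize` stub stands in for a real call to the Bark inference server (which in practice would be an HTTP request), and the function and variable names are made up for this example.

```python
from queue import Queue, Empty

def synthesize(text: str) -> bytes:
    # Placeholder for the real Bark inference call (e.g. a POST to a
    # local inference server); here we just return fake audio bytes.
    return b"AUDIO:" + text.encode()

def run_worker(jobs: Queue, results: list) -> None:
    # Pull recipe texts off the queue until it is empty, synthesize
    # each one, and record the result. A real worker would also handle
    # retries and upload the audio to object storage.
    while True:
        try:
            text = jobs.get_nowait()
        except Empty:
            break
        results.append((text, synthesize(text)))

jobs = Queue()
for recipe in ["Banana bread", "Tomato soup"]:
    jobs.put(recipe)

results = []
run_worker(jobs, results)
print(len(results))  # one result per recipe
```

In a distributed setup, many such workers run in parallel across the GPU fleet, each draining jobs from a shared queue, which is what lets the benchmark scale to 144K recipes.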

Words per dollar for each GPU:

[Chart: words-per-dollar comparison for each GPU]

Although the latest cards are indeed much faster at inference than older cards, the cost-performance sweet spot sits with the lower-end 30xx-series cards.
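The words-per-dollar metric driving that comparison is straightforward to compute; a sketch is below. The numbers in the example are hypothetical, not taken from the benchmark results.

```python
def words_per_dollar(words_generated: int, hours: float, price_per_hour: float) -> float:
    # Words spoken per dollar of GPU rental cost:
    # total words divided by total rental cost for the run.
    return words_generated / (hours * price_per_hour)

# Hypothetical example: a card renting at $0.10/hour
# that reads 50,000 words in one hour.
print(words_per_dollar(50_000, 1.0, 0.10))  # 500000.0
```

This is why a slower but much cheaper card can win: halving throughput while cutting the hourly price by more than half raises words per dollar.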

Conclusions

  • As is often the case, there’s a clear trade-off here between cost and performance. Higher end cards are faster, but their disproportionate cost makes them more expensive per word spoken.
  • The model’s median speed is surprisingly similar across GPU types, even though the peak performance can be quite different.
  • No matter what GPU you select, you should be prepared for significant variability in performance.
  • Qualitative: While Bark's speech is often impressively natural-sounding, it does have a tendency to go off script sometimes.

We’ve also made available audio from 1000 top-rated recipes, paired with the script it was trying to read.

submitted by /u/SaladChefs