Why is it harder to make AI-generated music than images? Initially I thought it was simply because there isn't enough will to pour the same amount of resources into developing high-quality models, or not enough training data available. However, as I thought about it more I came to the conclusion that music generation is intrinsically harder to solve. Explaining why involves some math, but bear with me. I've studied this stuff, so please don't dismiss me. I'd love to hear other people's opinions on this since I haven't seen anybody address this issue.
When image generation occurs, the model starts with abstract shapes and iterates over time, adding more detail with each pass. Eventually there is enough detail that we can perceive individual, separate objects in the image. But we only perceive separate objects because of the boundaries. For example, if you look at a picture of an apple on a table, you only "see" an apple because of the clear boundary between the apple and the background. If that boundary were blurred, the image would very quickly become incoherent. The point is this: we parse an image in our mind by using the boundaries of the objects. The edges of an object are what differentiate it from its environment. This doesn't just apply to visual input but to all senses. For sound, differentiating two sounds requires some characteristic that differs between them (usually their frequency spectra). In the same way that we differentiate objects in an image by their boundaries (changes in color, texture, lighting, etc.), there needs to be some difference between sounds to perceive "structure" in the music. This is where the problem lies.
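To make the "boundary" idea concrete, here is a minimal sketch (my own toy illustration, not taken from any particular model, assuming only NumPy) that marks edges in a grayscale image with a simple finite-difference gradient. The point is that for images, a boundary is just a large local change in pixel values, and you can find or edit it directly in the same domain the image lives in:

```python
import numpy as np

def edge_map(img: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Mark pixels where the local change in brightness is large.

    img: 2-D array of grayscale values in [0, 1].
    Returns a boolean array that is True along object boundaries.
    """
    # Brightness differences between neighbouring pixels (horizontal and vertical).
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy[:-1, :] = img[1:, :] - img[:-1, :]

    # Gradient magnitude: large exactly where brightness changes abruptly,
    # i.e. at the boundary between an object and its background.
    grad = np.sqrt(dx**2 + dy**2)
    return grad > threshold

# Toy example: a bright "apple" (disk) on a dark "table" (background).
yy, xx = np.mgrid[0:64, 0:64]
apple = ((xx - 32)**2 + (yy - 32)**2 < 15**2).astype(float)

edges = edge_map(apple)
print("edge pixels:", edges.sum())  # non-zero only along the disk's outline
```

The sketch is deliberately simple, but it shows why image generation and image correction can both happen in pixel space: the perceptual "object" lives in the same domain as the data.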
Our brain makes sense of sound via the frequency spectrum and NOT the time-dependent wave. The "boundaries" of sounds (i.e. separating the drums, bass, vocals, etc. when you hear a song) are encoded in the frequency spectrum. This means generating music is not simply generating a wave, but rather generating a frequency spectrum that makes sense to our ears, which is much more difficult. For example, if you have drums, bass, and vocals playing simultaneously, all those frequencies are mixed up together. We only perceive them as "separate" because our ear effectively does a Fourier transform. This means we can't use the same approach that image generation uses, because we can't work in the domain where the generation is happening (the time domain); instead we have to work in the frequency domain. If an AI-generated image has something wrong with it, you can correct it on the image itself. But if, say, the drums are too loud in a piece of music, there's no way to separate the drums out of the raw wave; you first have to convert to the frequency domain, which introduces uncertainty via the time-frequency uncertainty principle. It seems you can get more accurate results with the wavelet transform instead of the Fourier transform, but the point stands. It's simply more mathematically complex to generate music than images, because we parse sounds in the inverse domain of the input (time to frequency). This is why AI-generated music sounds "noisy" and "uneven", like a low-quality recording: the generation occurs in the time domain, where all the frequencies are mixed together, but our perception is based on the frequency domain. Adding a boundary in the frequency domain means non-trivially altering the time domain.
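Here is a rough sketch of the mixing point (again my own toy example, just NumPy, with made-up "bass" and "vocal" frequencies): once two sources are summed, the time-domain samples are a single unlabeled wave, but a Fourier transform shows them as separate peaks. And the moment you window the signal to localise those peaks in time, frequency resolution degrades, which is the uncertainty trade-off I mentioned above:

```python
import numpy as np

sr = 8000                      # sample rate in Hz (arbitrary for the demo)
t = np.arange(sr) / sr         # one second of time samples

# Two "instruments" played simultaneously: a low "bass" tone and a higher "vocal" tone.
bass  = np.sin(2 * np.pi * 110 * t)   # 110 Hz
vocal = np.sin(2 * np.pi * 880 * t)   # 880 Hz
mix = bass + vocal             # in the time domain they are just summed samples

# In the time domain there is no "bass track" or "vocal track" to point at:
# every sample is a single number containing both sources at once.
print(mix[:5])

# The Fourier transform separates them again: two clear peaks.
spectrum = np.abs(np.fft.rfft(mix))
freqs = np.fft.rfftfreq(len(mix), d=1/sr)
peaks = freqs[np.argsort(spectrum)[-2:]]
print("detected components (Hz):", sorted(peaks))   # ~[110.0, 880.0]

# Time-frequency trade-off: a short analysis window localises *when* a sound
# happens but smears *which* frequency it is. Frequency resolution is
# roughly sr / window_length Hz per bin.
for window in (128, 1024, 8000):
    print(f"window {window:5d} samples -> ~{sr / window:7.2f} Hz per frequency bin")
```

So to "turn down the drums" you first have to pick a window, and that choice already costs you either time precision or frequency precision; there is no equivalent of just repainting a region of pixels.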