For the video of this, click here.
Time and time again it has been proven that, in the long run, scale beats any performance gain we get from implementing smart heuristics; this idea is known as "The Bitter Lesson". The idea that we could build intelligence ourselves is now a thing of the past; instead, we rely on the fact that pouring enough energy (compute) into these neural networks will let them reach intelligence. It remains a mysterious phenomenon though: how could such simple rules (like gradient descent + backpropagation following a reward function) and a lot of energy lead to such complexity? The answer to this question lies all around us: life itself is a system just like this. In physics, we call these systems dissipative systems.
Think of evolution, for example. The emergence of any complex organism around us is the product of a simple mechanism: natural selection. No one had to design these complex creatures; the universe itself created such complexity. When we look at life, intelligence, or any complex system for that matter, we can deduce a couple of prerequisites for its emergence:
- There needs to be selection: selection means finding the 'best' solution for a given selection criterion. In natural selection, the criterion is fitness: the genes (or alleles, to be specific) with the highest fitness win out. In neural networks, the optimizer searches the loss surface for the weights with the lowest loss. Even society tries to find the best companies, workers, and ideas through capitalism.
- There needs to be sufficient diversity: mutations in genes are what allow natural selection to work. If all genes were the same, competition would not be able to select the best (they would all be equally good). The emergence of complex biological structures has to happen either stepwise or leap-wise. For example, before we evolve eyes, we might start with small mutations that give us photon receptors, then another that forms a dome-shaped cell on top to concentrate light on the receptor, and so on, until we reach the complexity of the eye. Some structures, however, do not lend themselves to iterative improvement and instead need leap-wise jumps; this is the case when multiple elements all need to be in place before anything is functional. We can relate this to a neural network stuck in a local minimum with steep walls: we need a large step size or enough stochasticity to leap ourselves out of the local minimum and into a more beneficial state.
- Most overlooked is that we need energy: energy is power delivered over time (energy = power × time). The power source of life is the sun: it supplies enough energy for complex systems to emerge. Without energy, selection and diversity would not happen. Without the sun, life would be impossible, not just in a biological sense but in a physical sense. This is because life can be seen as a dissipative system (https://journals.sagepub.com/doi/10.1177/1059712319841306?icid=int.sj-full-text.similar-articles.5), and for a dissipative system to reach an optimum state, it needs energy. With enough power and time, the system accumulates more and more energy, moving closer to its optimum state. For selective and diverse systems like natural selection, this means reaching the genes with the highest fitness. For intelligence, this means reaching the highest form of understanding. (The toy sketch after this list puts the three ingredients together.)
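To make these three prerequisites concrete, here is a minimal, purely illustrative sketch (the bit-string "genome", the fitness function, and the budget are all made up, not taken from any real system) in which mutation supplies diversity, a fitness sort supplies selection, and a fixed evaluation budget plays the role of energy:

```python
import random

TARGET = [1] * 20                      # the "optimum" the system can reach

def fitness(genome):
    # Selection criterion: how many bits match the target.
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome, rate=0.05):
    # Diversity: random mutations keep the population from collapsing.
    return [1 - g if random.random() < rate else g for g in genome]

def evolve(energy_budget=10_000, pop_size=50):
    # "Energy" here is simply the number of fitness evaluations we can afford.
    population = [[random.randint(0, 1) for _ in range(len(TARGET))]
                  for _ in range(pop_size)]
    evaluations = 0
    while evaluations < energy_budget:
        scored = sorted(population, key=fitness, reverse=True)
        evaluations += len(population)
        survivors = scored[: pop_size // 2]                      # selection
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    best = max(population, key=fitness)
    return best, fitness(best)

if __name__ == "__main__":
    best, score = evolve()
    print(f"best fitness: {score}/{len(TARGET)}")
```

Cut any one of the three ingredients in this toy (no sorting, a zero mutation rate, or a tiny budget) and the population stops improving, which is the whole point of the list above.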
Through this lens, it's not hard to see why deep learning works: it's a system with selection, diversity, and energy. If our deep learning system is selecting for the right thing, the diversity is high enough, and the energy is high enough, we should theoretically reach an optimal understanding.
The more general the selection procedure, the more energy is needed. With a rather constrained search space, as in specialist AI, selection does not need that much energy. If we try to make a robot learn to walk through reinforcement learning, it costs far less compute if we teach it to first move its left leg, then plant its foot, then move the right leg, and so on. If we constrain the search space by specifying subgoals, the search space is much smaller and the robot will converge much quicker with much less compute. However, we trade this for generality and creativity: the robot might never learn a new, more efficient way of walking if we constrain it to reach each subgoal of walking.
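As a hedged illustration of that trade-off (the state fields and point values below are invented, not taken from any real robot or benchmark), compare a subgoal-shaped reward with a broad, general one:

```python
def shaped_reward(state):
    # Constrained search: reward each hand-designed subgoal of walking.
    # Converges with little compute, but can only find the gait we encoded.
    reward = 0.0
    if state["left_leg_forward"]:
        reward += 1.0
    if state["left_foot_planted"]:
        reward += 1.0
    if state["right_leg_forward"]:
        reward += 1.0
    return reward

def general_reward(state):
    # Broad search: reward only forward progress. The search space is far
    # larger (more "energy" needed), but novel gaits are not ruled out.
    return state["distance_travelled"]
```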
This is what we see over time: the more compute that becomes available, the broader the reward functions get. This is how we moved from specialist AI to generalist AI; the difference is the scope of the reward function. Instead of saying "optimize for the best score on chess", we say "optimize for the best prediction of the next word". This reward function is so general and so broad that AI can learn almost every skill imaginable. This, however, is not just ingenuity; it is the result of the increase in compute that allows us to define broader reward functions.
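For a sense of how simple yet broad that selection criterion is, here is a minimal sketch of the next-token objective, assuming PyTorch (the vocabulary size, shapes, and random data are placeholders standing in for a real model and corpus):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 8, 4
logits = torch.randn(batch, seq_len, vocab_size)          # model's predictions
targets = torch.randint(0, vocab_size, (batch, seq_len))  # actual next tokens

# A single cross-entropy number over "what word comes next" is the entire
# selection criterion, yet minimizing it rewards any skill that helps
# predict text.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```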
Extrapolating these results, we might wonder what the next 'step' might be toward an even more general reward function. Maybe something like "make humans happy" is so general that the AI could find truly novel and creative ways to reach this goal. It is not feasible to do this now, however, as the search space is far too big given its generality, but it might be something future models can do.
Another way in which we can make the reward function more general is by saying: "optimize for the best neural network weights + architecture". Instead of predefining the architecture ourselves, we could use some kind of evolutionary algorithm that mutates and selects the best-performing architectures while simultaneously evolving those architectures' weights. This is something Google (Using Evolutionary AutoML to Discover Neural Network Architectures) has already done, and although it showed great success, they admit that computationally it is just not practical yet.
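As a rough sketch of that idea, not Google's actual AutoML method, the loop below mutates both a hidden-layer width (the "architecture") and the weights of a tiny network, selecting on a made-up regression task; every name and number in it is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X.sum(axis=1))                          # toy regression target

def new_individual(hidden=8):
    # An individual = an architecture choice (hidden width) plus its weights.
    return {"hidden": hidden,
            "w1": rng.normal(size=(4, hidden)) * 0.5,
            "w2": rng.normal(size=(hidden, 1)) * 0.5}

def loss(ind):
    pred = np.tanh(X @ ind["w1"]) @ ind["w2"]
    return float(np.mean((pred.ravel() - y) ** 2))

def mutate(ind):
    child = {"hidden": ind["hidden"], "w1": ind["w1"].copy(), "w2": ind["w2"].copy()}
    if rng.random() < 0.2:                         # architecture mutation
        child["hidden"] = max(2, child["hidden"] + int(rng.integers(-2, 3)))
        child["w1"] = rng.normal(size=(4, child["hidden"])) * 0.5
        child["w2"] = rng.normal(size=(child["hidden"], 1)) * 0.5
    child["w1"] += rng.normal(size=child["w1"].shape) * 0.1   # weight mutation
    child["w2"] += rng.normal(size=child["w2"].shape) * 0.1
    return child

population = [new_individual() for _ in range(20)]
for generation in range(100):                      # the compute ("energy") budget
    population.sort(key=loss)                      # selection
    survivors = population[:10]
    population = survivors + [mutate(survivors[int(rng.integers(len(survivors)))])
                              for _ in range(10)]

best = min(population, key=loss)
print("best hidden width:", best["hidden"], "loss:", loss(best))
```

Even at this toy scale, every generation re-evaluates the entire population from scratch, which hints at why the approach becomes impractical at realistic scales, exactly the compute bottleneck the Google authors point to.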
All in all, through this lens of selection, diversity, and energy, we can get an intuition for the emergence of intelligence and even life itself. We can predict that as the energy in a system increases, so does the complexity of the system. As compute keeps increasing, we can expect more complex models. This increase in compute will also allow for different selection functions, ones more general than those we have now, allowing more creativity and value from AI over time. The scaling law is more than just a law for AI; it's a reflection of a law of nature, one described by the physics concept of dissipative systems.