Renormalization group on neural scaling laws (part 1)
Neural scaling laws are still not fully explained. Assessing whether a network generalizes requires validating it on data it has never seen. In many cases the resulting validation loss follows a power law; that is, its behavior is scale-free. Given the history of successful applications of renormalization group (RG) theory to a variety of scale-free and critical phenomena, here we investigate how RG techniques can provide a systematic framework for describing the scale-free aspects of loss functions. Most of these RG techniques rely on the following simple scaling behavior of the validation loss $$\lambda^{c}L\left(n,d\right)=L\left(\lambda^{a}n,\lambda^{b}d\right),$$ which is purely empirical (Kaplan et al. 2020). If we assume the loss is differentiable in its arguments, differentiating both sides with respect to $\lambda$ and then setting $\lambda=1$ gives the following PDE $$\left(c-an\frac{\partial}{\partial n}-bd\frac{\partial}{\partial d}\right)L\left(n,d\right)=0,$$ which has...