Renormalization group on neural scaling laws (part 1)

Neural scaling laws are still not fully explained. Networks that generalize to other data must be validated on completely unseen data sets. In many cases the curve of these validation losses follows a power law, that is, the behavior is scale independent. Given the history of successful application of renormalization group (RG) theory to describe a variety of scale-free and critical phenomena, here we investigate how RG techniques can provide a systematic framework for describing scale-free aspects of loss functions.

Most of these tricks with RG theory rely on the following simple behavior of the validation loss $$\lambda^{c}L\left(n,d\right)=L\left(\lambda^{a}n,\lambda^{b}d\right),$$ which is completely empirical (Kaplan et al. 2020). If we assume the parametrization of the loss is differentiable, differentiating with respect to $\lambda$ and setting $\lambda=1$ gives the following PDE $$\left(c-an\frac{\partial}{\partial n}-bd\frac{\partial}{\partial d}\right)L\left(n,d\right)=0,$$ which has solutions for general differentiable functions $W$ and $\overline{W}$ $$L(n,d)=n^{c/a}W\Big(\frac{d}{n^{b/a}}\Big), \;\;\;\; L(n,d)=d^{c/b}\overline{W}\Big(\frac{n}{d^{a/b}}\Big).$$ The arguments $d/n^{b/a}$ and $n/d^{a/b}$ are exactly the combinations left unchanged by the rescaling $n\to\lambda^{a}n$, $d\to\lambda^{b}d$. These two forms are equivalent ways of writing every differentiable parametrization of the loss that respects the scaling behavior. We have thus verified the "analytical friendliness" of this simple RG-inspired statement for the neural scaling laws. Of course, the parametrizations are simply tools for describing this fascinating behavior; however, this allows us not only to build a unique class of parametrizations, but also to make said tools more precise.
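As a quick numerical sanity check, here is a minimal sketch that takes a loss of the form $n^{c/a}W(d/n^{b/a})$ and tests both the scaling relation and the PDE. The exponents `A`, `B`, `C` and the function `W` are hypothetical choices made only for illustration; any differentiable `W` should behave the same way.

```python
import math

# Hypothetical exponents and scaling function, chosen only for illustration.
A, B, C = 2.0, 3.0, 0.5


def W(x):
    """An arbitrary smooth function; any differentiable W would do."""
    return math.exp(-x) + 0.3 * x**2


def loss(n, d):
    """L(n, d) = n^(c/a) * W(d / n^(b/a))."""
    return n ** (C / A) * W(d / n ** (B / A))


def scaling_residual(n, d, lam):
    """lam^c * L(n, d) - L(lam^a * n, lam^b * d); zero if the law holds."""
    return lam**C * loss(n, d) - loss(lam**A * n, lam**B * d)


def pde_residual(n, d, h=1e-6):
    """(c - a n d/dn - b d d/dd) L, estimated with central differences."""
    dL_dn = (loss(n + h, d) - loss(n - h, d)) / (2 * h)
    dL_dd = (loss(n, d + h) - loss(n, d - h)) / (2 * h)
    return C * loss(n, d) - A * n * dL_dn - B * d * dL_dd


if __name__ == "__main__":
    for n, d, lam in [(1.5, 0.7, 2.0), (4.0, 0.2, 0.5), (0.9, 3.0, 1.3)]:
        print(n, d, lam, scaling_residual(n, d, lam), pde_residual(n, d))
```

Both residuals should vanish up to floating-point and finite-difference error. Replacing the argument of `W` by a combination that is not invariant under the rescaling makes the scaling residual clearly nonzero.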

For example, two assumptions made by Kaplan et al. can be incorporated as follows. Assumption 2 states that the validation loss is proportional to $n^{\alpha_N}$ when $d\ll n$ and to $d^{\alpha_D}$ when $n\ll d$. Assumption 3 states that the loss is analytic at $d=0$. This last assumption partially implies Assumption 2, since it means $W(0)$ is a finite quantity. If we also require that $\overline{W}(0)$ is a finite quantity, then Assumption 2 would be equivalent to Assumption 3. If that is the case, we are prepared to state $c=a\alpha_N$ and $c=b\alpha_D$ by direct use of the scaling PDE.
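To spell out that last step, here is a short sketch, assuming that in each limit the loss reduces to a pure power of a single variable with a nonzero prefactor. Substituting $L\propto n^{\alpha_N}$ and then $L\propto d^{\alpha_D}$ into the scaling PDE gives $$\left(c-an\frac{\partial}{\partial n}-bd\frac{\partial}{\partial d}\right)n^{\alpha_N}=\left(c-a\alpha_N\right)n^{\alpha_N}=0\;\Rightarrow\;c=a\alpha_N,$$ $$\left(c-an\frac{\partial}{\partial n}-bd\frac{\partial}{\partial d}\right)d^{\alpha_D}=\left(c-b\alpha_D\right)d^{\alpha_D}=0\;\Rightarrow\;c=b\alpha_D.$$ Combining the two, $a\alpha_N=b\alpha_D$, so only the ratio $b/a=\alpha_N/\alpha_D$ and the overall normalization of $c$ remain to be chosen.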

This family includes the parametrization used by Kaplan et al., which corresponds to $W(x)=(1+x)^{\alpha_D}$. Said parametrization is analytic at $n=0$ and at $d=0$, which means $W(0)$ and $\overline{W}(0)$ are both finite.
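The correspondence can be checked numerically. The sketch below assumes the loss is written as $\left(n^{\alpha_N/\alpha_D}+d\right)^{\alpha_D}$ in the variables of this post, and it fixes the normalization $c=1$, so that $a=1/\alpha_N$ and $b=1/\alpha_D$. The exponent values are illustrative only and are not fitted to any data.

```python
import math

# Illustrative exponents only; not fitted to any data.
ALPHA_N, ALPHA_D = 0.076, 0.095

# One consistent choice of scaling exponents: c = a*alpha_N = b*alpha_D.
C = 1.0
A = C / ALPHA_N
B = C / ALPHA_D


def kaplan_form(n, d):
    """L(n, d) = (n^(alpha_N/alpha_D) + d)^alpha_D."""
    return (n ** (ALPHA_N / ALPHA_D) + d) ** ALPHA_D


def W(x):
    """Scaling function W(x) = (1 + x)^alpha_D."""
    return (1.0 + x) ** ALPHA_D


def scaling_form(n, d):
    """n^(c/a) * W(d / n^(b/a)), with c/a = alpha_N and b/a = alpha_N/alpha_D."""
    return n ** (C / A) * W(d / n ** (B / A))


def scaling_residual(n, d, lam):
    """lam^c * L(n, d) - L(lam^a * n, lam^b * d) for the Kaplan-type form."""
    return lam**C * kaplan_form(n, d) - kaplan_form(lam**A * n, lam**B * d)


if __name__ == "__main__":
    n, d, lam = 2.0, 0.5, 1.2
    print("two forms agree:", math.isclose(kaplan_form(n, d), scaling_form(n, d)))
    print("scaling residual:", scaling_residual(n, d, lam))
    # Limits: pure power laws in n (d -> 0) and in d (n -> 0).
    print(kaplan_form(n, 1e-12), n**ALPHA_N)
    print(kaplan_form(1e-15, d), d**ALPHA_D)
```

With $W(0)=1$, the two limits reduce to $n^{\alpha_N}$ and $d^{\alpha_D}$, which is what the last two printed pairs compare.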

Considerable effort has been made to break this scaling behavior in favor of something more efficient, such as an exponential decay. If the reader allows me to be bold, this behavior is actually fantastic! Other scientists have also noticed the similarity between this scale invariance and the theory of critical phenomena, where physical systems become scale invariant under critical conditions. In future blog posts I hope to present findings that connect this to physics through analogies with scaling relations.

