🛩️

The Spectral Maximal Update Parameterization in Theory and Practice - December 2025

Kyle R. Chickering - kyrochickering@gmail.com
Work completed while the author was affiliated with the UC Davis LUKA Lab & MBZUAI Institute of Foundation Models


Last updated: 02.15.26


Introduction

The maximal update parameterization (μP) (Yang et al. 2022) has proven itself a valuable tool for reducing pre-training costs at scale. Beyond enabling learning rate transfer across model scales, μP improves training dynamics by ensuring that learning occurs at the same rate across disparate layers. For large-scale training runs a correct μP implementation could save millions of dollars in compute. For researchers and engineers, correctly implementing μP reduces the dimensionality of hyperparameter sweeps and produces more reliable results by getting models closer to compute optimal. Additionally, μP eliminates a common pitfall in the literature where two models are compared but only one is hyperparameter tuned, which can lead to misleading or downright incorrect conclusions about model scaling and behavior.

μP is not merely a theoretical consideration. During training under the standard parameterization, different layers learn at different rates, effectively freezing the embeddings during training and wasting compute on sub-optimal weight updates. When using μP for large networks we can attain lower losses by ensuring all layers learn at the correct rate. μP also helps reduce tuning costs when preparing for large-scale training runs. By enabling zero-shot hyperparameter transfer (see the figure below) we can cheaply tune small models and avoid the expensive extrapolation and validation steps required to find the optimal hyperparameters for large models.

Demonstration of μ-transfer using μP on a NanoGPT model while varying the network width. Shaded region represents the standard deviation of the optimal learning rates. Using μP aligns the optimal learning rates to be constant or nearly constant, and additionally aligns the loss “bowls” to be concentric.

This blog post is not a comprehensive introduction or tutorial on μP; we assume that the reader has some familiarity with μP, either through reading or attempting to read the Tensor Programs V paper (TP-V) (Yang et al. 2022) or through a resource like Dey’s Practitioner’s Guide to μP (Dey et al. 2024). This blog post is intended for audiences who want a practical toolkit for applying μP, and who don’t have time to sort through the developments in the literature over the past several years. In particular we want to extend the practitioner’s toolkit so that they can implement μP models which go beyond the original width-based models studied by Yang. The development of μP now spans nearly half a decade and involves challenging mathematics, rendering much of the theory difficult to access. We view this as unfortunate, and believe that the mathematics and understanding yielded by μP are extremely valuable for ML practitioners. To increase understanding and awareness of μP we present a heuristic approach to μP derivations, based largely on the more recent “spectral perspective” on μP (Yang et al. 2023b).

We argue that this later approach is better suited for understanding and deriving μP. The original μP formulation is derived by studying the updates of the weight matrices through constraints on the outputs of the matrix multiply. Spectral μP instead directly constrains the weight matrices themselves. This will also allow us to directly derive μP scalings for depth. Additionally, the original μP formulation admits a subtle failure mode where all of the conditions for feature learning are satisfied but learning rate transfer does not occur (see below). Spectral μP prevents this failure mode from occurring.

From a practitioner’s perspective, having conditions directly on the weight matrices instead of on the layer outputs allows us to directly understand and debug μP implementations, instead of trying to intuit the behavior of matrices by looking only at the outputs of each layer.

We strive to present a distillation of the mathematical theory, and as such many of our results are not rigorous. However, rigorous statements for our results can be found scattered throughout the literature.

Throughout we discuss practical insights that we have learned through the process of deriving, implementing, and debugging μP. We hope that this guide can serve as a valuable resource for practitioners implementing μP, and we advocate for using either μP or Muon for large-scale training going forward.

Contributions

In this blog post we:

  • Argue that the spectral conditions for feature learning, originally derived in (Yang et al. 2023b), should form the basis of our understanding of μP.
  • Provide a set of heuristics based on random matrix theory and Tensor Programs which allow μP to be applied in contexts beyond the original derivation.
  • Briefly discuss whether or not learning rate and weight decay can be transferred across batch sizes in addition to model sizes.
  • Demonstrate that using the spectral μP perspective eliminates a particular failure case in the standard method of validating a μP implementation: coordinate checking.

A Brief History of μP

The seminal paper in the study of μP is the Tensor Programs V paper (TP-V) (Yang et al. 2022) which presented the successful application of Tensor Programs to improving real-world training dynamics. We briefly note that this is the fifth paper in what is generally considered a six to ten paper series (depending on how you’re counting). We will not focus on the first four papers in this blog (Yang 2019) (Yang 2020a) (Yang 2020b) (Yang & Hu 2020), but emphasize that the μ-transfer theory posed in TP-V crucially builds on Yang’s previously developed theory. The original TP-V paper addresses learning rate transfer across width, and we discuss extensions and attempted extensions of this work below.

It’s been three years since the TP-V paper, and the community has explored improvements in the understanding of μP as well as validation of μP up to the 13B scale (Dey et al. 2023). Anecdotally, several of the large frontier labs are rumored to be using μP at much larger scales than this.

In terms of theoretical developments we choose to emphasize the “spectral μP” paper written by Yang, Simon, and Bernstein (Yang et al. 2023b) which re-formulates μP in terms of spectral weight norms. We will discuss this theoretical perspective in much greater detail below. Yang’s group also attempted to apply μP to the case of depth (Yang et al. 2023a) (TP-VI), but was only able to obtain successful feature learning with residual blocks consisting of a single linear weight matrix, and as such the result does not apply to the practical case of training LLMs. Finally, practitioners should be aware of the Complete-P paper (Dey et al. 2025), which primarily addresses the deficiencies in the TP-VI paper and derives the correct μP parameterization for depth. We consider this paper to be the current state of the art for applying μP in practice. The authors additionally derive scalings for other aspects of LLM training which were left out of the original μP papers. We suggest that the Complete-P paper be used as a reference for implementing μP since it has the most comprehensive table of parameter scalings.

More recently our own group has extended the μP theory to cover the challenging case of grouped query attention (GQA), which required some subtle extensions to the overall theory (Chickering et al. 2025).

Finally, as this blog post was being finished, Soufiane Hayou released the first actual mathematical proof that learning rate transfers under μP (Hayou 2025). This exciting new work finally bridges the gap between the theoretical intuitions of TP-V and a rigorous theory of learning rate transfer.

Spectral μP

The Tensor Programs V paper proposes a method for zero-shot hyperparameter tuning by carefully applying the analysis from Tensor Programs to determine the proper learning rate and initialization schemes for a neural network (Yang et al. 2022). To this end the authors define the concept of feature learning. Let $\bm{h}_t^{(\ell)}$ be the activations of the $\ell$-th layer of the neural network at timestep $t$, and let $\Delta \bm{h}_t^{(\ell)} := \bm{h}_t^{(\ell)} - \bm{h}_{t-1}^{(\ell)}$. If the output of the $\ell$-th layer is $\bm{h}_t^{(\ell)} \in \R^n$, we say that feature learning is occurring for the layer if

||\,\bm{h}_t^{(\ell)}\,||_2=\Theta(\sqrt{n}), \qquad ||\,\Delta \bm{h}_t^{(\ell)}\,||_2=\Theta(\sqrt{n}), \tag{1}

as $n \rightarrow \infty$. The authors show that for a network satisfying feature learning at every layer we should (with some caveats) expect that the learning rate transfers across models of different width $n$. In other words, we can sweep hyperparameters (learning rate) at some $n = n_0$ and use those same hyperparameters (learning rate) for a model with $n = n_*$.

The theory is called maximal (as in maximal update parameterization) because if the layer updates $\Delta \bm{h}_t^{(\ell)}$ were any smaller (asymptotically) we would learn very slowly, but if they were any bigger the training would diverge. Thus, this $\sqrt{n}$ scaling is precisely the fastest we can update our weights without the training diverging.

While groundbreaking, μP and an intuitive understanding of feature learning remained challenging after the publication of TP-V. In a follow-up work, Yang, Simon, and Bernstein provide a more mathematically satisfying explanation and intuition for feature learning, phrased in terms of the weights of a neural network instead of the activations (Yang et al. 2023b). This viewpoint is preferable for several reasons:

  1. The adjustments to the neural network are done through the weights, not through the activations, so a theory based purely on the network weights allows more direct access to the quantities we are scaling / adjusting; it is easier to have a theory based directly on $\bm{W}$ than based tangentially through $\bm{W}\bm{x}$.
  1. The update maximality becomes quite obvious, rather than being obscured behind a matrix-vector product. Practically speaking, using the spectral theory rather than the original theory allows us to avoid some subtle failures in the original conception of μP: we show that the spectral conditions are more stringent than the original conception.
  1. Phrasing the conditions in terms of norms on the weights gives us better intuition for which properties of the network lead to learning rate transfer. It is arguable that this viewpoint helped lead to the development of the Muon optimizer (Jordan et al. 2024).

In particular, Yang et al. prove that conditions on the spectral norm of the weight matrices (see equation (2) below) imply that feature learning in the sense of equation (1) holds.

To follow their argument we consider a dense MLP network with input dimension $d_{\text{in}}$, output dimension $d_{\text{out}}$, and hidden dimension $n$. We consider the case where both $d_{\text{in}}$ and $d_{\text{out}}$ are fixed but we scale the hidden dimension $n$. This setting can be mapped to transformer LLMs and conv-nets with little to no modification, but we choose this simple setting for pedagogical purposes. We then set $n_{\text{in}}^{(\ell)} = d_{\text{in}}$ for the input layer and $n_{\text{in}}^{(\ell)} = n$ for the hidden and output layers, as well as $n_{\text{out}}^{(\ell)} = n$ for the input and hidden layers and $n_{\text{out}}^{(\ell)} = d_{\text{out}}$ for the output layer.

Under this notation, (Yang et al. 2023b) prove that the weights of layer $\ell$ at timestep $t$, given by $\bm{W}^{(\ell)}_t \in \R^{n_{\text{out}}^{(\ell)} \times n_{\text{in}}^{(\ell)}}$, and the updates $\Delta \bm{W}_t^{(\ell)} := \bm{W}_t^{(\ell)} - \bm{W}_{t-1}^{(\ell)}$, should satisfy the following constraints on their spectral norms

||\,\bm{W}_t^{(\ell)}\,||=\Theta\left(\frac{\sqrt{n_{\text{out}}^{(\ell)}}}{\sqrt{n_{\text{in}}^{(\ell)}}}\right), \qquad ||\,\Delta\bm{W}_t^{(\ell)}\,||=\Theta\left(\frac{\sqrt{n_{\text{out}}^{(\ell)}}}{\sqrt{n_{\text{in}}^{(\ell)}}}\right). \tag{2}

Recall that the spectral norm, or induced 2-norm, is given by

||\,\bm{A}\,||:=\sup_{||\,\bm{x}\,||_2=1}\,||\,\bm{A}\bm{x}\,||_2.
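
Condition (2) is straightforward to monitor empirically. Below is a minimal sketch (assuming PyTorch; the helper names and the (out, in) weight layout of torch.nn.Linear are the only assumptions) that reports the spectral norms of a layer's weights and updates, rescaled so that both quantities should stay roughly constant in width under spectral μP.

```python
import torch

def spectral_norm(W: torch.Tensor) -> float:
    # Induced 2-norm (largest singular value) of a 2D weight tensor.
    return torch.linalg.matrix_norm(W, ord=2).item()

def check_spectral_condition(W_ref: torch.Tensor, W_now: torch.Tensor) -> dict:
    # torch.nn.Linear stores weights as (n_out, n_in); W_ref is a reference
    # copy of the weights (e.g. the previous step or the initialization).
    n_out, n_in = W_now.shape
    scale = (n_in / n_out) ** 0.5  # inverse of the Θ(√(n_out/n_in)) target in (2)
    return {
        "W_rescaled": spectral_norm(W_now) * scale,
        "dW_rescaled": spectral_norm(W_now - W_ref) * scale,
    }
```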

In what follows we dispense with full rigor and focus on the essence of the mathematical argument. However we stress that the statements made here can be made fully rigorous, and indeed this is done in (Yang et al. 2023b).

Informally, during neural network training we have the relationship

||\,\bm{W}_t^{(\ell)}\bm{x}\,||_2=\Theta(||\,\bm{W}_t^{(\ell)}\,||\,||\,\bm{x}\,||_2),

from which the implication that $(2) \Rightarrow (1)$ becomes quite clear if we ignore the activation function, since we have assumed recursively that the inputs $\bm{x}$ satisfy the feature learning condition (1).

Spectral μP is Stronger than Feature Learning

The reverse implication is not true. Feature learning in the sense of condition (1) does not imply that the spectral feature learning condition (2) is met.

Consider a weight matrix $\bm{W} \in \R^{n \times n}$ with weight update $\Delta \bm{W}$. Assume that $||\,\bm{W}_0\,|| = \Theta(1)$, but that $||\,\Delta \bm{W}_t\,|| = \Theta(n^{-\alpha})$ for some $\alpha > 0$. Further assume that the inputs to this layer satisfy feature learning, that is $||\,\bm{h}_t\,||_2 = \Theta(\sqrt{n})$ and $||\,\Delta \bm{h}_t\,||_2 = \Theta(\sqrt{n})$. Ignoring the non-linearity, it is clear that this weight matrix does not satisfy the conditions (2) of spectral μP. However, note that

||\,\bm{W}_t\bm{h}_t\,||_2 = ||\,\bm{W}_0\bm{h}_t + \sum_{\tau=1}^t\Delta \bm{W}_\tau \bm{h}_t\,||_2 = \Theta\left(||\,\bm{h}_t\,||_2 + \sum_{\tau=1}^tn^{-\alpha}||\,\bm{h}_t\,||_2\right) = \Theta(\sqrt{n}),

so the output of the layer satisfies the feature learning scaling in the sense of (1). Furthermore

||\,\Delta(\bm{W}\bm{h})_t\,||_2 = ||\,\Delta\bm{W}_t\bm{h}_t + \bm{W}_t \Delta \bm{h}_t\,||_2 = \Theta(||\,\Delta \bm{W}_t\,||\,||\,\bm{h}_t\,||_2 + ||\,\bm{W}_t\,||\,||\,\Delta \bm{h}_t\,||_2) = \Theta(n^{1/2-\alpha} + n^{1/2})=\Theta(\sqrt{n}),\tag{3}

and the updates also satisfy (1). Thus, feature learning is occurring despite the fact that the spectral conditions are not satisfied!

The reason for this failure should be quite clear: if a weight matrix is sub-maximally updated (in this case $||\,\Delta \bm{W}\,|| = o(1)$), then the correct scaling from the previous layer is propagated to the current layer purely through the initialized weights $\bm{W}_0$. This can be taken to the extreme case by setting the weight updates identically to zero, and we see that feature learning in the sense of (1) continues to occur (see below for an empirical demonstration of this fact).

This is an important distinction because the impetus for μP is to maximally train the network. Quite literally, we would like the weight updates specifically to be as maximal as possible. However, this criterion is not actually enforced by condition (1), even though this is the philosophical motivation for a feature learning condition like (1). Thus, we suggest that the spectral feature learning condition is the “correct” perspective for maximal update parameterizations, and that this stronger conception of feature learning should be the basis for theory moving forward.

A Misconception about μP and Time

We briefly address a common misconception that we have encountered when discussing μP. Many readers have read TP-V and come away with the conclusion that μP makes predictions about the size of network activations through time. However, this is manifestly not the case.

Consider the more detailed notation $f(\bm{a}; \bm{b}) = \Theta_{\bm{a}}(A)$, where $f$ is some function depending on the parameters $\bm{a}, \bm{b}$. The notation $\Theta_{\bm{a}}(A)$ means that we have the inequality

c(\bm{b})A(\bm{a})\leqslant f(\bm{a}; \bm{b}) \leqslant C(\bm{b})A(\bm{a}),

where $0 < c(\bm{b}), C(\bm{b}) < \infty$ are constants (in $\bm{a}$) and the inequality is expected to hold for all $\bm{a}$ in some unbounded subset of $\R^d$.

Specifically for μP, we make statements about the spectral norm of the form $||\,\bm{W}_t\,|| = \Theta_{n}(A(n))$, and notably we do not make statements of the form $||\,\bm{W}_t\,|| = \Theta_{n,t}(A(n, t))$. Thus, when we say that $||\,\bm{W}_t\,|| = \Theta(1)$, this simply means that as we scale $n$, the hidden size, we must have

c(t)\leqslant ||\,\bm{W}_t(n)\,||\leqslant C(t).

All that μP says about the upper and lower bounds in time is that, for sufficiently large networks, the bounds cannot depend on the network width. In other words, any width-dependent bounds are sandwiched between upper and lower bounds that depend only on time:

c(t) \leqslant \widetilde{c}(t, n) \leqslant ||\,\bm{W}_t(n)\,||\leqslant \widetilde{C}(t, n)\leqslant C(t).

Coordinate checking across time will not necessarily show constant behavior as it does when we coordinate check across network width.

A Heuristic Toolkit

While the mathematics underlying μP may be both deep and complicated, we have found in practice that deriving μP for new architectures requires insight into neither Tensor Programs nor random matrix theory. Rather, μP derivations can typically be done using a few simple heuristics (which are, of course, ultimately derived from Tensor Programs and random matrix theory!). We present these heuristics, which can be used to understand the observed dynamics of neural network training and allow practitioners to derive novel μP scalings.

At a high level, what we will do is assume that the spectral feature learning conditions (2) ensure hyperparameter transfer, and then systematically examine the weight matrices in the network and make sure that they satisfy these conditions.

In what follows we will use the notion of “typical size” which is used by Yang and is common in the random matrix theory literature (Tao 2012). Readers may instead know this quantity as the first absolute moment and it is given by

a_i = \Theta(A), \quad\Rightarrow\quad \mathbb{E}|a_i|=A,

with $a_i = \Theta(A)$ denoting that $a_i$ is of typical size $A$. This is used to understand what we expect the entries of a random matrix to look like, and is commonly used to heuristically estimate norms of a matrix or vector. As an example computation, consider a random vector $\bm{x}$ whose entries $x_i$ have typical size $p$. We can apply the law of large numbers (LLN) to heuristically estimate

||\,\bm{x}\,||_2^2 = \sum_{i}|x_i|^2 = \Theta\left(n\mathbb{E}|x_i|^2\right) = \Theta(np^2).
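
As a quick numerical illustration of this heuristic, the following sketch (assuming NumPy; the typical size p = 0.1 is arbitrary) compares the measured norm against the LLN estimate as the width grows:

```python
import numpy as np

p = 0.1
for n in [256, 1024, 4096, 16384]:
    x = np.random.normal(0.0, p, size=n)   # entries with typical size p
    lln_estimate = np.sqrt(n * p**2)       # sqrt(n E|x_i|^2) = sqrt(n) * p
    print(n, np.linalg.norm(x), lln_estimate)
```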

With these preliminaries dispensed with, we now introduce our heuristics, which we can use to derive μP in a variety of settings. Our first heuristic regards the scaling of the model gradient as a function of the layer width:

📈

Heuristic 1: Gradients

Gradients scale like

\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)=\begin{cases}\Theta(1),\qquad&\text{$\ell$ is an output layer} \\ \Theta(\frac{1}{n}), \qquad&\text{otherwise}\end{cases}.

Doing our analysis in terms of the spectral norm allows us to use standard theorems from random matrix theory to understand how the size of the weight matrices changes as a function of their input shapes. In particular we can use the Bai-Yin theorem (Bai & Yin 1993) (Yin et al. 1988) which tells us how the spectral norm of a rectangular random matrix scales. We note that there are some additional caveats to applying Bai-Yin, but all of them are satisfied in practice during neural network training:

📈

Heuristic 2: Bai-Yin Theorem

Let $\bm{W} \in \R^{m \times n}$ with entries sampled i.i.d. from $\mathcal{N}(0, \sigma^2)$, then we have

||\,\bm{W}\,||=\sigma(\sqrt{n}+\sqrt{m}) + \text{L.O.T.}

In particular, as $m$ and $n$ grow large we have

||\,\bm{W}\,||=\Theta\left(\sigma(\sqrt{n}+\sqrt{m})\right).

If $\bm{W}$ is instead sampled from a distribution with non-zero mean $\mathcal{N}(\mu, \sigma^2)$ then we have

||\,\bm{W}\,||=\Theta(\mu\sqrt{nm}).
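
Heuristic 2 is easy to verify numerically; here is a small sketch (assuming NumPy, with an arbitrary σ) comparing the measured spectral norm of a Gaussian matrix against the Bai-Yin prediction:

```python
import numpy as np

sigma = 0.02
for m, n in [(256, 1024), (1024, 4096), (4096, 16384)]:
    W = np.random.normal(0.0, sigma, size=(m, n))
    measured = np.linalg.norm(W, ord=2)            # largest singular value
    predicted = sigma * (np.sqrt(n) + np.sqrt(m))  # Bai-Yin leading term
    print(f"{m}x{n}: measured {measured:.3f}, predicted {predicted:.3f}")
```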

Finally, while the internal representations of a neural network do not actually look i.i.d. Gaussian, we have found that treating them as if they are i.i.d. Gaussian is a productive way to derive μP:

📈

Heuristic 3: Playing pretend

μP scalings can be derived by assuming that $\bm{W}_0, \Delta\bm{W}_t$ are sampled i.i.d. from a Gaussian distribution. Furthermore we can always assume that the previous layer quantities $\bm{h}_t$ and $\Delta \bm{h}_t$ are sampled i.i.d. from a Gaussian.

Heuristic 3 is certainly not rigorous; in fact, it isn’t even true! But we find in practice that this assumption is sufficient to derive the first order scalings of the weight spectral norms and to capture the actual training dynamics.

Example Derivations

Network Initialization

The network initialization can be directly read off of equation (2) together with Heuristic 2 (the Bai-Yin theorem), since for a weight matrix $\bm{W} \in \R^{m \times n}$

||\,\bm{W}\,||=\underbrace{\Theta(\sigma(\sqrt{n}+\sqrt{m}))}_{\text{Heuristic 2}}=\underbrace{\Theta\left(\frac{\sqrt{m}}{\sqrt{n}}\right)}_{\text{Equation $(2)$}}.

Solving for the standard deviation gives

\sigma = \frac{\sqrt{m}}{n+\sqrt{nm}}.

If we carefully track which of the dimensions is being scaled, then we arrive at the standard μP initialization rule:

\sigma^\ell = \begin{cases}\Theta(n^{-1}), \qquad &\text{$\ell$ is an output layer}, \\ \Theta(n^{-1/2}), \qquad &\text{$\ell$ is a hidden layer}, \\ \Theta(1), \qquad &\text{$\ell$ is an input layer}\end{cases}.
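
A minimal PyTorch sketch of this rule, expressed relative to a base width so that the standard deviations reduce to a familiar baseline at $n =$ base_n (the function name, the base_n default, and the 0.02 base standard deviation are illustrative assumptions, not prescriptions from the papers):

```python
import math
import torch.nn as nn

def mup_init_(layer: nn.Linear, layer_type: str, n: int, base_n: int = 256) -> None:
    """Initialize a Linear layer following the μP rule above (a sketch)."""
    width_mult = n / base_n
    if layer_type == "input":      # σ = Θ(1): no width scaling
        std = 0.02
    elif layer_type == "hidden":   # σ = Θ(n^{-1/2})
        std = 0.02 / math.sqrt(width_mult)
    elif layer_type == "output":   # σ = Θ(n^{-1})
        std = 0.02 / width_mult
    else:
        raise ValueError(f"unknown layer type: {layer_type}")
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
```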

SGD

Stochastic gradient descent is the easiest optimizer to study. We simply update the weights according to

\bm{W}_t^{(\ell)} = \bm{W}_{t-1}^{(\ell)}-\eta^{(\ell)}\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t),

where $\eta$ is the learning rate hyperparameter. According to the spectral condition (2), this then implies that we must have

||\,\Delta \bm{W}_t^{(\ell)}\,||=\eta^{(\ell)}\,||\,\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)\,||.

Determining the correct scaling for the per-layer learning rates $\eta^{(\ell)}$ then reduces to understanding how the gradient scales in spectral norm. From Heuristic 1 we know how the individual elements of the gradient $\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)$ scale, and thus we compute that

||\,\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)\,||=\begin{cases}\Theta(\sqrt{n}), &\qquad \text{$\ell$ is an output layer}, \\ \Theta(1), &\qquad \text{$\ell$ is a hidden layer}, \\ \Theta(1/\sqrt{n}), &\qquad \text{$\ell$ is an input layer}\end{cases}.

Thus, we can use the spectral feature learning condition (2) to read off the learning rate scalings

\eta^{(\ell)} = \begin{cases}\Theta(n^{-1}), \qquad &\text{$\ell$ is an output layer}, \\ \Theta(1), \qquad &\text{$\ell$ is a hidden layer}, \\ \Theta(n), \qquad &\text{$\ell$ is an input layer} \end{cases}.
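
In code, these per-layer learning rates map naturally onto optimizer parameter groups. A sketch (assuming PyTorch and a hypothetical model that exposes input_layers, hidden_layers, and output_layers; in a real codebase one would match parameters by name):

```python
import torch

def mup_sgd_param_groups(model, base_lr: float, n: int, base_n: int = 256):
    # Per-layer SGD learning rates following the scalings above.
    width_mult = n / base_n
    return [
        {"params": model.input_layers.parameters(),  "lr": base_lr * width_mult},  # Θ(n)
        {"params": model.hidden_layers.parameters(), "lr": base_lr},               # Θ(1)
        {"params": model.output_layers.parameters(), "lr": base_lr / width_mult},  # Θ(n^{-1})
    ]

# optimizer = torch.optim.SGD(mup_sgd_param_groups(model, base_lr=0.1, n=1024), lr=0.1)
```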

Adam

The Adam optimizer (Kingma & Ba 2014) builds on SGD by tracking the first and second moments of the gradients over time

\begin{align*}g_t &= \nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t), \\ m_t &= \beta_1 m_{t-1}+(1-\beta_1)g_t, \\ v_t &= \beta_2v_{t-1}+(1-\beta_2)g_t^2, \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \\ \hat{v}_t&=\frac{v_t}{1-\beta_2^t}, \\ \hat{r}_t:&=\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon},\end{align*}

and then the weight update is given by

\bm{W}_t^{(\ell)}=\bm{W}_{t-1}^{(\ell)}-\eta^{(\ell)}\hat{r}_t.

How large are the Adam optimizer steps $\hat{r}_t$? From Heuristic 1 we know how the gradients scale, and thus

m_t = \Theta(\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)),

due to the linearity of addition. Similarly, the Hadamard square of the gradient has typical size

v_t = \Theta((\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t))^2) \quad \Rightarrow \quad \sqrt{\hat{v}_t}=\Theta(\nabla_{\bm{W}_t^{(\ell)}}L(\bm{W}_t)).

In particular we then expect that the typical size of $\hat{r}_t$ is

\hat{r}_t = \Theta(1), \qquad \varepsilon \ll \sqrt{\hat{v}_t}.

Technically we can be more precise and determine exactly how small $\varepsilon$ must be to get consistent training dynamics. This scaling was first observed in the literature seemingly independently by (Dey et al. 2025) and (Everett et al. 2024). Let $\overline{m}_t = A(n)\hat{m}_t = \Theta(1)$ and $\overline{v}_t = A^2(n)\hat{v}_t = \Theta(1)$, where $A(n)$ captures the heterogeneity of the gradient scaling between layers. After some algebra we arrive at the informal scaling

\hat{r}_t = \frac{\overline{m}_t}{\sqrt{\overline{v}_t} + A(n)\varepsilon}.

This implies that for transferable dynamics to continue, we must induce a dependence $\varepsilon(n)$ so that $A(n)\varepsilon(n) = \Theta(1)$. We find in practice that for small-scale experimentation, setting $\varepsilon$ small (say $\varepsilon < 10^{-10}$) suffices to obtain transferable dynamics without adjusting the Adam $\varepsilon$ parameter. However, for large-scale runs $\overline{m}_t$ and $\overline{v}_t$ may shift significantly in time and violate the μP hypotheses in practice.

Because we know the typical size of the Adam update steps is $\hat{r}_t = \Theta(1)$, we can apply Heuristic 2 to understand that if the weights have shape $\R^{m \times n}$ then the spectral norm is simply $||\,\hat{r}_t\,|| = \Theta(\sqrt{mn})$. Then applying the spectral feature learning conditions (2) we have

\eta^{(\ell)} = \begin{cases}\Theta(n^{-1}), \qquad &\text{$\ell$ is an output layer}, \\ \Theta(n^{-1}), \qquad &\text{$\ell$ is a hidden layer}, \\ \Theta(1), \qquad &\text{$\ell$ is an input layer} \end{cases}.\tag{4}
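
The analogous parameter groups for Adam, under the same hypothetical layer grouping as the SGD sketch above; note the small ε, per the discussion of the ε scaling:

```python
import torch

def mup_adam_param_groups(model, base_lr: float, n: int, base_n: int = 256):
    # Per-layer Adam learning rates following (4).
    width_mult = n / base_n
    return [
        {"params": model.input_layers.parameters(),  "lr": base_lr},               # Θ(1)
        {"params": model.hidden_layers.parameters(), "lr": base_lr / width_mult},  # Θ(n^{-1})
        {"params": model.output_layers.parameters(), "lr": base_lr / width_mult},  # Θ(n^{-1})
    ]

# Keep ε well below the typical size of sqrt(v̂_t):
# optimizer = torch.optim.Adam(mup_adam_param_groups(model, 3e-4, n=1024), eps=1e-12)
```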

AdamW

AdamW builds on the Adam optimizer by additionally adding a weight decay term with decay strength $\lambda^{(\ell)}$ (Loshchilov & Hutter 2017). The Adam weight update is then replaced by

\bm{W}_t^{(\ell)}=(1-\lambda^{(\ell)}\eta^{(\ell)})\bm{W}_{t-1}^{(\ell)}-\eta^{(\ell)}\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}, \tag{5}

which adds dynamical pressure to return the weights to the origin at every step. To ensure that these updates retain their correct size in the spectral norm we must have $\lambda^{(\ell)}\eta^{(\ell)} = \Theta(1)$.

To see why, suppose that $\lambda^{(\ell)}\eta^{(\ell)} = \Theta(n^{-\alpha})$ for $\alpha > 0$; then in the limit $n \rightarrow \infty$ our update rule collapses to the original Adam update and we retain no benefit from using weight decay. On the other hand, if $\lambda^{(\ell)}\eta^{(\ell)} = \Theta(n^\alpha)$, then for $n$ sufficiently large our update rule is

\Delta \bm{W}_t^{(\ell)} \sim n^{\alpha}\bm{W}_{t-1}^{(\ell)},

and our training ceases to depend on the data, i.e. meaningful learning becomes impossible.

These considerations together with the Adam learning rate scaling (4) lead us to conclude that for the weight decay update described in (5) we should scale the weight decay according to

\lambda^{(\ell)} = \begin{cases}\Theta(n), \qquad &\text{$\ell$ is an output layer}, \\ \Theta(n), \qquad &\text{$\ell$ is a hidden layer}, \\ \Theta(1), \qquad &\text{$\ell$ is an input layer} \end{cases}.
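
Continuing the sketch above, the weight decay for each group is chosen so that $\lambda^{(\ell)}\eta^{(\ell)}$ is width-independent. (PyTorch's torch.optim.AdamW multiplies the decay by the learning rate, $\bm{W} \leftarrow \bm{W} - \eta\lambda\bm{W}$, matching (5).)

```python
import torch

def mup_adamw_param_groups(model, base_lr: float, base_wd: float, n: int, base_n: int = 256):
    # Extend the Adam groups with weight decay so that λ·η stays constant in width.
    width_mult = n / base_n
    groups = mup_adam_param_groups(model, base_lr, n, base_n)
    groups[0]["weight_decay"] = base_wd               # input:  η = Θ(1)   → λ = Θ(1)
    groups[1]["weight_decay"] = base_wd * width_mult  # hidden: η = Θ(1/n) → λ = Θ(n)
    groups[2]["weight_decay"] = base_wd * width_mult  # output: η = Θ(1/n) → λ = Θ(n)
    return groups

# optimizer = torch.optim.AdamW(mup_adamw_param_groups(model, 3e-4, 0.1, n=1024), eps=1e-12)
```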

Depth and Complete-P

The spectral μP framework can also be extended to derive the corrected depth scalings that Complete-P arrives at by adding an extra desideratum. We argue that the depth scaling from TP-VI is wrong because its updates are too small in norm. We show that depth-μP does not satisfy the spectral feature learning conditions, and furthermore that assuming the spectral feature learning conditions prevents lazy-learning (Dey et al. 2025).

Thus, we are able to arrive at the correct Complete-P (Dey et al. 2025) scaling without adding additional desiderata, using only the spectral feature learning conditions.

For our setting we consider the residual blocks of a depth-$L$ network given by

\bm{h}^{\ell+1}=\bm{h}^\ell + L^{-\alpha}\mathcal{F}_\ell(\bm{h}^\ell; \bm{W}^\ell), \qquad \ell=1, \cdots, L,

for a residual block $\mathcal{F}_\ell$ with parameters $\bm{W}^\ell$, usually an MLP block. Of course, to study the operator norms we can write this computation as

\bm{G}^\ell:=\left(\bm{I} + L^{-\alpha}\mathcal{F}_\ell(\,\cdot\,; \bm{W}^\ell)\right), \qquad \ell=1, \cdots, L,

so that $\bm{G}^\ell\bm{h}^{\ell} = \bm{h}^{\ell+1}$. Making standard assumptions about composition and lack of cancellation we have

||\,\bm{G}^\ell\,|| = \Theta\left(||\,\bm{G}^{\ell - 1}\,|| + L^{-\alpha}||\,\mathcal{F}_\ell\,||\right).

Recursing this identity gives us the spectral bound

||\,\bm{G}^L\,|| =\Theta\left(1+L^{1-\alpha}\max_\ell||\,\mathcal{F}_\ell\,||\right).

In order for the network outputs to remain stable at initialization, we require that the summation through the entirety of the network at initialization scales correctly, that is

L^{1-\alpha}\max_\ell||\,\mathcal{F}_\ell\,||=\mathcal{O}(1)

in both $L$ and $n$, where $n$ is the hidden size. For typical MLP implementations it is sensible to require that $||\,\mathcal{F}_\ell\,|| = \Theta(1)$, in which case we must have $\alpha = 1$ to keep the correct scaling. However, this is not the only available way to parameterize depth (see below). Note that for any $\alpha > 1$, (Yang et al. 2023a) prove that the parameterization is trivial (in the limit the network converges to the identity). Another way to think about this is that for non-trivial learning to occur, the contributions from the skip connection and the residual (MLP) branch must be the same asymptotic size at the output of the network.
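
A minimal sketch of a residual block with the $L^{-\alpha}$ branch multiplier (PyTorch; the block structure and widths are illustrative, with $\alpha = 1$ as argued above):

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block h ← h + L^{-α} F(h); α = 1 gives the stable depth scaling."""

    def __init__(self, width: int, depth: int, alpha: float = 1.0):
        super().__init__()
        self.branch_scale = depth ** (-alpha)   # L^{-α} multiplier on the residual branch
        self.mlp = nn.Sequential(
            nn.Linear(width, 4 * width),
            nn.GELU(),
            nn.Linear(4 * width, width),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.branch_scale * self.mlp(h)
```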

Depth μP Doesn’t Satisfy Spectral Feature Learning

Our first observation is that the proposed depth-μP framework from TP-VI does not satisfy the spectral feature learning conditions (2). The authors of TP-VI consider MLP blocks $\mathcal{F}_\ell = \bm{W}^\ell\bm{V}^\ell$, with $\bm{W}^\ell, \bm{V}^\ell \in \R^{n \times n}$. Our argument will also apply to rectangular matrices, but we choose square matrices here for ease of exposition. Depth-μP suggests initializing the hidden layers from $\mathcal{N}(0, 1/n)$, but setting the learning rate to be $\eta^\ell = \Theta(L^{\alpha-1}n^{-1})$. Thus we have

||\,\bm{W}^\ell_0\,||=\Theta(1)\ne\Theta(L^{\alpha-1})=||\,\Delta \bm{W}_t^\ell\,||.

In this case, as $L \rightarrow \infty$ the weight updates are too small to learn meaningful features; our updates are no longer maximal! Note that we will still pass a coordinate check in this situation because of the failure case described by equation (3) (see also the empirical coordinate check failures discussed below), further providing evidence for the superiority of the spectral perspective.

Spectral μ\muP Prevents Lazy-Learning

The Complete-P depth parameterization is motivated by the following observation: even though feature learning is occurring in the sense of (1), the model weights get “stuck” in the “lazy regime”, severely harming performance. From the spectral μP perspective, it is clear why this is happening: with the sub-maximal updates we barely move away from the initialized weights.

Roughly speaking, lazy-learning occurs for a layer $\bm{h}^\ell$ when we have

\frac{|\Delta_{\bm{W}}\bm{h}^\ell-\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}|}{|\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}|}=o(1), \qquad \text{as}\quad N, L\rightarrow \infty, \tag{6}

where

\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}(\bm{W}; \bm{W}_0)=\bm{h}(\bm{W}_0)+ \langle\,\nabla_{\bm{W}}\bm{h}(\bm{W})|_{\bm{W}=\bm{W}_0}\,|\,\bm{W}-\bm{W}_0\,\rangle.

Under this framework we are saying that lazy-learning occurs if the linearization around the initial weights is a good approximation as we scale $N, L$. But this is precisely what spectral feature learning aims to prevent!

Observe that by Taylor expansion we simply have

\Delta_{\bm{W}}\bm{h}^\ell=\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}+\frac{1}{2}\Delta\bm{W}^T\,\nabla_{\bm{W}}^2\bm{h}(\bm{W})\,\Delta\bm{W} + \text{L.O.T.}

in other words

\frac{|\Delta_{\bm{W}}\bm{h}^\ell-\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}|}{|\Delta_{\bm{W}}\bm{h}_\ell^{\text{lin}, \bm{W}}|}\approx\frac{|\Delta\bm{W}^T\,\nabla_{\bm{W}}^2\bm{h}(\bm{W})\,\Delta\bm{W}|}{|\bm{h}(\bm{W}_0)|} \approx \frac{||\,\Delta \bm{W}\,||^2}{||\,\bm{W}\,||}.

Note that if we enforce the spectral feature learning conditions (2) then this term is never $o(1)$, but for depth-μP the right hand side decays like $L^{\alpha-1} = o(1)$ and thus exhibits lazy-learning whenever $\alpha \ne 1$.

In summary, using spectral μP as the basis for our theory means that we do not need the additional desiderata from the Complete-P paper to derive Complete-P, simplifying the theoretical analysis.

An ABC-Type Family of Depth Parameterizations

The failure of depth-μP is caused by the weight updates being the wrong size for the initialization, but this leaves open the possibility of shrinking the initialization to match the size of the weight updates. Sample $\bm{W}_0^\ell \sim \mathcal{N}(0, (L^{\alpha - 1}\sqrt{n})^{-1})$ so that

||\,\bm{W}_0^\ell\,||=\Theta(L^{\alpha-1})

to match the weight updates. We now have

||\,\sum_\ell L^{-\alpha}\bm{W}_t^\ell\,||=L^{-\alpha}\Theta\left(\sum_{\ell}||\,\bm{W}_t^\ell\,||\right)=\Theta(1),

as desired. It remains to be seen whether or not this parameterization actually leads to feature learning and μ-transfer in practice. There is reason to suspect it may not, chiefly that having the weights and weight updates shrink so severely as we scale depth seems undesirable. We leave further theoretical and empirical investigations of this depth parameterization to future work.

Muon

Muon is a recently introduced optimizer (Jordan et al. 2024) which has shown promising results for training LLMs at scale (Liu et al. 2025) (Bai et al. 2025).

The Muon update rule is given by

\Delta \bm{W}_t = -\eta\mathbf{O}(\bm{W}_{t-1}),

where $\mathbf{O}$ is the approximately orthogonalized gradient with respect to $\bm{W}_{t-1}$, found using a Newton-Schulz iteration scheme (Jordan et al. 2024). From the perspective of spectral μP, what we primarily care about is the size of the spectral norm of $\mathbf{O}(\bm{W}_{t-1})$, which is $\Theta(1)$ by construction (Bernstein 2025). This implies that the learning rate for the hidden layers in a neural network using Muon should scale like $\eta = \Theta(1)$: the learning rate should transfer across width without having to adjust the initialization scheme or the learning rate (assuming that we are using Adam under SP for the input and output layers).
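
For concreteness, here is a sketch of the orthogonalization step following the quintic Newton-Schulz iteration described in (Jordan et al. 2024) (coefficients and step count follow that reference; this is illustrative, not a tuned implementation):

```python
import torch

def orthogonalize_newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately map G to an orthogonal matrix with the same "directions",
    # so that the spectral norm of the result is Θ(1) by construction.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # Frobenius ≥ spectral, so ||X|| ≤ 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon-style step for a hidden weight matrix W with gradient W.grad:
# W.data -= lr * orthogonalize_newton_schulz(W.grad)
```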

Muon is typically applied with decoupled weight decay, and the full weight update rule is

\Delta\bm{W}_t = -\lambda\eta\bm{W}_{t-1}-\eta\mathbf{O}(\bm{W}_{t-1}).

Note that our analysis from above continues to hold, except now that the learning rate is $\Theta(1)$ for hidden layers, the weight decay will also be $\Theta(1)$. Not only does Muon fix the learning rate scale, but it also fixes the weight decay scale when using decoupled weight decay.

Does Muon Kill μP?

As authors and teams have demonstrated the effectiveness of using Muon at scale, the usefulness of μP as a discipline has been called into question. However, we argue here that this may be a distinction without a difference, since the principles of μP remain relevant even when one uses Muon.

For example, Muon won’t transfer learning rate by default on GQA, and the reasons are the same as those outlined in (Chickering et al. 2025): the spectral norm and the expected operator norm of the network computation do not agree. Thus, getting Muon to work in this setting requires the same fundamental understanding of the underlying behavior of the computation in the spectral norm. Whether or not doing this analysis constitutes “μP for Muon” is up for debate, and frankly the distinction is somewhat meaningless. Regardless, we believe that much in the same way students are encouraged to study SGD before moving on to studying Adam, so should practitioners be encouraged to understand μP prior to moving on to Muon.

Finally, we note that as of this writing, efficient, large-scale, open-source Muon implementations are lacking. Because of this there may be a period of time where Adam continues to be the preferred method of training models. Furthermore, the practical benefits of large-scale training with Muon are not fully understood. It may be the case that the benefits of training with Muon vanish for very large training runs (Wen et al. 2025). It is challenging to ablate a large-scale LLM training run, which makes committing to Muon difficult. We stress that an understanding of the strengths and weaknesses of Muon is currently lacking, but we will likely understand this optimizer better in the coming months and years.

The Implied Dynamics of μP

We pause to consider the implications of not using μP. In particular, we argue that we should understand this situation as different layers learning at incorrect rates. We also explain why we see the standard-parameterization scaling heuristics that we do.

Given that most neural network training already initializes the weights using Kaiming-He initialization, which is correct from the perspective of both standard and spectral μP, the deficiency in using the standard parameterization must be understood exclusively through the weight update scalings. We consider the common case of the Adam optimizer, and can then understand the issues with the standard parameterization by considering the proposed μP modification to the learning rates derived above in (4).

These scalings imply that during training with the standard parameterization, the embedding layer will be learning $n$ times slower than the hidden and unembedding layers (this is complicated by weight tying, which we ignore in this blog). Assuming that we tune the training to find an optimal learning rate, we expect to find an empirical law $\eta \sim 1/n$, since the majority of layers in a network are hidden layers, which naturally prefer this scaling. However, training at this reduced learning rate will severely degrade the rate at which the embedding layer updates, leading to the majority of training taking place with essentially frozen, essentially random embedding weights!

This is a potential cause of issues during pre-training, especially as model sizes get large: the embedding layer is essentially frozen at its random initialization, and the model does not learn embedding features quickly enough to send meaningful signal to the lowest attention blocks. Worse, the embedding component of the transformer loss contributes noise to the gradient, slowing down the rate at which we can train models at all. As we decrease the learning rate further to compensate for this discrepancy, we exacerbate the frozen input weights, leading to a situation where the model’s capacity is reduced. This is why in the TP-V paper we see improved loss when using μP instead of SP for the largest models: as $n \rightarrow \infty$ in SP, the embedding weights become frozen at their random initialized values, effectively reducing the model capacity. μP allows us to fully utilize the model’s capacity during training.

Transferring Across Weight Decay, Learning Rate, and Batch Size

The original TP-V paper (Yang et al. 2022) suggests that we should not expect learning rate transfer across weight decay or batch size. We do not offer a conclusive rebuttal to this assertion, but we consider some more recent work which has made strides in articulating the problem more thoroughly.

We follow a line of work originating with (Wang & Aitchison 2024) and explored in more depth by (Bergsma et al. 2025). In particular, Wang & Aitchison suggest that weight decay and learning rate should be related through the quantity $\tau_{\text{ema}}$

\tau_{\text{ema}}:=\frac{B}{\lambda\eta D},

where $B$ is the batch size and $D$ is the dataset size. They suggest that this quantity should transfer across model training (implicitly, this transfer is understood to occur only at the optimal learning rate), and this suggestion is based on a discrete dynamics argument focused on the change in the weights induced by the total integrated weight decay during training.

We emphasize an important point which we feel is not sufficiently stated in the existing literature: we should only expect that $\tau_{\text{ema}}$ is transferable at the optimal learning rate. The reason should be clear: for a fixed dataset and batch size there are two limiting cases for the weight decay and learning rate. In the first case, $\lambda \rightarrow \infty$ and $\eta \rightarrow 0$; learning will not happen since the learning dynamics are governed purely by the exponential decay of the weights. In the other limiting case, $\lambda \rightarrow 0$ and $\eta \rightarrow \infty$, and the dynamics will be too unstable to meaningfully discuss training.
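
For reference, holding $\tau_{\text{ema}}$ fixed while scaling $B$ and $D$ amounts to solving the definition above for $\lambda$; a one-line sketch:

```python
def weight_decay_for_tau_ema(tau_ema: float, batch_size: int, lr: float, dataset_size: int) -> float:
    # Invert τ_ema = B / (λ · η · D) for λ, holding τ_ema fixed as B and D scale.
    return batch_size / (tau_ema * lr * dataset_size)
```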

In the figures below we demonstrate that properly scaled decoupled weight decay and $\tau_{\text{ema}}$ both exhibit transferability when maintaining a constant tokens-per-parameter (TPP) ratio and using an empirical $B \propto D^{0.5}$ scaling law for batch size.

Figure from (Chickering et al. 2025) showing hyperparameter transfer across both learning rate and weight decay while scaling the batch size and data size to keep a constant TPP. We see good transfer of weight decay when using the suggested weight decay transfer scaling.
Figure from (Chickering et al. 2025) showing hyperparameter transfer across both learning rate and $\tau_{\text{epoch}}$ while scaling the batch size and data size to keep a constant TPP. We see near perfect transfer of $\tau_{\text{epoch}}$ with the weight decay transfer scaling.

Finally, we argue that for a fixed maximum model size we can transfer learning rate across batch size and model size using μP so long as the batch size is sufficiently large and we allow ourselves to scale the dataset size, i.e. we are in the infinite data regime.

The figure below shows the relationship between loss, learning rate, and batch size in the constant TPP (constant data) and constant iterations (infinite data) regimes.

Voronoi plots comparing the loss landscapes for constant TPP and constant iterations. In the case of constant iterations the optimal learning rate is roughly stable and the loss decreases monotonically as a function of batch size. In the constant TPP (constant data) regime, the optimal learning rate is less stable and the learning rate and batch size must be jointly tuned. See (Bergsma et al. 2025) for a detailed empirical analysis of the batch size and TPP scaling laws in the constant data regime.

The infinite data regime is obviously unrealistic for any practical training, but it offers us an ablative setting for small-model experimentation with architectural considerations like μP. To test modifications to μP we can use a sufficiently large batch size to isolate the effects of the model architecture from the effects of the substrate (dataset and batch size). Once we are sure that our μP implementation is working as we expect, we can move on to tuning in the compute-optimal setting.

Some Pitfalls of Coordinate Checking

In this section we summarize some of the pitfalls we have encountered when performing coordinate checking. Before moving forward we discuss some intuitions for practitioners new to coordinate checking. Below we plot a “clean” coordinate check, and note that the coordinates appear stable as we scale model size, with the weights themselves (top row) showing lower variance than the weight updates (bottom row). The first figure shows a standard coordinate check on the activations.

An example of a coordinate check for a model which is implemented correctly. Note that already in this setting it can be challenging to intuit how “flat” the lines should be, since by the third iteration the updates $\Delta \bm{h}_t$ exhibit some not quite flat behavior for the largest models.

A general intuition is that the $\Delta \bm{h}_t$ updates appear to deviate from flat as training progresses, while the spectral $\Delta\bm{W}_t$ coordinate checks get more stable as training progresses. This is easily understood as a consequence of the fact that $\Delta \bm{h}_t$ is a sum of two products (see above) and as such does not directly measure the weight updates.

Spectral coordinate check for the same model checked in the previous figure. Note that in neither case are the resulting qualitative plots easy to interpret.
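
For readers who want to run the spectral check themselves, here is a minimal sketch (PyTorch; make_model, make_optimizer, a list of (inputs, targets) batches, and a model that returns its loss from model(x, y) are hypothetical stand-ins for your own training setup). Under spectral μP the rescaled $||\,\bm{W}\,||$ and $||\,\Delta\bm{W}\,||$ values should be roughly flat in width at every step.

```python
import torch

def spectral_coordinate_check(make_model, make_optimizer, batches, widths, n_steps: int = 3):
    """Log rescaled spectral norms of every 2D weight across widths (a sketch)."""
    for width in widths:
        model = make_model(width)
        opt = make_optimizer(model)
        init = {k: v.detach().clone() for k, v in model.named_parameters()}
        for step, (x, y) in enumerate(batches[:n_steps]):
            loss = model(x, y)            # assumed: the model returns its training loss
            loss.backward()
            opt.step()
            opt.zero_grad()
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if p.ndim != 2:
                        continue
                    n_out, n_in = p.shape
                    scale = (n_in / n_out) ** 0.5
                    w = torch.linalg.matrix_norm(p, ord=2).item() * scale
                    dw = torch.linalg.matrix_norm(p - init[name], ord=2).item() * scale
                    print(f"width={width} step={step} {name}: ||W||={w:.3f} ||dW||={dw:.3f}")
```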

A Failure Case: Sub-Maximal Interior Layer Weight Updates

Building off the computations from the previous section on spectral μP, we show that the failure case described above, namely the weight updates of an interior hidden layer being too small, will not show up in a (Yang et al. 2022)-style coordinate check.

To be more specific, since we can write

\Delta (\bm{W}\bm{h})_t = \Delta \bm{W}_t\bm{h}_t + \bm{W}_t\Delta\bm{h}_t,

both terms will contribute to the output size. Since our recursion assumes that both $\bm{h}_t$ and $\Delta \bm{h}_t$ are the correct size, this means that if we set the weight updates to zero, i.e. $\Delta \bm{W}_t = 0$, then we have $\Delta(\bm{W}\bm{h})_t = \bm{W}_0\Delta \bm{h}_t$, but this is actually correct in norm, since

||\,\Delta(\bm{W}\bm{h})_t\,||_2 = \Theta(||\,\bm{W}_0\,||\,||\,\Delta \bm{h}_t\,||_2)=\Theta(\sqrt{n}).

The layer passes coordinate checks, but the layer is constant: there is no learning going on, and this represents a bug in our implementation.

The following figure demonstrates the issue. We run a coordinate check on a GPT-2 style LLM, using a μP implementation with the hidden layer learning rate set to zero. Thus, the hidden layer weights are constant, $\bm{W}_t = \bm{W}_0$. Despite this, our model has a (mostly) clean coordinate check.

Coordinate checking for a GPT-2 style LLM with the hidden layer learning rate set to zero. Despite the fact that interior layers are not learning, we still pass the qualitative coordinate check suggested by (Yang et al. 2022).

If we follow the suggestions of TP-V, then we would conclude that this implementation is correct. However, our implementation does not pass the more stringent spectral coordinate checks which we advocate for. Below is the spectral coordinate check for the same model where we can clearly see that no learning is occurring.

The following figures are taken from our recent paper (Chickering et al. 2025). When applying μP to grouped-query attention (GQA), one finds that the naïve approach passes a standard coordinate check (first figure), but fails a spectral coordinate check (second figure) and will not produce robust learning rate transfer. We addressed this issue by performing the analysis in the spectral norm rather than the activation 2-norm. This particular failure case further highlights the “correctness” of the spectral perspective.

Coordinate check for a naïve GQA implementation which preserves layer variance but not spectral norm. According to this coordinate check we should have learning rate transfer. The $||\,\bm{h}_t\,||_2$ norms also pass a coordinate check.
The spectral coordinate check for the same model does not pass, explaining why we do not see learning rate transfer for this naïve implementation.

A Failure Case: Non-Power Law Scalings & Power Laws with Questionable Exponents

In the course of our experiments we encountered a second subtle case of failing coordinate checks. This case involves a coordinate check with a non-power law scaling and usually indicates that there is a subtle implementation bug somewhere in the system. Concretely, we can force this bug to show up in two places: (1) when using the Adam optimizer with the $\varepsilon$ parameter set too large, and (2) when looking at a mixture-of-experts router with a poorly tuned load balancing loss.

The following figure shows a coordinate check where the hidden layers are being sub-maximally updated due to the Adam $\varepsilon$ parameter being set too high. In this case the updates behave roughly like the first moment, which decays as the hidden size increases according to Heuristic 1. The decay in weight updates leads to a shift in the optimal learning rate during μ-transfer.

A failing (standard) coordinate check which fails in a non-standard way. Debugging reveals that these lines exhibit an $n^{-1/4}$ power law. Exponents other than $1$ and $1/2$ are unusual to encounter and in our experience indicate a mis-aligned hyperparameter which was improperly accounted for in the derivation.

Conclusions

If we can leave the reader with a single takeaway, it is that for any large-scale training run teams should use some combination of μP and Muon to ensure that all of the layers are training at the “correct” rate. Beyond that, when working with μP implementations one should favor a spectral norm perspective over an activation norm perspective, to avoid some subtle pitfalls that can occur when working with μP.

References

  1. (Bai & Yin 1993)
    Bai, Z. D., Yin, Y. Q. Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix. Ann. Probab. 1993.
  1. (Bai et al. 2025)
    Kimi Team. Kimi K2: Open Agentic Intelligence. arXiv. 2025.
  1. (Bergsma et al. 2025)
    Bergsma, S., Dey, N., Gosal, G., Gray, G., Soboleva, D., Hestness, J. Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training. NeurIPS (to appear). 2026.
  1. (Bernstein 2025)
    Bernstein, J. Deriving Muon. 2025.
  1. (Chickering et al. 2025)
    Chickering, K. R., Wang, H., Wu, M., Moreno, A., Chen, M., Ma, X., Soboleva, D., Hestness, J., Liu, Z., Xing, E. P. GQA-μP: Maximal Update Parameterizations for Grouped Query Attention. Submitted. 2025.
  1. (Dey et al. 2023)
    Dey, N., Gosal, G., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., Hestness, J. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. arXiv. 2023.
  1. (Dey et al. 2024)
    Dey, N., Anthony, Q., Hestness, J. The Practitioner’s guide to the maximal update parameterization. Cerebras Blog. 2024.
  1. (Dey et al. 2025)
    Dey, N., Zhang, B. C., Noci, L., Li, M., Bordelon, B., Bergsma, S., Pehlevan, C., Hanin, B., Hestness, J. Don’t be lazy: CompleteP enables compute-efficient deep transformers. NeurIPS. 2025.
  1. (Everett et al. 2024)
    Everett, K., Xiao, L., Wortsman, M., Alemi, A. A., Novak, R., Liu, P. J., Gur, I., Sohl-Dickstein, J., Kaelbling, L. P., Lee, J., Pennington, J. Scaling Exponents Across Parameterizations and Optimizers. arXiv. 2024.
  1. (Hayou 2025)
    Hayou, S. A Proof of Learning Rate Transfer Under μP. arXiv. 2025.
  1. (Jordan et al. 2024)
    Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., Bernstein, J. Muon: An optimizer for hidden layers in neural networks. Keller Jordan Blog. 2024.
  1. (Kingma & Ba 2014)
    Kingma, D. P., Ba, J. Adam: A Method for Stochastic Optimization. arXiv. 2014.
  1. (Liu et al. 2025)
    Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., et al. Muon is Scalable for LLM Training. arXiv. 2025.
  1. (Loshchilov & Hutter 2017)
    Loshchilov, I., Hutter, F. Decoupled Weight Decay Regularization. arXiv. 2017.
  1. (Tao 2012)
    Tao, T. Topics in Random Matrix Theory. AMS GSM. 2012.
  1. (Wang & Aitchison 2024)
    Wang, X., Aitchison, L. How to set AdamW’s weight decay as you scale model and dataset size. arXiv. 2024.
  1. (Wen et al. 2025)
    Wen, K., Hall, D., Ma, T., Liang, P. Fantastic Pretraining Optimizers and Where to Find Them. arXiv. 2025.
  1. (Yang 2019)
    Yang, G. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. arXiv. 2019.
  1. (Yang 2020a)
    Yang, G. Tensor Programs II: Neural Tangent Kernel for Any Architecture. arXiv. 2020.
  1. (Yang 2020b)
    Yang, G. Tensor Programs III: Neural Matrix Laws. arXiv. 2020.
  1. (Yang & Hu 2020)
    Yang, G., Hu, E. Feature Learning In Infinite-Width Neural Networks. arXiv. 2020.
  1. (Yang et al. 2022)
    Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., Gao, J. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. arXiv. 2022.
  1. (Yang et al. 2023a)
    Yang, G., Yu, D., Zhu, C., Hayou, S. Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks. arXiv. 2023.
  1. (Yang et al. 2023b)
    Yang, G., Simon, J. B., Bernstein, J. A Spectral Condition for Feature Learning. arXiv. 2023.
  1. (Yin et al. 1988)
    Yin, Y. Q., Bai, Z. D., Krishnaiah, P. R. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields. 1988.

Citation

@misc{chickering2025mup,
	author={Kyle R. Chickering},
	title={The Spectral Maximal Update Parameterization in Theory and Practice},
	year={2025},
	url={},
	publisher={Kyle Chickering's Blog}
}