Universal One-third Time Scaling in Learning Peaked Distributions

Name: Universal One-third Time Scaling in Learning Peaked Distributions
Authors: Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

Type: paper; Date: 2026-02-03; Language: English

Description

Training large language models (LLMs) is computationally expensive, partly because the loss converges slowly, following a power law whose origin remains debated. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, such as next-token distributions, these components yield losses and gradients that vanish as power laws, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
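As a rough illustration of the mechanism named in the abstract, the sketch below runs plain gradient descent directly on the logits of a softmax with a cross-entropy loss and a one-hot target, the most peaked distribution possible. This is not the authors' toy model: the function names, the vocabulary size, the learning rate, and the choice to optimize logits directly are all assumptions made for illustration. It relies only on the standard fact that the gradient of cross-entropy with respect to the logits is `softmax(logits) - target`. In this simplified setting the loss falls roughly as $1/t$, an exponent of 1 rather than the $1/3$ reported above, so it shows only that softmax and cross-entropy make both loss and gradient shrink as a power law instead of exponentially.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_and_grad(logits, target):
    """Return the loss -sum_i q_i log p_i and its gradient p - q w.r.t. the logits."""
    probs = softmax(logits)
    loss = -sum(q * math.log(p) for p, q in zip(probs, target) if q > 0)
    grad = [p - q for p, q in zip(probs, target)]
    return loss, grad


def train(steps=10_000, lr=1.0, vocab=4, checkpoints=(10, 100, 1000, 10_000)):
    """Gradient descent on raw logits (an illustrative assumption, not the paper's setup).

    Returns {step: (loss, largest absolute gradient entry)} at the checkpoints,
    measured just before the update at that step.
    """
    logits = [0.0] * vocab
    target = [1.0] + [0.0] * (vocab - 1)  # one-hot target: fully peaked
    history = {}
    for step in range(1, steps + 1):
        loss, grad = cross_entropy_and_grad(logits, target)
        if step in checkpoints:
            history[step] = (loss, max(abs(g) for g in grad))
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return history


if __name__ == "__main__":
    for step, (loss, grad_size) in sorted(train().items()):
        print(f"step {step:>6}: loss = {loss:.3e}, max |grad| = {grad_size:.3e}")
```

Fitting a straight line to log(loss) against log(step) over the later checkpoints gives a slope close to -1 here. The gradient shrinks at the same rate as the loss, which is the sense in which a confident softmax slows its own training.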