Abstract:
Scaling laws for large language models (LLMs) reveal a striking empirical regularity: model performance improves according to predictable power laws as training data and compute scale. These laws have profoundly shaped the development of modern AI, yet their origins have remained largely empirical and theoretically unexplained. To uncover the underlying mechanism, we introduce power-law kernel regression, a minimal yet structurally faithful model that captures the essential ingredients driving scaling behavior. By analyzing its stochastic training dynamics through a continuous-time stochastic differential equation, we develop the framework of Functional Scaling Laws (FSL). FSL elevates classical scaling laws from predicting a final-step loss to predicting the entire loss trajectory. This functional viewpoint reveals an intrinsic-time structure that unifies training dynamics across model sizes, data scales, and learning-rate schedules. In particular, FSL explains why widely used learning-rate schedules, such as warmup–stable–decay, are so effective. Finally, experiments on LLM pre-training demonstrate that FSL offers a principled framework for both understanding and guiding large-scale model training.
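For reference, the classical final-loss scaling laws mentioned above are typically parameterized as a sum of power laws in model size and data (the Chinchilla-style form); this is a standard illustrative form from the scaling-laws literature, not necessarily the exact parameterization used in this paper:

\[
  L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\]

where $N$ is the parameter count, $D$ the number of training tokens, $E$ the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are fitted constants. FSL, by contrast, aims to predict the entire loss trajectory over training rather than only this final value.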