Gérard Ben Arous (New York University)

tbc

Abstract

tbc

Elizabeth Collins-Woodfin (McGill University)

High-dimensional dynamics of SGD for Gaussian mixture models

Abstract

We study the dynamics of streaming SGD in the context of high-dimensional k-component Gaussian mixture models.  Using techniques from high-dimensional probability, matrix theory, and stochastic calculus, we show that, when the data dimension d grows proportionally to the number of samples n, SGD converges to a deterministic equivalent, characterized by a system of ordinary differential equations.  A key contribution of our technique is that it works for non-isotropic data.  As a simple example, I will discuss logistic regression on the 2-component model for various data covariance structures to illustrate the SGD dynamics for GMMs with non-isotropy.  I will also discuss an extension of our methods to models with a growing number of components (k of order log(d)).  This is based on work in progress with Inbar Seroussi.
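
For readers who want a concrete handle on the setup, the following minimal sketch runs streaming (one-pass) SGD for logistic regression on a symmetric 2-component Gaussian mixture with a non-isotropic covariance; the dimension, covariance, and step size are illustrative choices, not the parameters of the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 500                          # data dimension (illustrative)
    n = 4 * d                        # number of streamed samples, proportional to d
    mu = np.ones(d) / np.sqrt(d)     # class mean (illustrative)
    cov_sqrt = np.diag(1.0 + 0.5 * rng.random(d))   # non-isotropic covariance (square root)
    step = 1.0 / d                   # step-size scaling natural in the proportional regime

    w = np.zeros(d)                  # logistic regression parameter
    for _ in range(n):
        y = rng.choice([-1.0, 1.0])                        # balanced two-component label
        x = y * mu + cov_sqrt @ rng.standard_normal(d)     # Gaussian mixture sample
        grad = -y * x / (1.0 + np.exp(y * np.dot(w, x)))   # gradient of the logistic loss
        w -= step * grad                                   # one-pass (streaming) SGD update

    print("overlap <w, mu> at the end of training:", np.dot(w, mu))

In the proportional regime described above, it is low-dimensional summaries of the trajectory such as the overlap printed here that the limiting ODE system characterizes.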

Murat Erdogdu (University of Toronto)

Learning quadratic neural networks in high dimensions

Abstract

We study the optimization and sample complexity of gradient-based training of a two-layer student neural network with quadratic activation function in the high-dimensional regime, where the input is Gaussian and the response is generated by a two-layer teacher network with quadratic activation and power-law decay of the second-layer coefficients. We provide a sharp analysis of the SGD dynamics in the feature learning regime and derive scaling laws for the prediction risk that highlight the power-law dependence on the optimization time, sample size, and model width.
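
For concreteness, a teacher of the kind described above can be written as follows; the normalization and the decay exponent $\beta$ are kept generic, and this is only a schematic rendering of the setup, not the exact parametrization of the talk:

$$ f_\star(x) \;=\; \sum_{m=1}^{M} a_m \,\langle w_m, x\rangle^2, \qquad x \sim \mathcal{N}(0, I_d), \qquad a_m \propto m^{-\beta}, $$

with the student being a two-layer network of the same quadratic form trained by SGD on the squared loss.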

Zhou Fan (Yale University)

Dynamical mean-field analysis of adaptive Langevin diffusions

Abstract

In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin trajectory via a maximum marginal-likelihood scheme. Using techniques of dynamical mean-field theory (DMFT), we provide a precise characterization of a high-dimensional asymptotic limit for the joint evolution of the prior parameter and law of the Langevin sample. We then carry out an analysis of the equations that describe this DMFT limit, under conditions of approximate time-translation-invariance which include, in particular, settings where the posterior law satisfies a log-Sobolev inequality. In such settings, we show that this adaptive Langevin trajectory converges on a dimension-independent time horizon to an equilibrium state that is characterized by a system of scalar fixed-point equations, and the associated prior parameter converges to a critical point of a replica-symmetric limit for the model free energy. We explore the nature of the free energy landscape and its critical points in a few simple examples, where such critical points may or may not be unique.

This is joint work with Justin Ko, Bruno Loureiro, Yue M. Lu, and Yandi Shen.
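
Schematically, and with notation introduced only for this summary, the object of study is a Langevin diffusion for the posterior of a linear model $y = X\theta + \varepsilon$ whose prior parameter $\lambda$ is updated along the trajectory:

$$ d\theta_t \;=\; \nabla_\theta \log p_{\lambda_t}(\theta_t \mid X, y)\, dt \;+\; \sqrt{2}\, dB_t, \qquad p_\lambda(\theta \mid X, y) \;\propto\; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|y - X\theta\|^2\Big)\, g_\lambda(\theta), $$

where $g_\lambda$ is the adaptive prior and $\lambda_t$ ascends an estimate of the marginal likelihood built from the samples seen so far; the precise adaptation scheme, and the DMFT limit of the pair formed by $\lambda_t$ and the law of the Langevin sample, are the subject of the talk.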

Florent Krzakala (EPFL)

tbc

Abstract

Zhenyu Liao (Huazhong University of Science and Technology)

Inversion Bias of Random Matrices: Precise Characterization, Implications for Randomized Numerical Linear Algebra, and Beyond

Abstract

Bruno Loureiro (ENS & CNRS)

High-dimensional limit of SGD for sequence single-index models

Abstract

In this talk, I will discuss the high-dimensional limits of SGD for sequence single-index models: a generalisation of the standard single-index model to the sequential domain, encompassing simplified one-layer attention architectures. Despite its simplicity, this model captures some important aspects of sequential data, allowing us to investigate questions such as the benefits of sequence length and the role of positional encoding in learning semantic vs. positional features of the target with attention-based architectures.
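
As a rough template (not necessarily the exact parametrization of the talk), a sequence single-index target takes a sequence of tokens $X = (x_1, \dots, x_L)$, $x_\ell \in \mathbb{R}^d$, and produces a label that depends on the tokens only through their projections onto a single direction $w$:

$$ y \;=\; g\big(\langle w, x_1\rangle, \dots, \langle w, x_L\rangle\big), $$

which reduces to the standard single-index model when $L = 1$, and which a simplified one-layer attention architecture can fit by mixing tokens through content- or position-dependent attention weights before a single-index readout.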

Yue Lu (Harvard University)

Attention-Based In-Context Learning: Insights from Random Matrix Theory

Abstract

Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers’ success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model’s behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights, obtained by a random matrix analysis, are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.

Joint work with Mary Letey, Jacob Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan.
https://arxiv.org/abs/2405.11751
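
In symbols (chosen here only for this summary), the scaling regime of the analysis is

$$ d \to \infty, \qquad \text{context length } \ell \asymp d, \qquad \text{pretraining task diversity } k \asymp d, \qquad \text{pretraining examples } n \asymp d^2, $$

with the learning curve and the memorization-to-ICL phase transition described as functions of the corresponding ratios.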

Theodor Misiakiewicz (Yale University)

Fundamental limits of learning under group equivariance

Abstract

Modern machine learning heavily relies on the success of large-scale models trained via gradient-based algorithms. A major effort in recent years has been to understand the fundamental limits of these learning algorithms: What is the complexity of gradient-based training? Which distributions can these algorithms learn efficiently? Can we identify simple principles underlying feature learning? We focus in this talk on a key property of generic gradient-based methods: their equivariance with respect to a large symmetry group. We develop a group-theoretic framework to analyze the complexity of equivariant learning algorithms when trained on a given target distribution.  This framework reveals a natural factorization of the group-distribution pair, and suggests an optimal sequential, adaptive learning process. Using these results, we revisit recent work on learning juntas and multi-index models using gradient algorithms.

This is based on joint work with Hugo Koubbi (Yale), Nirmit Joshi (TTIC), and Nati Srebro (TTIC).
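
The notion of equivariance at play can be stated, in its generic form (the talk's framework may refine it), as follows: a randomized learning algorithm $\mathcal{A}$, mapping a training set $D$ to a predictor, is equivariant under a group $G$ acting on the data if

$$ \mathcal{A}(g \cdot D) \;\stackrel{d}{=}\; g \cdot \mathcal{A}(D) \qquad \text{for all } g \in G, $$

i.e. transforming the data and then training gives, in distribution, the same result as training and then transforming the predictor, a property that, as the abstract notes, generic gradient-based methods enjoy with respect to a large symmetry group.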

Liza Rebrova (Princeton University)

tbc

Abstract

tbc

Inbar Seroussi (Tel-Aviv University)

High-Dimensional SGD Theory: Insights for Algorithmic Design

Abstract

Stochastic optimization methods are fundamental in modern machine learning, yet understanding their remarkable performance remains a significant theoretical challenge. This talk presents a theoretical framework for analyzing stochastic gradient descent (SGD) in the high-dimensional regime, where both the sample size and parameter dimension grow large. The analysis focuses on generalized linear models and multi-index problems trained with Gaussian data having a general covariance structure. The limiting dynamics are governed by a set of low-dimensional ordinary differential equations (ODEs). This setup encompasses many important optimization problems, including logistic regression, phase retrieval, and two-layer neural networks.

The second part of the talk presents two applications of this theory. First, analysis of stochastic adaptive algorithms, such as line search and AdaGrad-Norm, reveals how data anisotropy influences algorithmic performance. Second, examination of differentially private SGD with gradient clipping demonstrates that the theoretical framework yields improved error rates for risk estimation in the challenging regime of aggressive clipping.
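
For reference, the clipped-gradient update analyzed in the second application follows the standard DP-SGD recipe sketched below; the function names and constants are illustrative, and the talk's contribution is the high-dimensional characterization of such dynamics, not this implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def dp_sgd_step(w, X_batch, y_batch, grad_fn, clip=1.0, noise_mult=1.0, lr=0.1):
        """One DP-SGD step: clip each per-sample gradient to norm `clip`,
        average, and add Gaussian noise of scale noise_mult * clip / batch_size."""
        clipped = []
        for x, y in zip(X_batch, y_batch):
            g = grad_fn(w, x, y)
            g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # per-sample clipping
            clipped.append(g)
        g_bar = np.mean(clipped, axis=0)
        noise = (noise_mult * clip / len(X_batch)) * rng.standard_normal(w.shape)
        return w - lr * (g_bar + noise)

    # illustrative use with a least-squares gradient for a linear model
    grad_fn = lambda w, x, y: (np.dot(w, x) - y) * x

Aggressive clipping corresponds to `clip` being small relative to typical gradient norms, which is the regime in which the improved risk-estimation rates apply.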

Denny Wu (New York University)

Learning shallow neural networks in high dimensions: SGD dynamics and scaling laws

Abstract

We study the sample and time complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $M$ orthogonal neurons on isotropic Gaussian data. We focus on the challenging “extensive-width” regime $M \gg 1$ and allow for a large condition number in the second-layer parameters, covering the power-law scaling $a_m = m^{-\beta}$ as a special case. We characterize the SGD dynamics for the training of a student two-layer neural network and identify sharp transition times for the recovery of each signal direction. In the power-law setting, our analysis shows that while the learning of individual teacher neurons exhibits abrupt phase transitions, the juxtaposition of emergent learning curves at different timescales results in a smooth scaling law in the cumulative objective.
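
In the notation of the abstract (with a generic activation $\sigma$, which the abstract does not pin down), the target can be written as

$$ f_\star(x) \;=\; \sum_{m=1}^{M} a_m\, \sigma\big(\langle w_m, x\rangle\big), \qquad \langle w_m, w_{m'}\rangle = \delta_{mm'}, \qquad a_m = m^{-\beta}, \qquad x \sim \mathcal{N}(0, I_d), $$

so the condition number of the second layer is $a_1 / a_M = M^{\beta}$, and the smooth scaling law arises from superposing the $M$ individual recovery transitions, each occurring on its own timescale.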

Lechao Xiao (Google)

Rethinking conventional wisdom in machine learning: from generalization to scaling

Abstract

The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed “scaling law crossover,” where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones (a stylized illustration follows the two questions below). Together, these observations highlight two fundamental questions within this new paradigm:

∙ Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?

∙ Model Comparison at Scale: How can we reliably and effectively compare models at scales where only a single experiment is feasible?
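
A stylized caricature of the crossover (generic power-law fits, not numbers from the paper): if two methods follow $L_A(C) = a_A C^{-\alpha_A}$ and $L_B(C) = a_B C^{-\alpha_B}$ in a scale variable $C$, with $a_A > a_B$ and $\alpha_A > \alpha_B$, the curves cross at

$$ C^{\ast} \;=\; \Big(\frac{a_A}{a_B}\Big)^{\frac{1}{\alpha_A - \alpha_B}}, $$

so method A is worse below $C^{\ast}$ and better above it, and a comparison run entirely below $C^{\ast}$ points to the wrong winner.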

Yizhe Zhu (University of Southern California)

Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity

Abstract

For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank. In this talk, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We improve the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. Joint work with Dominik Stöger (KU Eichstätt-Ingolstadt).
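
In formulas (notation introduced only for this summary), one observes $y_i = \langle A_i, X_\star \rangle$ for $i = 1, \dots, m$, where $X_\star \in \mathbb{R}^{n \times n}$ is positive semidefinite of rank $r$ and the $A_i$ are Gaussian measurement matrices, and the non-convex approach runs gradient descent on the factorized (Burer-Monteiro) objective

$$ f(U) \;=\; \frac{1}{4m} \sum_{i=1}^{m} \big(\langle A_i, U U^{\top}\rangle - y_i\big)^2, \qquad U \in \mathbb{R}^{n \times r}, $$

where the $1/(4m)$ normalization is a common convention; the result lowers the rank-dependence of the number of measurements $m$ needed for recovery from quadratic to linear in $r$.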