{"id":18020,"date":"2025-06-04T13:25:12","date_gmt":"2025-06-04T17:25:12","guid":{"rendered":"https:\/\/www.crmath.ca\/page-calendrier\/abstracts-approximation-2025\/"},"modified":"2025-06-17T09:03:05","modified_gmt":"2025-06-17T13:03:05","slug":"abstracts-random","status":"publish","type":"page-calendrier","link":"https:\/\/www.crmath.ca\/en\/page-calendrier\/abstracts-random\/","title":{"rendered":"abstracts random-2025"},"content":{"rendered":"<p><div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-1 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap\" style=\"max-width:1420.64px;margin-left: calc(-4% \/ 2 );margin-right: calc(-4% \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-1\"><h4>G\u00e9rard Ben Arous (New York University)<\/h4>\n<p>A dynamical spectral transition for SGD for Gaussian mixtures<\/p>\n<details open=\"open\">\n<summary>Abstract<\/summary>\n<p>This is joint work with Jiaoyang Huang, Reza Gheissari and Aukosh Jagannath.<br \/>\nI will briefly cover the recent notions of summary statistics and effective dynamics for high-dimensional optimization dynamics, and then show how this works for the central case of classification for mixtures of Gaussians. 
<h4>Elizabeth Collins-Woodfin (McGill University)</h4>
<p>High-dimensional dynamics of SGD for Gaussian mixture models</p>
<details open>
<summary>Abstract</summary>
<p>We study the dynamics of streaming SGD in the context of high-dimensional k-component Gaussian mixture models. Using techniques from high-dimensional probability, matrix theory, and stochastic calculus, we show that, when the data dimension d grows proportionally to the number of samples n, SGD converges to a deterministic equivalent, characterized by a system of ordinary differential equations. A key contribution of our technique is that it works for non-isotropic data. As a simple example, I will discuss logistic regression on the 2-component model for various data covariance structures to illustrate the SGD dynamics for GMMs with non-isotropy. I will also discuss an extension of our methods to models with a growing number of components (k of order log(d)). This is based on work in progress with Inbar Seroussi.</p>
</details>
<h4>Zhou Fan (Yale University)</h4>
<p>Dynamical mean-field analysis of adaptive Langevin diffusions</p>
<details open>
<summary>Abstract</summary>
<p>In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin trajectory via a maximum marginal-likelihood scheme. Using techniques of dynamical mean-field theory (DMFT), we provide a precise characterization of a high-dimensional asymptotic limit for the joint evolution of the prior parameter and law of the Langevin sample. We then carry out an analysis of the equations that describe this DMFT limit, under conditions of approximate time-translation-invariance which include, in particular, settings where the posterior law satisfies a log-Sobolev inequality. In such settings, we show that this adaptive Langevin trajectory converges on a dimension-independent time horizon to an equilibrium state that is characterized by a system of scalar fixed-point equations, and the associated prior parameter converges to a critical point of a replica-symmetric limit for the model free energy. We explore the nature of the free energy landscape and its critical points in a few simple examples, where such critical points may or may not be unique.</p>
<p>This is joint work with Justin Ko, Bruno Loureiro, Yue M. Lu, and Yandi Shen.</p>
</details>
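<p>As a rough illustration of the kind of adaptive Langevin dynamics described above, the sketch below discretizes a Langevin diffusion targeting the posterior of a Bayesian linear regression model while a scalar prior variance is updated by stochastic gradient ascent on the log joint density evaluated at the current sample, an online surrogate for the marginal-likelihood gradient. The discretization, the scalar prior parameterization, and all constants are my own simplified choices, not the construction analyzed in the talk.</p>
<pre><code>import numpy as np

# Toy adaptive Langevin: unadjusted Langevin dynamics on the posterior of
# Bayesian linear regression y = X beta + noise, prior beta ~ N(0, tau2 * I),
# with tau2 adapted along the trajectory via an ascent step on
# log p(beta, y | tau2) at the current sample (a surrogate for the
# marginal-likelihood gradient, in the spirit of Langevin-within-EM schemes).
rng = np.random.default_rng(1)
n, d, sigma2 = 300, 100, 0.25
beta_true = rng.standard_normal(d) * 0.5          # hypothetical ground truth
X = rng.standard_normal((n, d)) / np.sqrt(d)      # i.i.d. design
y = X @ beta_true + np.sqrt(sigma2) * rng.standard_normal(n)

beta, tau2 = np.zeros(d), 1.0
h, a = 1e-3, 1e-2                                  # Langevin step size, adaptation rate
for t in range(10000):
    # gradient of the log posterior density in beta (up to a constant)
    grad = X.T @ (y - X @ beta) / sigma2 - beta / tau2
    beta = beta + h * grad + np.sqrt(2.0 * h) * rng.standard_normal(d)
    # ascent step on log(tau2): d/dlog(tau2) of log prior = -d/2 + ||beta||^2/(2*tau2)
    dlog = -0.5 * d + 0.5 * (beta @ beta) / tau2
    tau2 = tau2 * np.exp(a * dlog / d)             # damped multiplicative update
    if (t + 1) % 2500 == 0:
        print(f"step {t+1}: tau2 = {tau2:.3f}, mean beta^2 = {np.mean(beta**2):.3f}")
</code></pre>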
<h4>Damien Ferbach (Université de Montréal)</h4>
<p>Dimension-adapted Momentum Outscales SGD</p>
<details open>
<summary>Abstract</summary>
<p>We investigate scaling laws for stochastic momentum algorithms with small batches on the power-law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields scaling law exponents identical to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions, and large-scale text experiments with LSTMs show that DANA's improved loss exponents over SGD hold in a practical setting.</p>
</details>
<h4>Florent Krzakala (EPFL)</h4>
<p>Some Recent Progress in Asymptotics for High-Dimensional Neural Networks</p>
<details open>
<summary>Abstract</summary>
<p>I will present recent advances in the asymptotic analysis of overparameterized neural networks, focusing on two extensions of classical models: deep architectures and nonlinear activations.</p>
<p>First, we study gradient descent on hierarchical Gaussian targets and show that depth enables effective dimensionality reduction, leading to improved sample complexity over kernel or shallow methods.</p>
<p>Second, we derive sharp generalization thresholds for two-layer networks with quadratic activations by mapping the learning dynamics to a convex matrix sensing problem with nuclear norm regularization.</p>
<p>These results combine tools from random matrix theory, convex optimization, and statistical physics to clarify the inductive biases and limits of deep learning in high dimensions.</p>
</details>
<h4>Mufan Li (Princeton University)</h4>
<p>The Proportional Scaling Limit of Neural Networks</p>
<details open>
<summary>Abstract</summary>
<p>Recent advances in deep learning performance have all relied on scaling up the number of parameters within neural networks, consequently making asymptotic scaling limits a compelling approach to theoretical analysis. In this talk, we explore the proportional infinite-depth-and-width limit, where the role of depth can be adequately studied, and the limit remains a great model of finite size networks. At initialization, we characterize the limiting distribution of the network via a stochastic differential equation (SDE) for the feature covariance matrix. Furthermore, in the linear network setting, we characterize the spectrum of the covariance matrix in the large data limit via a geometric variant of Dyson Brownian motions. Finally, we will briefly discuss ongoing work towards analyzing training dynamics.</p>
</details>
<h4>Zhenyu Liao (Huazhong University of Science and Technology)</h4>
<p>Inversion Bias of Random Matrices: Precise Characterization, Implications for Randomized Numerical Linear Algebra, and Beyond</p>
<details open>
<summary>Abstract</summary>
<p>Given a random variable X with expectation E[X], one generally has E[1/X] ≠ 1/E[X], due to the nonlinear nature of the inverse. A similar phenomenon holds for random matrices, and this fundamental inversion bias has important implications for modern statistical and numerical methods.</p>
<p>In this talk, I will discuss how this bias arises in a range of randomized sketching techniques (including random sampling and random projections) that are widely used in large-scale machine learning (ML) and randomized numerical linear algebra (RandNLA). Drawing on joint work with Michał Dereziński (U Michigan), Edgar Dobriban (UPenn), Michael Mahoney (UC Berkeley), and Chengmei Niu (HUST), we exploit leave-one-out techniques from random matrix theory (RMT) to precisely characterize this inversion bias and show that it can, in some cases at least, be corrected with ease.</p>
<p>We further leverage these technical results to establish problem-independent local convergence rates for sub-sampled Newton methods.</p>
</details>
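<p>The inversion bias above has a particularly transparent special case for Gaussian sketches, where a classical inverse-Wishart identity gives an exact multiplicative bias of m/(m-d-1). The numerical check below illustrates that special case and the corresponding rescaling; it is only an illustration of the phenomenon, not the leave-one-out characterization or debiasing results of the talk.</p>
<pre><code>import numpy as np

# Inversion bias for a Gaussian sketch: with S an m x n matrix of i.i.d.
# N(0, 1/m) entries, (SA)^T (SA) is (1/m) * Wishart_d(m, A^T A), so
#   E[ ((SA)^T (SA))^{-1} ] = (m / (m - d - 1)) * (A^T A)^{-1}   exactly.
# We check this numerically and verify that rescaling removes the bias.
rng = np.random.default_rng(0)
n, d, m, reps = 500, 10, 40, 2000

A = rng.standard_normal((n, d))
H_inv = np.linalg.inv(A.T @ A)                    # target: (A^T A)^{-1}

acc = np.zeros((d, d))
for _ in range(reps):
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketching matrix
    SA = S @ A
    acc += np.linalg.inv(SA.T @ SA)
E_inv = acc / reps                                # empirical E[inverse of sketched Hessian]

debiased = E_inv * (m - d - 1) / m
rel_err = np.linalg.norm(debiased - H_inv) / np.linalg.norm(H_inv)
print(f"empirical bias factor      : {np.trace(E_inv) / np.trace(H_inv):.3f}")
print(f"theoretical m/(m-d-1)      : {m / (m - d - 1):.3f}")
print(f"rel. error after debiasing : {rel_err:.3e}")
</code></pre>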
<h4>Bruno Loureiro (ENS &amp; CNRS)</h4>
<p>High-dimensional limit of SGD for sequence single-index models</p>
<details open>
<summary>Abstract</summary>
<p>In this talk, I will discuss the high-dimensional limits of SGD for sequence single-index models: a generalisation of the standard single-index model to the sequential domain, encompassing simplified one-layer attention architectures. Despite its simplicity, this model captures some important aspects of sequential data, allowing us to investigate questions such as the benefits of sequence length and the role of positional encoding in learning semantic vs. positional features of the target with attention-based architectures.</p>
</details>
<h4>Yue Lu (Harvard University)</h4>
<p>Attention-Based In-Context Learning: Insights from Random Matrix Theory</p>
<details open>
<summary>Abstract</summary>
<p>Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: in the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights, obtained by a random matrix analysis, are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.</p>
<p>Joint work with Mary Letey, Jacob Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan.<br />
https://arxiv.org/abs/2405.11751</p>
</details>
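<p>For readers less familiar with linear attention as a model of in-context regression, the toy below shows the standard observation that a single linear attention readout over context pairs (x, y) coincides with one step of gradient descent on the in-context least-squares problem started from zero. The dimensions and step size are arbitrary illustrative choices; this is not the pretrained model whose learning curve is analyzed in the talk.</p>
<pre><code>import numpy as np

# One linear-attention "readout" for in-context linear regression:
# given context pairs (x_i, y_i) with y_i = w_task . x_i and a query x_q,
#   y_hat = (eta / L) * sum_i y_i * (x_i . x_q)
# is exactly one gradient-descent step on the in-context least-squares loss
# starting from w = 0, written as a linear attention layer with value y_i
# and attention score x_i . x_q (no softmax).
rng = np.random.default_rng(2)
d, L, eta = 50, 200, 1.0

w_task = rng.standard_normal(d) / np.sqrt(d)   # a fresh task vector
X = rng.standard_normal((L, d))                # context inputs
y = X @ w_task                                 # noiseless context labels
x_q = rng.standard_normal(d)                   # query token

# linear attention form: scores s_i = x_i . x_q, output = (eta/L) * sum_i s_i y_i
scores = X @ x_q
y_hat_attn = (eta / L) * scores @ y

# equivalent one-step-of-GD form: w_1 = (eta/L) X^T y, prediction w_1 . x_q
w_one_step = (eta / L) * X.T @ y
y_hat_gd = w_one_step @ x_q

print(f"linear attention prediction: {y_hat_attn:+.4f}")
print(f"one-step-GD prediction     : {y_hat_gd:+.4f}")
print(f"true label at the query    : {x_q @ w_task:+.4f}")
</code></pre>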
<h4>Noah Marshall (McGill University)</h4>
<p>To Clip or not to Clip: The Dynamics of SGD with Gradient Clipping in High Dimensions</p>
<details open>
<summary>Abstract</summary>
<p>The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this talk, I discuss work studying gradient clipping on a least squares problem under streaming SGD.</p>
<p>We show that the risk dynamics are tracked by a system of ODEs in high dimensions, and in particular in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. I will discuss when gradient clipping can be used to improve SGD performance and propose a simple heuristic for near-optimal scheduling of the clipping threshold.</p>
</details>
<h4>Theodor Misiakiewicz (Yale University)</h4>
<p>Fundamental limits of learning under group equivariance</p>
<details open>
<summary>Abstract</summary>
<p>Modern machine learning heavily relies on the success of large-scale models trained via gradient-based algorithms. A major effort in recent years has been to understand the fundamental limits of these learning algorithms: What is the complexity of gradient-based training? Which distributions can these algorithms learn efficiently? Can we identify simple principles underlying feature learning? We focus in this talk on a key property of generic gradient-based methods: their equivariance with respect to a large symmetry group. We develop a group-theoretic framework to analyze the complexity of equivariant learning algorithms when trained on a given target distribution. This framework reveals a natural factorization of the group-distribution pair, and suggests an optimal sequential, adaptive learning process. Using these results, we revisit recent work on learning juntas and multi-index models using gradient algorithms.</p>
<p>This is based on joint work with Hugo Koubbi (Yale), Nirmit Joshi (TTIC), and Nati Srebro (TTIC).</p>
</details>
<h4>Inbar Seroussi (Tel-Aviv University)</h4>
<p>High-Dimensional SGD Theory: Insights for Algorithmic Design</p>
<details open>
<summary>Abstract</summary>
<p>Stochastic optimization methods are fundamental in modern machine learning, yet understanding their remarkable performance remains a significant theoretical challenge. This talk presents a theoretical framework for analyzing stochastic gradient descent (SGD) in the high-dimensional regime, where both the sample size and parameter dimension grow large. The analysis focuses on generalized linear models and multi-index problems trained with Gaussian data having a general covariance structure. The limiting dynamics are governed by a set of low-dimensional ordinary differential equations (ODEs). This setup encompasses many important optimization problems, including logistic regression, phase retrieval, and two-layer neural networks.<br />
The second part of the talk presents two applications of this theory. First, analysis of stochastic adaptive algorithms, such as line search and AdaGrad-Norm, reveals how data anisotropy influences algorithmic performance. Second, examination of differentially private SGD with gradient clipping demonstrates that the theoretical framework yields improved error rates for risk estimation in the challenging regime of aggressive clipping.</p>
</details>
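<p>Gradient clipping, which appears in both the Marshall and Seroussi abstracts above, is simple to state in code. Below is a minimal streaming-SGD-with-clipping loop on a least-squares problem, with an arbitrary threshold and a generic 1/sqrt(t) step-size decay; these are illustrative choices, not the scalings or the threshold schedule studied in the talks. A DP-SGD variant would additionally inject Gaussian noise calibrated to the clipping threshold after each clipped step.</p>
<pre><code>import numpy as np

# Streaming SGD with per-sample gradient clipping on least squares.
# clip(g, c) rescales the gradient so that its norm never exceeds c.
rng = np.random.default_rng(3)
d, steps, lr, c = 200, 50000, 0.05, 1.0
w_star = rng.standard_normal(d) / np.sqrt(d)      # hypothetical target
w = np.zeros(d)

def clip(g, c):
    gnorm = np.linalg.norm(g)
    return g * min(1.0, c / (gnorm + 1e-12))

for t in range(steps):
    x = rng.standard_normal(d)
    y = x @ w_star + 0.1 * rng.standard_normal()  # one fresh sample per step
    g = (x @ w - y) * x                           # per-sample least-squares gradient
    step = lr / np.sqrt(t + 1.0)                  # simple decaying step size
    w -= step * clip(g, c)
    if (t + 1) % 10000 == 0:
        err = np.mean((w - w_star) ** 2)          # per-coordinate squared parameter error
        print(f"step {t+1}: squared parameter error per coordinate = {err:.5f}")
</code></pre>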
<h4>Mert Vural (University of Toronto)</h4>
<p>Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws</p>
<details open>
<summary>Abstract</summary>
<p>We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the input is Gaussian and the response is generated from a two-layer teacher network with quadratic activation, and power-law decay on the second-layer coefficients. We provide a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.</p>
</details>
<h4>Denny Wu (New York University)</h4>
<p>Learning single-index models with neural networks</p>
<details open>
<summary>Abstract</summary>
<p>Single-index models are given by a univariate link function applied to a one-dimensional projection of the input. Recent works have shown that the statistical complexity of learning this function class with online SGD is governed by the information exponent of the link function. In this talk, we discuss two variations of prior analyses. First, we consider the learning of single-index polynomials via SGD, but with reused training data. We show that two-layer neural networks optimized by an SGD-based algorithm can learn this target with almost linear sample complexity, regardless of the information exponent; this complexity surpasses the CSQ lower bound and matches the information-theoretic limit up to polylogarithmic factors. Next, we consider an in-context learning (ICL) setting where a nonlinear transformer is pretrained via gradient descent. We show that when the single-index target is drawn from a rank-<i>r</i> subspace, the in-context sample complexity of the pretrained transformer scales with the subspace dimension <i>r</i> but not the ambient dimension <i>d</i>, which outperforms estimators that only have access to the in-context data.</p>
</details>
<h4>Lechao Xiao (Google)</h4>
<p>Rethinking conventional wisdom in machine learning: from generalization to scaling</p>
<details open>
<summary>Abstract</summary>
<p>The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:</p>
<p>∙ Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?</p>
<p>∙ Model Comparison at Scale: How can we reliably and effectively compare models at the scale where only a single experiment is feasible?</p>
</details>
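<p>The "scaling law crossover" phenomenon is easy to illustrate with synthetic numbers. In the sketch below, two methods follow power-law compute fits L(C) = a·C^(-b) + e with the same irreducible loss; the coefficients are invented for illustration and are not taken from the paper. The method with the better constant wins at small compute, the one with the better exponent wins past the crossover scale.</p>
<pre><code>import numpy as np

# Two hypothetical compute scaling laws L(C) = a * C**(-b) + e.
# Method A has a better prefactor, method B a better exponent, so the
# curves cross: A wins at small compute, B wins at large compute.
aA, bA, eA = 2.0, 0.30, 0.01
aB, bB, eB = 4.0, 0.40, 0.01

def loss(a, b, e, C):
    return a * C ** (-b) + e

# crossover where aA * C**(-bA) = aB * C**(-bB), i.e. C* = (aB/aA)**(1/(bB-bA))
C_star = (aB / aA) ** (1.0 / (bB - bA))
print(f"crossover compute C* = {C_star:.3e}")

for C in [1e3, 1e5, C_star, 1e9, 1e12]:
    la, lb = loss(aA, bA, eA, C), loss(aB, bB, eB, C)
    winner = "A" if la == min(la, lb) else "B"   # smaller loss wins
    print(f"C = {C:9.3e}   L_A = {la:.4f}   L_B = {lb:.4f}   better: {winner}")
</code></pre>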
<h4>Yizhe Zhu (University of Southern California)</h4>
<p>Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity</p>
<details open>
<summary>Abstract</summary>
<p>For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank. In this talk, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We improve the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices.</p>
<p>Joint work with Dominik Stöger (KU Eichstätt-Ingolstadt).</p>
</details>
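<p>As a generic illustration of the non-convex approach discussed above, the sketch below runs factorized gradient descent for PSD matrix sensing, recovering M★ = U★U★ᵀ from Gaussian measurements y_i = ⟨A_i, M★⟩. The problem sizes, initialization, and step size are hypothetical choices for a small demo; this is the standard factorized baseline, not the decoupling analysis of the talk.</p>
<pre><code>import numpy as np

# Factorized gradient descent for PSD matrix sensing:
# minimize over U (d x r):  f(U) = (1/4m) * sum_i ( ⟨A_i, U Uᵀ⟩ - y_i )^2
# with symmetric Gaussian measurement matrices A_i and y_i = ⟨A_i, M_star⟩.
rng = np.random.default_rng(4)
d, r, m, steps, lr = 40, 3, 1200, 500, 0.01

U_star = rng.standard_normal((d, r))
M_star = U_star @ U_star.T

A = rng.standard_normal((m, d, d))
A = 0.5 * (A + np.transpose(A, (0, 2, 1)))        # symmetrize each A_i
y = np.einsum("kij,ij->k", A, M_star)             # measurements ⟨A_i, M_star⟩

U = 0.1 * rng.standard_normal((d, r))             # small random initialization
for t in range(steps):
    M = U @ U.T
    resid = np.einsum("kij,ij->k", A, M) - y      # ⟨A_i, U Uᵀ⟩ - y_i
    G = np.einsum("k,kij->ij", resid, A) / m      # (1/m) * sum_i resid_i * A_i
    U -= lr * (G @ U)                             # gradient of f in U (A_i symmetric)
    if (t + 1) % 100 == 0:
        err = np.linalg.norm(M - M_star) / np.linalg.norm(M_star)
        print(f"iter {t+1}: relative recovery error = {err:.3e}")
</code></pre>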