Alexandre d’Aspremont (ENS Paris)

Frank-Wolfe meets Shapley-Folkman: a systematic approach for solving nonconvex separable problems

Abstract

We consider separable nonconvex optimization problems under affine constraints. For these problems, the Shapley-Folkman theorem provides an upper bound on the duality gap as a function of the nonconvexity of the objective functions, but does not provide a systematic way to construct primal solutions satisfying that bound. In this work, we develop a two-stage approach to do so. The first stage approximates the optimal dual value with a large set of primal feasible solutions. In the second stage, this set is trimmed down to a primal solution by computing (approximate) Carathéodory representations. The main computational requirement of our method is tractability of the Fenchel conjugates of the component functions and their (sub)gradients. When the function domains are convex, the method recovers the classical duality gap bounds obtained via Shapley-Folkman. When the function domains are nonconvex, the method also recovers classical duality gap bounds from the literature, based on a more general notion of nonconvexity.
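
For readers new to this line of work, a rough sketch of the setting may help; the notation below is ours, and the precise constants are the subject of the talk. In the classical Aubin-Ekeland form, the Shapley-Folkman duality gap bound for a separable problem with m affine constraints reads:

```latex
% Separable nonconvex problem with m affine constraints:
\min_{x_1, \dots, x_n} \; \sum_{i=1}^n f_i(x_i)
\quad \text{s.t.} \quad \sum_{i=1}^n A_i x_i \le b, \qquad A_i \in \mathbb{R}^{m \times d_i}.
% With f_i^{**} the biconjugate (convex envelope) of f_i and
% \rho(f_i) = \sup_x \big( f_i(x) - f_i^{**}(x) \big) its nonconvexity,
% the primal and dual optimal values satisfy
0 \;\le\; p^\star - d^\star \;\le\; (m+1) \max_{1 \le i \le n} \rho(f_i).
```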

Amitabh Basu (Johns Hopkins University)

Complexity of convex optimization with integer variables

Abstract

We consider the problem of minimizing a convex function under convex constraints, where some or all of the decision variables are constrained to be integer, with access to first-order oracles for the objective function and separation oracles for the constraints. We focus on the information complexity of the problem, i.e., the minimum number of oracle calls to solve the problem to a desired accuracy. We will present nearly tight upper and lower bounds for the problem, extending classical results from continuous convex optimization. We also initiate the study of, and obtain the first set of results on, information complexity under oracles that only reveal partial first-order information, e.g., where one can only make a binary query on the function value or (sub)gradient at a given point. Finally, we will discuss the surprisingly unexplored topic of the sample complexity of stochastic convex mixed-integer optimization. We will close the talk with some open questions.
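
As a hedged sketch of the problem class and oracle model (our notation, not necessarily the speaker's):

```latex
% Mixed-integer convex minimization with oracle access:
\min \; f(x, y)
\quad \text{s.t.} \quad (x, y) \in C, \qquad x \in \mathbb{Z}^{n}, \; y \in \mathbb{R}^{d},
% where f is convex and accessed through a first-order oracle
% (function values and subgradients), and the convex set C is accessed
% through a separation oracle. The information complexity is the least
% number of oracle calls any algorithm needs to reach accuracy \epsilon.
```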

Stephen Becker (University of Colorado Boulder)

0th, 1st and 2nd order optimization methods for learning problems

Abstract

Learning and optimization are closely intertwined disciplines, and we discuss some aspects of the interplay between them. Machine learning has created classes of interesting problems for optimizers, and has given optimizers new tools and computational paradigms such as GPU computing and mature automatic-differentiation frameworks. Optimizers are working on methods to solve these classes of problems, often simply to minimize empirical risk efficiently, but sometimes also to give statistical learning guarantees about generalization error.

We offer some comments on the state of the field, and then focus on specifics that have been investigated by our research group. In particular, we discuss 0th, 1st and 2nd order optimization methods, all inspired by machine learning problems. 0th order (or "derivative-free") methods were classically motivated by PDE-constrained optimization for time-dependent PDEs, but are now also motivated by machine learning applications such as hyperparameter tuning. We present some of our methods, including one designed to take advantage of "low-fidelity" information available in specific learning problems such as adversarial attacks and soft prompting of LLMs.
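
To make the zeroth-order setting concrete, here is a minimal sketch of the generic two-point gradient estimator; this is standard textbook machinery, not the speaker's low-fidelity method.

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-4, num_dirs=10, rng=None):
    """Two-point zeroth-order gradient estimator: average finite
    differences of f along random Gaussian directions, using only
    function evaluations (no derivatives)."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_dirs

# Example: estimate the gradient of 0.5*||x||^2, whose true gradient is x.
f = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
print(zeroth_order_grad(f, x, num_dirs=200))  # approximately x itself
```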

For 1st order methods, we revisit classic stochastic gradient descent with a new convergence analysis, and discuss the implications for generalization error, as well as the shortcomings of this "modular" analysis.
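
For reference, the iteration being revisited is the classic stochastic gradient step (our notation):

```latex
x_{k+1} = x_k - \alpha_k \, g_k, \qquad \mathbb{E}[\, g_k \mid x_k \,] = \nabla f(x_k),
% with \alpha_k the stepsize and g_k an unbiased stochastic gradient.
```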

Finally, we look at 2nd order methods, which are still not common for training large machine learning models due to the computational cost of the linear algebra in these high-dimensional spaces. We analyze a regularized variant of an old heuristic, saddle-free Newton, and then discuss several efficient matrix-free approaches to the linear algebra that exploit access to Hessian-vector products via automatic-differentiation software.
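
The matrix-free idea rests on the fact that automatic differentiation can evaluate Hessian-vector products without ever forming the Hessian. A minimal sketch in JAX, on a toy loss rather than the methods from the talk:

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Toy smooth objective standing in for a training loss.
    return jnp.sum(jnp.log1p(jnp.exp(w))) + 0.5 * jnp.dot(w, w)

def hvp(f, w, v):
    """Hessian-vector product via forward-over-reverse autodiff:
    differentiate the gradient of f in the direction v, so the
    n-by-n Hessian is never built."""
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

w = jnp.array([0.5, -1.0, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(loss, w, v))  # the action of the Hessian at w on v
```

Iterative solvers such as conjugate gradient or Lanczos then require only this product, which is what makes second-order steps feasible at scale.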

James V. Burke (University of Washington)

A Class of Structured Convex Optimization Problems over Probability Measures

Abstract

We consider the problem of optimizing the composition of a closed proper convex function with a continuous linear operator over the set of probability measures on a compact set. The duality theory is discussed along with first-order optimality conditions. A non-convex embedding into finite dimensions is described and the structure of the solution set is discussed. A number of classical applications of these results to non-parametric mixture modeling, optimal design, stochastic programming, and entropy maximization are briefly presented. If time permits, preliminary numerical experiments will also be shown.
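
A hedged reading of the problem class, in our notation (the talk's precise setting may differ):

```latex
% Minimize a closed proper convex \phi composed with a continuous
% linear map A over the probability measures on a compact set \Omega:
\min_{\mu \in \mathcal{P}(\Omega)} \; \phi(A\mu),
\qquad A\mu = \Big( \textstyle\int_\Omega a_1 \, d\mu, \; \dots, \; \int_\Omega a_n \, d\mu \Big),
% with a_1, \dots, a_n continuous on \Omega. Mixture modeling and optimal
% design both fit this template for suitable choices of \phi and the a_j.
```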

The material presented is joint work with Yeongcheon Baek and Chris Jordan-Squire.

Ying Cui (University of California Berkeley)

Selecting the Best Subgradient: A Descent-Oriented Subgradient Algorithm for Nonsmooth Optimization

Abstract

Subgradients do not vary continuously for nonsmooth functions, making the design of convergent descent methods particularly challenging, especially since a negative subgradient does not necessarily yield descent. Classical methods such as bundle techniques and gradient sampling attempt to overcome this by aggregating subgradients from nearby points, but these approaches are often complex and may lack deterministic guarantees.
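
A standard two-dimensional example, included here for concreteness (ours, not taken from the talk), shows why a negative subgradient can fail to be a descent direction:

```latex
% Take the convex function
f(x, y) = |x| + 2|y|, \qquad \partial f(1, 0) = \{1\} \times [-2, 2].
% Choosing the subgradient g = (1, 2) \in \partial f(1, 0), for 0 < t < 1:
f\big((1,0) - t\,g\big) = |1 - t| + 2\,|{-2t}| = 1 + 3t \;>\; 1 = f(1, 0),
% so moving along -g strictly increases f, even though f is convex.
```
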
In this talk, I first present a unifying principle behind these methods and introduce a general framework for provably convergent descent algorithms based only on first-order and zeroth-order information. Central to this framework is a simple yet powerful technique called subgradient regularization, which generates stable descent directions using only function values and subgradients at a single point, eliminating the need for neighborhood sampling. The framework captures a broad class of nonsmooth marginal functions, including pointwise maxima and minima of smooth functions. Notably, when applied to composite functions, our approach recovers the classical prox-linear method and offers a new dual perspective. I will conclude with numerical experiments showcasing the method's robustness on challenging nonsmooth problems, such as Nesterov's Chebyshev-Rosenbrock function.

Frank E. Curtis (Lehigh University)

A Stochastic-Gradient-based Interior-Point Method for Inequality-Constrained Continuous Optimization

Abstract

tba

Mateo Diaz (Johns Hopkins University)

tba

Abstract

tba

Maryam Fazel (University of Washington)

tba

Abstract

tba

Quanquan Gu (University of California, Los Angeles)

Unleashing the power of variance reduction for training large models

Abstract

Training deep neural networks, and more recently large language models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this talk, I will introduce a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within this framework, I will introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. In addition, I will draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
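
For intuition, the recursive-momentum idea at the core of such variance-reduction schemes can be sketched as follows. This is a generic STORM-style update in our own naming; MARS itself adds the scaling and preconditioning, which are not shown here.

```python
import numpy as np

def recursive_momentum_step(w, w_prev, d_prev, grad_fn, batch, lr=0.05, a=0.1):
    """One STORM-style variance-reduced step (generic sketch, not MARS).
    grad_fn(w, batch) returns a stochastic gradient; evaluating it at both
    w and w_prev on the SAME batch is what reduces the variance of d."""
    g = grad_fn(w, batch)
    g_prev = grad_fn(w_prev, batch)
    d = g + (1.0 - a) * (d_prev - g_prev)  # a = 1 recovers plain SGD
    return w - lr * d, d

# Toy usage on a noisy quadratic: the gradient of 0.5*||w||^2 is observed
# with additive noise, where "batch" plays the role of the shared sample.
rng = np.random.default_rng(0)
grad_fn = lambda w, noise: w + noise
w = w_prev = np.ones(5)
d = grad_fn(w, 0.1 * rng.standard_normal(5))
for _ in range(500):
    batch = 0.1 * rng.standard_normal(5)
    w_new, d = recursive_momentum_step(w, w_prev, d, grad_fn, batch)
    w_prev, w = w, w_new
print(np.linalg.norm(w))  # should be close to zero
```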

Satyen Kale (Google Research)

tba

Abstract

tba

Russell Luke (University of Göttingen)

Markov chains and random operator splitting

Abstract

In order to "learn" the probability density of electrons in a molecule from pulsed X-ray diffraction samples (about 10^9 per reconstruction), we compute a Markov chain of random variables whose stationary distributions correspond to those of the electrons of the molecule. To understand the convergence of this procedure, we build on techniques we have established for determining rates of convergence of numerical methods for inconsistent nonconvex optimization, lifted to the setting of probability spaces, to arrive at a convergence analysis for noncontractive Markov operators. This approach has many other applications, for instance the analysis of distributed randomized algorithms in machine learning.
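
One way to read the construction, in our own gloss: a random operator-splitting iteration induces a Markov chain whose distributional dynamics are driven by a (generally noncontractive) Markov operator:

```latex
% Random fixed-point iteration built from a family of operators T_\xi:
X_{k+1} = T_{\xi_k}(X_k), \qquad \xi_k \ \text{i.i.d.},
% which induces a Markov operator \mathcal{P} on distributions,
\mu_{k+1} = \mathcal{P}\mu_k,
% so that convergence is analyzed at the level of the laws \mu_k
% and the stationary distributions of \mathcal{P}.
```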

Antonio Silveti-Falls (Université Paris-Saclay)

Training deep learning models with norm-constrained LMOs

Abstract

In this talk, I discuss optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball and their application to training huge neural networks. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
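
To fix ideas, here is a minimal sketch of an LMO over a Euclidean norm ball and the resulting update; the algorithms in the talk use stochastic estimates and layer-wise norm choices that are not reproduced here.

```python
import numpy as np

def lmo_l2_ball(g, radius=1.0):
    """Linear minimization oracle over the l2 ball:
    argmin over ||s||_2 <= radius of <g, s>, i.e. -radius * g / ||g||_2."""
    norm = np.linalg.norm(g)
    return np.zeros_like(g) if norm == 0.0 else -(radius / norm) * g

def lmo_step(w, grad, lr=0.1, radius=1.0):
    """Unconstrained update built from the LMO: step toward the extreme
    point of the norm ball rather than along the raw gradient, making the
    step length insensitive to the gradient's magnitude."""
    return w + lr * lmo_l2_ball(grad, radius)
```

Swapping the l2 ball for, say, a spectral-norm ball on matrix-shaped parameters changes the oracle and hence the geometry of the update, which is one way to read the per-architecture norm choices mentioned above.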

Shoham Sabach (Technion)

From reinforcement learning to bilevel optimization: Algorithms and theory

Abstract

In this talk, we explore a novel optimization perspective on learning an accurate value function, a critical component in many reinforcement learning (RL) approaches. This perspective reveals the potential of bilevel optimization, an emerging field that has garnered increasing interest from researchers in optimization and machine learning due to its wide-ranging applications. We will discuss recent advancements in bilevel optimization, examining algorithmic innovations as well as theoretical convergence and complexity analyses.
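
For readers unfamiliar with the template, a generic bilevel program has the form below (our notation); casting value-function learning in this mold is the subject of the talk:

```latex
% Upper-level objective F evaluated at a lower-level minimizer y^\star(x):
\min_{x} \; F\big(x, y^\star(x)\big)
\quad \text{s.t.} \quad y^\star(x) \in \arg\min_{y} \; G(x, y).
```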

Steve Vavasis (University of Waterloo)

Data-dependent running time for first-order methods for classification problems

Abstract

Large-scale problems in data science are often modeled with optimization, and the optimization model is usually solved with first-order methods that may converge at a sublinear (slow) rate. Therefore, it is of interest to terminate the optimization algorithm as soon as the underlying data science task is accomplished. We consider FISTA for solving binary classification problems modeled either with a support-vector machine or, robustly, as ellipsoid separation. We propose data-dependent termination criteria for FISTA that allow for early termination when the data is well conditioned. Joint work with Matthew Hough of the University of Waterloo.
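
For concreteness, here is textbook FISTA with a pluggable stopping test; the data-dependent criteria proposed in the talk are not reproduced, so `stop_test` below is a placeholder.

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, stop_test, max_iter=10_000):
    """Textbook FISTA for min f(x) + g(x), f smooth, g prox-friendly,
    with a user-supplied early-termination test."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for k in range(max_iter):
        x = prox_g(y - step * grad_f(y), step)   # proximal gradient step
        if stop_test(x):   # e.g., "classifier already separates the data"
            return x, k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # momentum extrapolation
        x_prev, t = x, t_next
    return x, max_iter
```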

Andre Wibisono (Yale University)

Mixing Time of the Proximal Sampler in Relative Fisher Information via Strong Data Processing Inequality

Abstract

Sampling is a fundamental algorithmic task in statistics and computer science. Sampling has a deep relationship with optimization, for example via the perspective of sampling as optimization in the space of probability distributions. Many sampling algorithms are based on discretizing the Langevin dynamics, and they have mixing time guarantees which parallel the convergence guarantees from optimization. In this talk, we study the Proximal Sampler algorithm, which is an approximate proximal discretization of the Langevin dynamics. We show that when the target probability distribution is strongly log-concave, the relative Fisher information converges exponentially fast along the Proximal Sampler; this matches the exponential convergence of the squared gradient norm in strongly convex optimization. We discuss our proof which proceeds by establishing a strong data processing inequality for relative Fisher information along the Gaussian channel under strong log-concavity, and a weak data processing inequality along the reverse Gaussian channel for the target distribution.
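
For reference, the two alternating steps of the Proximal Sampler for a target π proportional to e^{-V}, with step size η > 0 (standard description):

```latex
% Forward step: pass the current iterate through a Gaussian channel,
y_k \sim \mathcal{N}(x_k, \eta I),
% Backward step: sample from the posterior of x given y_k
% (the "restricted Gaussian oracle"),
x_{k+1} \sim \pi^{X \mid Y}(\cdot \mid y_k)
\;\propto\; \exp\!\Big( -V(\cdot) - \tfrac{1}{2\eta}\|\cdot - y_k\|^2 \Big).
```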

Lin Xiao (Facebook AI)

Stochastic approximation with block coordinate optimistic stepsizes

Abstract

We consider stochastic approximation methods with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance of the next iterate from an optimal point. These stepsize rules use online estimates of the mean and second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as using a particular heuristic for such estimation. By leveraging a simple conditional estimator, we derive variants that require fewer hyperparameters and optimizer states but achieve comparable performance. In addition, our convergence analysis relies on a simple aiming condition that assumes neither smoothness nor convexity, and thus has broader applicability.
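
As a point of reference for the interpretation mentioned above, the Adam update can be read as per-coordinate stepsizes built from online mean and second-moment estimates of the search direction. A minimal sketch of that baseline follows; the talk's conditional estimator and stepsize rule differ in detail.

```python
import numpy as np

def adam_like_step(w, g, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-coordinate stepsizes from running estimates of the mean (m) and
    second moment (v) of the stochastic direction g: the familiar Adam
    update, reinterpreted in the talk within the optimistic-stepsize view."""
    m = beta1 * m + (1 - beta1) * g          # running mean estimate
    v = beta2 * v + (1 - beta2) * g * g      # running second-moment estimate
    m_hat = m / (1 - beta1 ** (k + 1))       # bias corrections
    v_hat = v / (1 - beta2 ** (k + 1))
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```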