Tatiana Bubba (University of Bath)

Self-Supervised Deep Equilibrium models: learning from limited data

Abstract
Deep learning has emerged as a powerful tool for solving inverse problems in imaging, including computed tomography (CT). However, most approaches require paired training data with ground truth images, which can be difficult to obtain, e.g., in medical applications. In this talk, I present a self-supervised Deep Equilibrium (DEQ) framework for sparse-angle CT reconstruction that trains directly on undersampled measurements. Under suitable assumptions, it is possible to establish theoretical guarantees (i.e., self-supervised updates match those of fully-supervised training), for the case of non-unitary operators like that of CT. Numerical experiments on sparse-angle CT data demonstrate that this approach outperforms existing self-supervised methods while matching the performance of supervised approaches, achieving state-of-the-art results with as few as 16 projection angles.

Hung-Hsu (Edward) Chou (University of Pittsburg)

More is Less: Understanding Compressibility of Neural Networks via Implicit Regularization and Neural Collapse

Abstract

Despite their recent successes in various tasks, most modern machine learning algorithms lack theoretical guarantees, which are crucial to further development towards delicate tasks. One mysterious phenomenon is that, among infinitely many possible ways to fit data, the algorithms always find the « good » ones, even when the definition of « good » is not specified by the designers. In this talk I will cover the empirical and theoretical study of the connection between the good solutions in neural networks and the sparse solutions in compressed sensing with four questions in mind: What happens? When does it happen? Why does it happen? How can we improve it? The key concepts are implicit regularization, Bregman divergence, neural tangent kernel, and neural collapse.

Albert Cohen (Sorbonne University)

Stable nonlinear inversion and application and application to interface reconstruction from cell-averages

Abstract

In this lecture, we first discuss measures of optimal dimensionality reduction that may be thoughts as benchmark for efficient numerical methods in forward simulation or inverse problem. While the concept of n-width, introduced in 1936 by Kolmogorov, is well taylored to linear methods, the study of analogous concepts for nonlinear approximation is still the object of ongoing research. We then present a general framework for solving inverse problems using nonlinear approximation spaces. The main principles build up on the so called problems using nonlinear approximation spaces. The main principles build up on the so called Parametrized Background Data Weak method (PBDW), which can be thought as a linear counterpart.

As an application we study the reconstruction of sharp interfaces from coarse resolution cell averages for which linear methods are known to be uneffective. We discuss the convergence rates of these nonlinear reconstruction strategies and their optimality.

Nick Dexter (Florida State University)

Active and Compressive Approaches to High-Dimensional Function Approximation in Computational Science

Abstract

This talk discusses the sampling complexity of learning with ReLU neural networks and neural operators. For mappings belonging to relevant approximation spaces, upper bounds on the best-possible convergence rate of any learning algorithm are discussed, with respect to the number of samples. In the finite-dimensional case, these bounds imply a « theory-to-practice »-gap between parametric and sampling complexities. This result is extended to the infinite-dimensional setting of operator learning, with application to DeepONets and the Fourier neural operator. It is shown that the best-possible convergence rate in a Bochner Lp-norm is bounded by Monte-Carlo rates of order 1/p.

Diane Guignard (University of Ottawa)

Approximating partial differential equations with missing input data

Abstract

Mark Iwen (Michigan State University)

Sparse Spectral Methods for Solving High-Dimensional and Multiscale Elliptic PDEs

Abstract

Anastasis Kratsios (McMaster University)

Understanding Neural Networks Reasoning via Neural Network Representations of Algorithms

Abstract

One of the main open questions in contemporary AI is understanding what type of reasoning neural networks can perform if trained correctly. We partially answer that question by understanding any reasoning task as an emulation of a circuit whose gates determine the type of logic being implemented, e.g., Boolean circuits for predicate logical reasoning or tropical circuits for pure dynamic programming-based problem solving, hybrids thereof, etc…
We construct a meta-procedure for converting roughly any circuit into a feedforward neural network (NN), utilizing ReLU activations and skip connections, by replacing its gates from a list of standard ReLU MLP emulators. We show that, on any digital computer, any circuit can be emulated in this way, implying that there is no reasoning task which an NN cannot perform on a digital computer. Moreover, the number of neurons of the resulting NNs scale proportionally to the emulated circuit’s complexity. Several new applications are obtained, from guarantees of randomized circuit emulation to computation of a Turing machine at any pre-determined state by NNs. Importantly, we show that any universal approximation guarantee can be directly deduced from our result, implying a new connection between AI reasoning, computability theory, and approximation theory, which can provide insights into the internal reasoning processes of AI systems.
Joint work: Samuel Lanthaler

Sophie Langer (University of Twente)

tba

Abstract

tbc

Samuel Lanthaler (California Institute of Technology)

Theory-to-Practice Gap for Neural Networks and Neural Operators

Abstract

This talk discusses the sampling complexity of learning with ReLU neural networks and neural operators. For mappings belonging to relevant approximation spaces, upper bounds on the best-possible convergence rate of any learning algorithm are discussed, with respect to the number of samples. In the finite-dimensional case, these bounds imply a « theory-to-practice »-gap between parametric and sampling complexities. This result is extended to the infinite-dimensional setting of operator learning, with application to DeepONets and the Fourier neural operator. It is shown that the best-possible convergence rate in a Bochner Lp-norm is bounded by Monte-Carlo rates of order 1/p.

Jonas Latz (University of Manchester)

Deep Gaussian processes: Bayesian learning for large-scale data

Abstract

Gaussian processes (GPs) have gained popularity as flexible machine learning models for regression and function approximation with an in-built method for uncertainty quantification. GPs suffer when the amount of training data is large or when the underlying function contains multiscale features that are difficult to represent by an isotropic kernel. The training of GPs with large scale data is often performed through inducing point approximations (also known as sparse GP regression), where the size of the covariance matrices in GP regression is reduced considerably through a greedy search on the data set. Deep Gaussian processes have recently been introduced as hierarchical models that resolve multi-scale features by composing multiple Gaussian processes. Whilst GPs can be trained through a simple analytical formula, deep GPs require a sampling or, more usual, a variational approximation. Variational approximations lead to large-scale stochastic, non-convex optimisation problems and the resulting approximation tends to represent uncertainty incorrectly.

In this work, we combine variational learning with MCMC to develop a particle-based expectation-maximisation method to simultaneously find inducing points within the large-scale data (variationally) and accurately train the Gaussian processes (sampling-based). The result is a highly efficient and accurate methodology for deep GP training on large scale data. We test the method on standard benchmark problems.

Deanna Needell (University of California, Los Angeles)

Randomized Kaczmarz methods with beyond-Krylov convergence

Abstract

Randomized Kaczmarz methods form a family of linear system solvers that converge by repeatedly projecting their iterates onto randomly sampled equations. While effective in some contexts, such as highly over-determined least squares, Kaczmarz methods are traditionally deemed secondary to Krylov subspace methods, since this latter family of solvers can exploit outliers in the input’s singular value distribution to attain fast convergence on many ill-conditioned systems. In this talk, we introduce an accelerated randomized block Kaczmarz algorithm that exploits outlying singular values in the input to attain a fast Krylov-style convergence. Moreover, we show that our approach captures large outlying singular values provably faster than popular Krylov methods. We also develop an optimized variant for positive semidefinite systems, demonstrating empirically that it is competitive in arithmetic operations with both CG and GMRES on a collection of benchmark problems. To attain these results, we introduce several novel algorithmic improvements to the Kaczmarz framework, including adaptive momentum acceleration, Tikhonov-regularized projections, and a memoization scheme for reusing information from previously sampled equation blocks. This work is joint with Michał Dereziński, Elizaveta Rebrova, and Jiaming Yang.

Clarice Poon (University of Warwick)

Learning from samples: Inverse problems over measures

Abstract
Estimating parameters from samples of an optimal probability distribution is crucial for applications ranging from socio-economic modeling to biological system analysis. In these cases, the probability distribution arises as the solution to an optimization problem that models how agents interact in static settings or how the system evolves over time in dynamic scenarios. In this work, we investigate the stability of a broad class of convex methods for estimating key parameters—typically a cost function for static problems and a potential function for dynamic problems. Our approach is based on unbalanced optimal transport (UOT) with entropic regularization. Unlike classical optimal transport, UOT allows for probability measures with varying total mass, making it well-suited for handling noisy or incomplete data and for modeling dynamical systems. I will show how this problem can be tackled using a convex loss function and discuss the associated recovery properties, including sample complexity bounds, as well as regularization properties. This is joint work with Francisco Andrade and Gabriel Peyré.

Andrew Stuart (California Institute of Technology)

Allowing Image And Text Data To Communicate

Abstract

A fundamental problem in artificial intelligence is the question of how to simultaneously deploy data from different sources such as audio, image, text and video; such data is known as multimodal. In this talk I will focus on the canonical problem of aligning image and text data, and describe some of the mathematical ideas underlying the challenge of allowing them to communicate. I will describe the encoding of text and image in Euclidean spaces and describe contrastive learning methods to identify and learn embeddings which align these two modalities; I will also describe the attention mechanism, a form of nonlinear
correlation in vector-valued sequences. Attention turns out to be useful beyond this specific context, and I will show how it may be used to design and learn maps between Banach spaces or between spaces of probability measures.

Giang Tran (University of Waterloo)

Fast Multipole Attention for Transformer Neural Networks

Abstract

based models to long sequences. To address this, we present Fast Multipole Attention (FMA), a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length n from O(n^2) to O(n \log n) or O(n), while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into O( \log n) levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. This multi-level divide-and-conquer strategy is inspired by fast summation methods from n-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks as well as on image classification. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers and state-of-the-art vision transformers in terms of memory size and accuracy. The FMA mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.

Rebecca Willett (University of Chicago)

Off-the-shelf Algorithmic Stability

Abstract

Algorithmic stability holds when our conclusions, estimates, fitted models, predictions, or decisions are insensitive to small changes to the training data. Stability has emerged as a core principle for reliable data science, providing insights into generalization, cross-validation, uncertainty quantification, and more. Whereas prior literature has developed mathematical tools for analyzing the stability of specific machine learning (ML) algorithms, we study methods that can be applied to arbitrary learning algorithms to satisfy a desired level of stability. First, I will discuss how bagging is guaranteed to stabilize any prediction model, regardless of the input data. Thus, if we remove or replace a small fraction of the training data at random, the resulting prediction will typically change very little. Our analysis provides insight into how the size of the bags (bootstrap datasets) influences stability, giving practitioners a new tool for guaranteeing a desired level of stability. Second, I will describe how to extend these stability guarantees beyond prediction modeling to more general statistical estimation problems where bagging is not as well known but equally useful for stability. Specifically, I will describe a new framework for stable classification and model selection by combining bagging on class or model weights with a stable, “soft” version of the argmax operator. This is joint work with Jake Soloff and Rina Barber.

Jakob Zech (Heidelberg University)

Statistical Learning Theory for Neural Operators

Abstract