Lijia Zhou (周里佳) | publications

2024

ICLR

An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression

Lijia Zhou, James B. Simon, Gal Vardi, and Nathan Srebro

In International Conference on Learning Representations

Abs PDF

We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an “agnostic” view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (qv Mallinar et al. 2022).

2023

Thesis

A Statistical Learning Theory for Models with High Complexity

Lijia Zhou

PhD thesis (Department of Statistics, University of Chicago)

Abs Official Slides

Understanding why high-dimensional estimators can generalize beyond finite training samples is a fundamental problem in statistical learning theory. The traditional intuition, as suggested by Occam’s razor, is that models with low complexity tend to generalize better. We can often find simple models that explain the training data well if the high-dimensional data distribution has some hidden low-dimensional structure (for example, sparse linear regression and low-rank matrix recovery). However, contrary to our traditional intuition, complex models which interpolate noisy training labels can also enjoy good generalization in some settings. This phenomenon, which we call "interpolation learning," has significantly challenged our theoretical foundation of statistical learning. In this thesis, we present a novel Moreau envelope generalization theory to establish the concentration of measure in high dimensions. Since our result can precisely quantify the role of model complexity in generalization error, we can establish strong consistency results even though the norm of the high-dimensional interpolants that we consider diverges. In addition to proving sharp non-asymptotic bounds for interpolants in various contexts, we also recover versions of classical results from the compressed sensing and high-dimensional statistics literature. Applications of our theory include kernel ridge regression, max-margin classification, phase retrieval, matrix sensing, and some simple neural networks.
NeurIPS

Uniform Convergence with Square-Root Lipschitz Loss

Lijia Zhou, Zhen Dai, Frederic Koehler, and Nathan Srebro

In Advances in Neural Information Processing Systems

Abs PDF

We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipschitz losses, which includes also non-smooth loss functions appropriate for studying phase retrieval and ReLU regression, as well as rederive and better understand “optimistic rate” and interpolation learning guarantees.
JDS

Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

Lijia Zhou, Frederic Koehler, Danica J. Sutherland, and Nathan Srebro

In ACM/IMS Journal of Data Science, Volume 1 Issue 2

Abs Official PDF

We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis recovers the guarantee from Koehler et al. (2021), which tightly characterizes the population risk of low-norm interpolators under the benign overfitting conditions. Our optimistic rate bound, though, also analyzes predictors with arbitrary training error. This allows us to recover some classical statistical guarantees for ridge and LASSO regression under random designs, and helps us obtain a precise understanding of the excess risk of near-interpolators in the over-parameterized regime.

2022

NeurIPS

A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models

Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J. Sutherland, and Nathan Srebro

In Advances in Neural Information Processing Systems

Abs Official PDF Poster

We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss can control the test error under all Moreau envelopes of the loss. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal l2-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation.

2021

NeurIPS

Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

Frederic Koehler, Lijia Zhou, Danica J. Sutherland, and Nathan Srebro

In Advances in Neural Information Processing Systems (Oral)

Abs Official PDF Poster Slides

We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class’s Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.

2020

NeurIPS

On Uniform Convergence and Low-Norm Interpolation Learning

Lijia Zhou, Danica J. Sutherland, and Nathan Srebro

In Advances in Neural Information Processing Systems (Spotlight)

Abs Official PDF Poster Slides

We consider an underdetermined noisy linear regression model where the minimum-norm interpolating predictor is known to be consistent, and ask: can uniform convergence in a norm ball, or at least (following Nagarajan and Kolter) the subset of a norm ball that the algorithm selects on a typical input set, explain this success? We show that uniformly bounding the difference between empirical and population errors cannot show any learning in the norm ball, and cannot show consistency for any set, even one depending on the exact algorithm and distribution. But we argue we can explain the consistency of the minimal-norm interpolator with a slightly weaker, yet standard, notion: uniform convergence of zero-error predictors in a norm ball. We use this to bound the generalization error of low- (but not minimal-) norm interpolating predictors.