Machine Learning and Statistics

Moise Blanchard
Universal Online Learning

We study the subject of optimistic universal online learning with non-i.i.d. processes. The notion of universal consistent learning was defined by Hanneke [Hanneke, 2021] in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. In the case where the loss function is unbounded, [Hanneke, 2021] gave a characterization of learnable processes and proved the existence of such learning rules. We give a drastically simpler formulation of this characterization which shows that the simple memorization learning rule is adequate for this context. When the loss function is bounded, we show that learnable processes are invariant from the choice of output space which allows to restrict the analysis to the binary classification setting. Finally, we give a full characterization of processes for which learning is possible in the bounded loss setting and show that a variant from the 1-nearest neighbor algorithm is optimistically universal. This result holds for general separable input Borel space and separable output space with bounded loss. This closes the four main open problems posed in [Hanneke, 2021] on universal learning.

Julia Briden
Switching State-Space Models for Thermospheric Density Estimation

Accurate estimates of thermospheric density are essential for low Earth orbit prediction. Current machine learning-based Reduced-Order Modeling (ROM) techniques allow for real-time data assimilation and forecasting without the computational costs of physics based models. While these models effectively capture the systems’ nonlinear dynamics, they do not generalize well for varying levels of space weather activity. This work addresses this issue by extending machine learning-based ROM techniques to include Switching State-Space Models (SSSM).

Raaz Dwivedi
Distribution Compression in Near-Linear Time

In distribution compression, one aims to accurately summarize a probability distribution \(P\) using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling \(n\) points from a Markov chain and identifying \(\sqrt{n}\) points with \(\tilde{O}(1/\sqrt{n})\) discrepancy to \(P\). Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size \(n\). To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of 4 in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers \(\sqrt{n}\) points with \(O(\sqrt{\log n/n})\) integration error and better-than-Monte-Carlo maximum mean discrepancy in \(O(n \log^3 n)\) time and \(O(\sqrt{n} \log^2 n)\) space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.

Nicholas Johnson
Sparse and Low Rank Matrix Decomposition: A Discrete Optimization Approach

We study the Sparse Plus Low Rank decomposition problem (SLR), which is the problem of decomposing a corrupted data matrix \(D\) into a sparse matrix \(Y\) containing the perturbations plus a low rank matrix \(X\). SLR is a fundamental problem in Operations Research and Machine Learning arising in many applications such as data compression, latent semantic indexing, collaborative filtering and medical imaging. We introduce a novel formulation for SLR that directly models the underlying discreteness of the problem. For this formulation, we develop an alternating minimization heuristic to compute high quality solutions and a novel semidefinite relaxation that provides meaningful bounds for the solutions returned by our heuristic. We further develop a custom branch and bound routine that leverages our heuristic and convex relaxation that solves small instances of SLR to certifiable near-optimality. Our heuristic can scale to \(n=10000\) in hours, our relaxation can scale to \(n=200\) in hours, and our branch and bound algorithm can scale to \(n=25\) in minutes. Our numerical results demonstrate that our approach outperforms existing state-of-the-art approaches in terms of the MSE of the low rank matrix and that of the sparse matrix.

Sung Min (Sam) Park
Analyzing ML Pipeline with Datamodels

Current supervised machine learning models rely on an abundance of training data. Yet, understanding the underlying structure and biases of this data—and how they impact models—remains challenging. We present a new conceptual framework, datamodels, for directly modeling predictions as functions of training data. We instantiate our framework with simple parametric models (e.g. linear) and apply it to deep neural networks trained on standard vision datasets. Despite the complexity of the underlying process (e.g. SGD on overparameterized neural networks), the resulting datamodels can accurately predict model outputs as linear functions of the presence of different training examples. These datamodels, in turn, give rise to powerful tools for analyzing the ML pipeline: a predictive model for counterfactual impact of removing different training data; a data similarity metric that is faithful to the model class under study; and a rich embedding and graph that gives a principled way to study structure in the data, such as memorized subpopulations.

Austin Iglesias Saragih
Using Interpretable Machine Learning to Predict and Improve On-Time Delivery

With the growth of e-commerce bolstered by the COVID-19 pandemic, consumers require online deliveries to be fast and reliable. Late parcels are expensive to companies as unreliable delivery performance leads to the loss of costumers. Additionally, it is common practice for retailers and couriers to offer customers compensation or pay penalties for delays, further costing companies. To avoid losing customers and incurring late charges, it is in companies' best interests to 1) have trustworthy models to determine and rectify which parcels in the last mile delivery process are likely to arrive late and 2) find the most significant operations improvement opportunities. Unfortunately, unexplained black-box models are unable to reach these goals. In contrast, interpretable machine learning can both predict and react to late parcels, resolving last-mile delivery issues as well as find operational improvement opportunities. Finally, companies can study the trade-offs between late delivery penalties and intervention costs to identify the best solutions for delays, such as express delivery of to-be-late parcels or informing the affected parties in advance. This study has a four-step framework. First, we leverage next-day parcel delivery data to find the best interpretable model for predicting on-time delivery. Second, we measure the model's confidence through robustness and stability measurements. Third, we identify the most significant areas for operations improvement through SHapley Additive exPlanations (SHAP) feature importance ranking. Finally, we evaluate the perscriptive trade-off between identifying and expediting deliveries and penalties for late deliveries along the steps of the delivery process. Our findings indicate that Optimal Classification Tree (OCT) is the best-performing interpretable model with the high accuracy (0.89). We also confirm that faster pickup lead times and earlier courier assignments can reduce late deliveries. Overall, this study initiates the development of interpretable, stable, and robust predictive analytics in last-mile delivery. Further research in this area will be instrumental in improving delivery operations and customer service.

Will Stephenson
Measuring the Sensitivity of Gaussian Processes to Kernel Choice

Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. But the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.

Yunzong Xu
Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data. Offline RL -- particularly when coupled with (value) function approximation to allow for generalization in large or continuous state spaces -- is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all Q-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits of offline RL are not well understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy.


Information and Networks

Zeyu Jia
Non-Parametric Threshold for Smoothed Empirical Wasserstein Distance

Consider an empirical measure \(P_n\) induced by \(n\) iid samples from a \(d\)-dimensional \(K\)-subgaussian distribution \(P\). We show that when \(K<\sigma\), the Wasserstein distance \(W_2^2(P_n * \mathcal{N}(0,\sigma^2 I_d), P*\mathcal{N}(0,\sigma^2 I_d))\) converges at the parametric rate \(O(1/n)\), and when \(K > \sigma\), there exists a \(K\)-subgaussian distribution \(P\) such that \(W_2^2(P_n * \mathcal{N}(0,\sigma^2 I_d), P*\mathcal{N}(0,\sigma^2 I_d)) = \omega(1/n)\). This resolves the open problems in [Goldfeld, 2020], closes the gap between where we get parametric rate and where we do not have parametric rate. Our result provides a complete characterization of the range of parametric rates for subgaussian \(P\). In addition, when \(K<\sigma\), we establish more delicate results about the convergence rate of W2 distance squared. Assuming the distribution is one dimensional, we provide both the lower bound and the upper bound, demonstrating that the rate changes gradually from \(\Theta(1/\sqrt{n})\) to \(\Theta(1/n)\) as \(\sigma/K\) goes from \(0\) to \(1\). Moreover, we also establish that \(D_{KL}(P_n * \mathcal{N}(0,\sigma^2 I_d)\|P*\mathcal{N}(0,\sigma^2 I_d)) = \tilde{\mathcal{O}}(1/n)\). These results indicate a dichotomy of the convergence rate between the W2 distance squared and the KL divergence, resulting in the failure of \(T_{2}\)-transportation inequality when \(\sigma < K\), hence also resolving the open problem in [Wang, 2016] about whether \(K < \sigma\) is necessary in proving whether the log-Sobolev inequality holds for \(\mathbb{P} * \mathcal{N}(0, \sigma^{2})\).

Nitya Mani
Minimum Feedback Arc Sets in Directed Graphs

For a directed graph \(G\), let \(\beta(G)\) denote the size of a minimum feedback arc set, a smallest subset of edges whose deletion leaves an acyclic subgraph. We show that if \(G\) is an oriented graph with \(m\) edges, then \(\beta(G) \leq m/2-O(m^{3/4})\) and that this bound is tight in general, in the process giving a randomized polynomial time algorithm to find a small feedback arc set. When our oriented graph has additional structure, we are able to strengthen these bounds. If the underlying undirected graph of \(G\) is free of a family of undirected subgraphs, we get a tight (up to a constant factor in the lower order surplus term) bound on \(\beta(G)\), implying that for every exponent \(0 \le \alpha \le 1\) there is a family of directed graphs with \(m/2 - c_1 m^{3/4+\alpha} \le \beta(G) \le m/2 - c_2 m^{3/4 + \alpha}\). We also observe that if \(G\) has a fixed forbidden bipartite subgraph \(H\), the exponent \(3/4\) can be improved. Finally, we give a characterization of quasirandom directed graphs via minimum feedback arc sets.

Manon Revel
Representation in Binary Decision-Making

In response to the frustration with the current democratic institutions that failed to adapt to the evolving society, alternative decision-making processes emerged in recent years. Requests for more representativity of the citizenry resulted in the formation of citizens assemblies, but the latter do not guarantee that the randomly chosen representatives are competent to decide on the issues at stake. After investigating the performance of classic representative democracy, we study another voting paradigm, fluid democracy, that relies on letting agents choose between voting themselves and transitively delegating their votes to better-informed agents in their neighborhood. Such a system allows for more direct representativity while also enabling competent voters to be endogenously identified in a social network. While fluid democracy has been viewed as a system that can combine the best aspects of direct and representative democracy, it can also result in situations where few voters amass a large amount of influence. To analyze the impact of this shortcoming, we consider what has been called an epistemic setting, where voters decide on a binary issue for which there is a ground truth. Previous work has shown that under certain assumptions on the delegation mechanism, the concentration of power is so severe that fluid democracy is less likely to identify the ground truth than direct voting. We examine different, arguably more realistic, classes of mechanisms, and prove they behave well by ensuring that (with high probability) there is a limit on concentration of power. Our proofs demonstrate that delegations can be treated as stochastic processes and that they can be compared to well-known random graph processes from the literature such as preferential attachment and multi-types branching process that are sufficiently bounded for our purposes. Our results suggest that the concerns raised about fluid democracy can be overcome, thereby bolstering the case for this emerging paradigm.

M. Taha Toghani
Communication-Efficient and Scalable Algorithms for Decentralized Consensus, Stochastic Optimization, Inference

We study the decentralized consensus, stochastic optimization, and hypothesis testing problems under compressed communications. Our main objective is to optimize the sum of \(n\) functions, each accessible to a single node, where local communications are subject to a fixed communication network. First, we propose a new decentralized consensus algorithm with quantized communications that scales linearly with the network size \(n\). We prove that the proposed method converges to the average of the initial values held locally by the agents. Moreover, we consider the decentralized consensus and stochastic optimization problems under arbitrary compressed message sharing over a directed graph. We present an iterative push-sum algorithm with arbitrary compressed communications and provide convergence rates for the stochastic optimization problem on smooth functions that are either (i) strongly convex, (ii) convex, or (iii) non-convex. We finally study the problem of distributed hypothesis testing (non-Bayesian learning) and propose a Byzantine-resilient algorithm with compressed belief sharing. We prove an almost sure asymptotic and a probabilistic non-asymptotic linear convergence of the beliefs around the set of optimal hypotheses. We present numerical experiments that confirm our theoretical results and illustrate the scalability, communication efficiency, and fault tolerance of our algorithms.

Vishrant Tripathi
Online Learning for Indexable Restless Multi-Armed Bandits

We consider indexable restless multi-armed bandits in an episodic setting, where the cost functions are unknown, time-varying and possibly adversarial across episodes. For this problem, we design a new online learning algorithm called Follow-the-Perturbed-Whittle-Leader and show that it has low regret compared to the best fixed scheduling policy in hindsight, while remaining computationally feasible. The algorithm and its regret analysis are novel and of independent interest to the study of online learning for combinatorial optimization problems using an approximate oracle. We further design algorithms that achieve sublinear regret compared to the best dynamic policy when the environment is slowly varying. Finally, we apply our algorithms to mobility tracking over a wireless network. We consider non-stationary and adversarial mobility models and illustrate the performance benefit of using our online learning algorithms compared to an oblivious scheduling policy.


Control and Game Theory

Abdullah Alomar
PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators

We consider offline reinforcement learning (RL) with heterogeneous agents under severe data scarcity, i.e., we only observe a single historical trajectory for every agent under an unknown, potentially sub-optimal policy. We find that the performance of state-of-the-art offline and model-based RL methods degrade significantly given such limited data availability, even for commonly perceived "solved" benchmark settings such as "MountainCar" and "CartPole". To address this challenge, we propose PerSim, a model-based offline RL approach which first learns a personalized simulator for each agent by collectively using the historical trajectories across all agents, prior to learning a policy. We do so by positing that the transition dynamics across agents can be represented as a latent function of latent factors associated with agents, states, and actions; subsequently, we theoretically establish that this function is well-approximated by a "low-rank" decomposition of separable agent, state, and action latent functions. This representation suggests a simple, regularized neural network architecture to effectively learn the transition dynamics per agent, even with scarce, offline data. We perform extensive experiments across several benchmark environments and RL methods. The consistent improvement of our approach, measured in terms of both state dynamics prediction and eventual reward, confirms the efficacy of our framework in leveraging limited historical data to simultaneously learn personalized policies across agents.

Sarah Cen
Regulating Algorithmic Filtering on Social Media

By filtering the content that users see, social media platforms have the ability to influence users' perceptions and decisions, from their dining choices to their voting preferences. This influence has drawn scrutiny, with many calling for regulations on filtering algorithms, but designing and enforcing regulations remains challenging. In this work, we examine three questions. First, given a regulation, how would one design an audit to enforce it? Second, does the audit impose a performance cost on the platform? Third, how does the audit affect the content that the platform is incentivized to filter? In response to these questions, we propose a method such that, given a regulation, an auditor can test whether that regulation is met with only black-box access to the filtering algorithm. We then turn to the platform's perspective. The platform's goal is to maximize an objective function while meeting regulation. We find that there are conditions under which the regulation does not place a high performance cost on the platform and, notably, that content diversity can play a key role in aligning the interests of the platform and regulators.

Charles Dawson
Differentiable Robotics: Tools for Automated Design and Robustness Analysis

Modern robotic systems involve complex interactions not only between software and hardware components, but also between the robot and its environment. As a result, designing these systems requires a holistic approach that considers this coupling and scales to complex robots with multiple subsystems, in order to both optimize the design of robotic systems and analyze their robustness to environmental uncertainty. In this talk, I will showcase how recently-developed programming languages for automatic differentiation and parallelization can be applied to manage this complexity and automate design optimization and robustness analysis for complex robotic systems. After describing how these tools enable a novel simulation-driven approach to design optimization and robustness analysis, I will present several case studies, including the end-to-end optimization of sensor networks to enable robot navigation.

Peter Fisher
Linear Quadratic Control Using a High-Order Tuner

We propose and prove the stability of a new algorithm for discrete-time control of time-invariant plants with unknown parameters. Recent papers have introduced the high-order tuner, an algorithm for parameter estimation of discrete-time dynamical systems based on Nesterov's accelerated gradient descent algorithm. We extend this algorithm to address the linear quadratic regulator (LQR) control problem on discrete-time LTI systems when system parameters are unknown. We show that the high-order tuner has performance bounds that are comparable to or better than those of other recently-proposed algorithms, and these theoretical results are verified in simulation.

Rabab Haider
Advanced Decision Making Tools for Smart Distribution Grids

As the US pushes aggressively towards clean energy goals, such as having 40% of electricity generated by solar by 2035, the electricity system as we know it must adapt to the influx of renewables and small-scale resources. Focusing on customer-owned solar generation, I show that hierarchical control and optimization architectures which leverage both local and distributed decision-making schemes can be used to coordinate both legacy devices and new inverter-based resources. The resulting grid can be operated efficiently and reliably at high solar penetration.

Runyu Zhang
On the Effect of Log-Barrier Regularization in Decentralized Softmax Gradient Play in Multiagent Systems

Softmax policy gradient is a popular algorithm for learning optimal policies in single-agent reinforcement learning, particularly since projection is not needed for each gradient update. However, in multi-agent systems, the lack of central coordination introduces additional difficulties beyond the single-agent case. Even for a stochastic game with identical interest, there can be multiple Nash Equilibria (NEs), disabling any proof technique that relies on the existence of a unique global optimum. Moreover, the softmax parameterization introduces non-NE policies with zero gradient, making NE-seeking difficult for gradient-based algorithms. In this paper, we study the finite time convergence of decentralized softmax gradient play in a special form of game, Markov Potential Games (MPGs), that includes the identical interest game as a special case. We investigate both gradient play and natural gradient play, with and without \(\log\)-barrier regularization. Establishing convergence in the unregularized cases relies on an assumption that the stationary policies are isolated, and yields convergence bounds that contain a trajectory dependent constant that can be arbitrarily large. By contrast, the introduction of \(\log\)-barrier regularization overcomes these drawbacks, but slightly worsens the dependence on other factors such as the action set size. An empirical study on a bandit game confirms the theoretical findings.

Zijie Zhou
Stratifying Online Field Experiments Using the Pigeonhole Design

Practitioners and academics have long appreciated the benefits that experimentation brings to firms. For online web-facing firms, however, it still remains challenging in handling heterogeneity when experimental units arrive sequentially in online field experiments. In this paper, we study a novel online experimental design problem, which we refer to as the "Online Stratification Problem." In this problem, experimental units with heterogeneous covariate information arrive sequentially and must be immediately assigned into either the control or the treatment group, with an objective of minimizing the total discrepancy between the two. To solve this problem, we propose a novel experimental design approach, which we refer to as the "Pigeonhole Design." The pigeonhole design partitions the covariate space into smaller spaces, which we refer to as pigeonholes, and balances the number of control and treatment units as much as possible. We analyze the theoretical performance of the pigeonhole design and show its effectiveness by comparing against two well-known benchmark designs: the match-pair design and the completely randomized design.


Algorithms and Optimization

Vassilis Digalakis Jr
Slowly Varying Regression under Sparsity

We consider the problem of parameter estimation in slowly varying regression models with sparsity constraints. We formulate the problem as a mixed integer optimization problem and demonstrate that it can be reformulated exactly as a binary convex optimization problem through a novel exact relaxation. The relaxation utilizes a new equality on Moore-Penrose inverses that convexifies the non-convex objective function while coinciding with the original objective on all feasible binary points. This allows us to solve the problem significantly more efficiently and to provable optimality using a cutting plane-type algorithm. We develop a highly optimized implementation of such algorithm, which substantially improves upon the asymptotic computational complexity of a straightforward implementation. We further develop a heuristic method that is guaranteed to produce a feasible solution and, as we empirically illustrate, generates high quality warm-start solutions for the binary optimization problem. We show, on both synthetic and real-world datasets, that the resulting algorithm outperforms competing formulations in comparable times across a variety of metrics including out-of-sample predictive performance, support recovery accuracy, and false positive rate. The algorithm enables us to train models with 10,000s of parameters, is robust to noise, and able to effectively capture the underlying slowly changing support of the data generating process.

Kevin Doherty
Performance Guarantees for Spectral Initialization in Rotation Averaging and Pose-Graph SLAM

In this work, we describe an initialization method based on spectral relaxation for rotation averaging and pose-graph optimization in the setting of simultaneous localization and mapping (SLAM). These optimization problems are typically both large-scale and nonconvex; in consequence, efficient methods for computing high-quality initial estimates are needed to enhance the speed and performance of the algorithms underlying these robot perception problems. To this end, we present an easily-implemented initialization method based on spectral relaxation tailored to the rotation averaging and SLAM problems. Critically, we also present error bounds controlling both the error of this estimate with respect to the ground truth as well as its distance to the corresponding global minimizer of the problem. These are, to our knowledge, the first such performance guarantees for spectral initializations to rotation averaging and pose-graph SLAM. The form of our bounds reveals the spectral properties of the measurement graphs to be central objects of interest in controlling these quantities. Moreover, our analysis is based upon new error bounds for global minimizers, which are likely to be of independent interest. Finally, our empirical results suggest that spectral estimates typically perform far better than the worst-case analysis suggests, producing comparable or higher-quality initializations to state-of-the-art techniques with lower computational cost.

Andy Haupt
Cautious Bandits

Consider a stochastic multi-armed bandit environment. A bandit algorithm is risk-averse if it chooses a deterministic arm over a stochastic arm of the same expectation with high probability in time. A risk averse algorithm may choose, at each finite time, an arm of lower expectation with more than ½ probability. For a wide range of schedules, epsilon-Greedy is risk-averse, and, in the worst-case, risk aversion can explain much its regret. Risk aversion is independent of other properties of algorithms such as mean-basedness. We survey qualitative evidence of misalignment of user risk preferences and algorithm risk preferences in recommender systems.

Eren Can Kizildag
A Curious Case Of Symmetric Binary Perceptron Model: Algorithms And Barriers

It has been very recently shown that the symmetric binary perceptron (SBP) exhibits an extreme form of clustering at all positive densities: almost all of its solutions are singletons separated by large distances. This suggests that finding a solution to SBP is likely to be computationally intractable. At the same time,SBP admits polynomial-time algorithms succeeding at low enough densities. This conundrum challenges the view that clustering implies algorithmic hardness. In this paper, we conduct a different landscape analysis to understand the true algorithmic tractability of this problem. Guided by statistical physics insights, we show that SBP exhibits the multi Overlap Gap Property (m-OGP), an intricate geometric property known to be a rigorous barrier for large classes of algorithms. Our analysis shows that the m-OGP threshold (a) is well below the satisfiability threshold; and (b) is nearly tight: up to polylogarithmic factors, it matches the best algorithmic threshold. We then leverage the m-OGP to establish that any sufficiently stable algorithm (appropriately defined) fails to find a satisfying solution at all densities slightly above the known algorithmic threshold. Our hardness result for the stable algorithms is based on Ramsey Theory from extremal combinatorics, and is of independent interest. The most technically involved part of this work is establishing the stability of the known algorithms (in particular, the one by Kim and Roche), which unlike in several prior models, do not appear to fall into the class of low-degree polynomials."

Xiang Meng
A Branch-and-Bound Search Method for Optimal Regression Trees

The decision tree is well-known for its conciseness and interpretability and is favored in applications ranging from revenue management and bioinformatics to healthcare. Traditional methods apply heuristics recursively in each split to quickly solve the model, which may not capture the best representation of the dataset. Motivated by this point, several recent publications have studied the use of Mixed Integer Programming (MIP) for finding the optimal decision tree. However, slow convergence and limited scalability have beset the MIP formulations. To overcome this difficulty, we design a novel branch-and-bound method for learning optimal regression trees. The algorithm recursively splits the search space by candidate splitting points according to percentiles of feature distribution. It then maps the continuous features into buckets separated by these candidate points and estimates lower bounds by leaving out certain buckets. We show experimentally that our approach can handle datasets with tens of thousands of instances and provide several orders of magnitude improvements compared to existing methods.

Prem Talwai
A PAC-Bayesian Adaptation Scheme for Regularized Regression

In this paper, we propose a PAC-Bayesian a posteriori parameter selection scheme for adaptive regularized regression in Hilbert spaces under general, unknown source conditions. We demonstrate that our approach is adaptive to misspecification, and achieves the optimal learning rate under weak distributional assumptions. Unlike existing parameter selection schemes, the computational complexity of our approach is independent of sample size. We discuss applications of our approach to statistical inverse problems and oracle-efficient contextual bandit algorithms.

Feng Zhu
Online Matching with Reusable Network Resources and Decaying Rewards: New Framework and Analysis

We build a new unified modeling and analysis framework for a broad class of online matching problems. The proposed unified framework encompasses a number of classical online matching problems and accommodates three practical features: reusable resources, network resources and decaying rewards. For the unified framework, we provide a unified performance analysis tool for greedy and greedy-like policies, measured by competitive ratios under adversarial environment. We prove that greedy-like policies can achieve near-optimal performances under the unified framework. We then dig deeper into several representative special classes of online matching problems, which impose additional realistic structural assumptions on top of the unified framework. We prove that slight modifications to greedy-like policies can successfully utilize additional structural information to significantly enhance policy performances.