# Algorithms and Optimization

##### Kwangjun Ahn

Learning threshold neurons via the “edge of stability”Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J.\ Cohen et al.\ (ICLR 2021), which exhibit startling new phenomena (the ``edge of stability'' or ``unstable convergence'') and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn ``threshold-like'' neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

##### Feng Zhu

Stochastic Bandit Policies with Safety against Heavy-tailed RiskWe study the stochastic multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution. Starting from the two-armed bandit setting with time horizon T, we propose a simple policy and prove that the policy (i) enjoys the worst-case optimality for the expected regret at order O(√T lnT) and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate exp(−Ω(√T)), a rate that we prove to be best achievable for all worst-case optimal policies. Briefly, our proposed policy achieves a delicate balance between doing more exploration at the beginning of the time horizon and doing more exploitation when approaching the end, compared to the standard Successive Elimination policy and Upper Confidence Bound policy. We then improve the policy design and analysis to work for the general K-armed bandit setting. Specifically, the worst-case probability of incurring a regret larger than any x > 0 is upper bounded by exp(−Ω(x/√KT)). We also enhance the policy design to accommodate the “any-time” setting where T is not known a priori, and prove equivalently desired policy performances as compared to the “fixed-time” setting with known T. A brief account of numerical experiments is conducted to illustrate the theoretical findings. We conclude by extending our proposed policy design to (1) a stochastic linear bandit setting (2) a stochastic multi-armed bandit setting with adversarial baseline rewards, and proving that some adaptations of our policies lead to both worst-case optimality in terms of expected regret order and light-tailed risk on the regret distribution.

##### Sarah Cen

Network Synthetic Interventions: A Causal Framework for Panel Data with Network InterferenceWe propose a generalization of the synthetic controls and synthetic interventions methodology to incorporate network interference. We consider the estimation of unit-specific treatment effects from panel data where there are spillover effects across units and in the presence of unobserved confounding. Key to our approach is a novel latent factor model that takes into account network interference and generalizes the factor models typically used in panel data settings. We propose an estimator, "network synthetic interventions", and show that it consistently estimates the mean outcomes for a unit under an arbitrary sequence of treatments for itself and its neighborhood, given certain observation patterns hold in the data. We corroborate our theoretical findings with simulations.

##### Renato Berlinghieri

Gaussian processes at the Helm(holtz): A better way to model ocean currentsUnderstanding the behavior of ocean currents has important practical applications. Since we expect current dynamics to be smooth but highly non-linear, Gaussian processes (GPs) offer an attractive model. In particular, one existing approach is to consider the velocities of the buoys as sparse observations of a vector field in two spatial dimensions and one time dimension. But we show that applying a GP, e.g. with a standard square exponential kernel, directly to this data fails to capture real-life current structure, such as continuity of currents and the shape of vortices. By contrast, these physical properties are captured by divergence and curl-free components of a vector field obtained through a Helmholtz decomposition. So we propose instead to model these components with a GP directly. We show that, because this decomposition relates to the original vector field just via mixed partial derivatives, we can still perform inference given the original data with only a small constant multiple of additional computational expense. We illustrate our method on real oceans data.

##### Daniel Shen

Valuing Uncertainty in Electricity Markets ApproachThe electric grid will need to incorporate unprecedented amounts of variable renewable energy (VRE) to achieve decarbonization goals. Successful integration of VRE into the grid requires forecasting the amount of available resource and accounting for uncertainties in the forecast. This presentation will cover the basics of restructured market operations in the United States as well as present a method for valuing the energy and uncertainty associated with zero-variable-cost VRE.

##### Ashkan Soleymani

Causal Feature Selection via Orthogonal SearchThe problem of inferring the direct causal parents of a response variable among a large set of explanatory variables is of high practical importance in many disciplines. However, established approaches often scale at least exponentially with the number of explanatory variables, are difficult to extend to nonlinear relationships, and are difficult to extend to cyclic data. Inspired by {\em Debiased} machine learning methods, we study a one-vs.-the-rest feature selection approach to discover the direct causal parent of the response. We propose an algorithm that works for purely observational data while also offering theoretical guarantees, including the case of partially nonlinear relationships possibly under the presence of cycles. As it requires only one estimation for each variable, our approach is applicable even to large graphs. We demonstrate significant improvements compared to established approaches.

# Control and Game Theory

##### Gilhyun Ryou

Multi-fidelity time-optimal trajectory generationGenerating a time-optimal trajectory for highly maneuverable vehicles, such as a quadrotor aircraft, is challenging because the optimal trajectory is located on the boundary of the set of dynamically feasible trajectories. This boundary is difficult to model as it involves limitations of the entire system, including complex dynamics and electromechanical phenomena, in agile high-speed flight. We propose a multi-fidelity Bayesian optimization framework that models the feasibility constraints based on analytical approximations, numerical simulations, and real-world experiments. In subsequent works, we demonstrate how the proposed algorithm can be applied to more complicated systems, such as tailsitter vehicles and multiple drones. Finally, the proposed algorithm is extended to real-time applications with a temporal deep neural network and sequence-to-sequence learning.

##### Yi Tian

Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.

##### Allen Wang

Plasma Control: A Critical Path Problem for Fusion EnergyUnbeknownst to most people, "plasma control" in magnetic fusion reactors is one of the most challenging real-time control problems attempted by humankind; being under-observed, under-actuated, stochastic, and requiring, in some cases, sub millisecond response times. Progress on this challenging problem is becoming increasingly urgent. The SPARC fusion experiment, under construction by Commonwealth Fusion Systems in collaboration with MIT PSFC, is on track to achieve production of >100 MW of energy with an energy gain factor of ~10x by 2026. To then translate into a viable power plant, we need to radically reduce the frequency and severity of "plasma disruptions", events where we lose control of the plasma. While plasma disruptions don't pose any risk to public safety, they severely threaten economic viability as they may disable the power plant for weeks or months. The goal of this talk is to raise awareness of this problem and highlight potential opportunities for ideas from machine learning and robot autonomy to make an impact.

##### Lujie Yang

Suboptimal Controller Synthesis for CartPoles and Quadrotors via Sums-of-SquaresSums-of-squares (SOS) optimization is a promising tool to synthesize certifiable controllers for control tasks. However, many examples in the literature fail to generate impressive dynamic behaviors on complicated robotic systems. Here, we improve upon previous approaches, and show that we can apply SOS to synthesize controllers with bounded suboptimal performance for various underactuated robotic systems by finding good approximations of the value function. We summarize a unified SOS framework to synthesize both under- and over- approximations of the value function for continuous-time, control-affine systems and use these approximations to generate suboptimal controllers, and perform regional analysis on the closed-loop system driven by these controllers. We then extend the formulation to handle hybrid systems with contacts. We demonstrate that our method can generate tight under- and over- approximations of the value function with low-degree polynomials, which are used to provide stabilizing controllers for continuous-time systems including the inverted pendulum, the cartpole, and the 3D quadrotor as well as the planar pusher, a hybrid system. To the best of our knowledge, this is the first time that a SOS-based time-invariant controller can swing up and stabilize a cartpole, and push the planar slider to the desired pose.

##### Varun Murali

Perception-aware planning for fast and agile robotsIn this talk, we explore the problem of jointly optimizing the certainty of navigation and the speed of the robot. Perception-aware allows the modification of desired paths to better exploit available information to navigate to the goal with a desired certainty. Specifically, we propose a novel optimal control method for finding a perturbed quadrotor trajectory such that the resulting trajectory can be executed with greater certainty than the initial. Certainty is quantified by the ego pose estimation computed from fusing the on board IMU and vision-based landmarks. We employ deep model-free reinforcement learning to learn how to manipulate smooth input trajectories that are parameterized by B ́ezier curves. We first formulate the trajectory optimization problem as a reinforcement learning problem on a compressed state space. We then introduce a modular reinforcement learning gym environment for this trajectory optimization in conjunction with differentially flat vehicles.

##### Alexandre Amice

Rigorous Polyhedral Decompositions of Collision-Free Configuration Space for Robot ManipulatorsUnderstanding the geometry of collision-free configuration space (C-free) in the presence of task space obstacles is an essential ingredient for collision-free motion planning. While it is possible to check for collisions at a point using standard algorithms, to date no practical method exists for computing C-free regions with rigorous certificates due to the complexity of mapping task-space obstacles through the kinematics. In this work, we present the first to our knowledge rigorous method for approximately decomposing C-free into certified polyhedral regions. Our method, called C-IRIS (C-space Iterative Regional Inflation by Semidefinite programming), generates large, convex polytopes in a rational parametrization of the configuration space which are rigorously certified to be collision-free. Such regions have been shown to be useful for both optimization-based and randomized motion planning. Based on convex optimization, our method works in arbitrary dimensions, only makes assumptions about the convexity of the obstacles in the task space, and is fast enough to scale to realistic problems in manipulation.

# Machine Learning and Statistics

##### Anzo Teh

Improved empirical Bayes estimation for the Poisson mixture model via empirical risk minimizationWe consider the problem of mean estimation based on observed samples for Poisson mixture model. The celebrated Robbins estimator has been shown to achieve asymptotic optimal regret bound for certain classes of prior, but in practice its lack of smoothness results in a high empirical error. Minimum distance based estimators, on the other hand, achieve both optimal asymptotic bound and good empirical performance, but the computation is slow and very difficult to generalize to multidimensional settings. The monotonicity of the Bayes estimator motivates us to consider the empirical risk minimizer (ERM) among the monotone class, both with respect to a suitably framed objective function and the Robbins. We show this ERM can achieve optimal regret (via Rademacher averages), good empirical performance, and is also simple to implement.

##### Sung Min (Sam) Park

TRAK: Understanding Model Predictions at Scale via the Empirical NTKThe goal of {\em data attribution} is to trace model predictions back to the training data. Prior approaches to this task exhibit a strict tradeoff between computational demand and efficacy. Sampling-based methods (such as Shapley values and datamodels) are effective but require training thousands of models, making them impractical for large models or datasets. On the other hand, approximate methods such as those based on influence functions are fast but tend to be unreliable in large-scale settings. In this work, we introduce TRAK (Tracing by Rewinding the After Kernel), a data attribution method for overparameterized models that brings us closer to the best of both worlds. The key idea behind TRAK is to view overparameterized models as kernel machines---via its empirical neural tangent kernel---and then to leverage our understanding of the kernel domain to derive data attribution scores. Using only a handful of model checkpoints, TRAK matches the performance of attribution methods that use thousands of trained models, reducing costs by up to three orders of magnitude. We demonstrate the utility of TRAK by applying it to a variety of large-scale settings across different data modalities: image classification models; CLIP models, and large language models.

##### Austin J Stromme

On the sample complexity of entropic optimal transportWe study the sample complexity of entropic optimal transport in high dimensions using computationally efficient plug-in estimators. We significantly advance the state of the art by establishing dimension-free, parametric rates for estimating various quantities of interest, including the entropic regression function which is a natural analog to the optimal transport map. As an application, we propose a practical model for transfer learning based on entropic optimal transport and establish parametric rates of convergence for nonparametric regression and classification.

##### Adityanarayanan Radhakrishnan

Feature learning in neural networks and kernel machines that recursively learn featuresNeural networks have achieved impressive results on many technological and scientific tasks. Yet, their empirical successes have outpaced our fundamental understanding of their structure and function. By identifying mechanisms driving the successes of neural networks, we can provide principled approaches for improving neural network performance and develop simple and effective alternatives. In this work, we isolate the key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to the average gradient outer product. We subsequently leverage this mechanism to design Recursive Feature Machines (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, (2) achieve state-of-the-art results on tabular datasets, surpassing a broad spectrum of models including neural networks, and (3) shed light on recently observed deep learning phenomena such as grokking, lottery tickets, simplicity biases, and spurious features.

##### Yaqi Duan

Optimal policy evaluation using kernel-based temporal difference methodsWe study non-parametric methods for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We establish non-asymptotic bounds on the statistical error of a kernel-based least-squares temporal difference (LSTD) estimate, which can be understood either as a non-parametric instrumental variables method, or as a projected approximation to the Bellman fixed point equation. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and population-level kernel LSTD solutions. Using empirical process theory and concentration inequalities, we establish a non-asymptotic upper bound on the error with explicit dependence on the effective horizon $H = (1-\gamma)^{-1}$ of the Markov reward process, the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which shows that our rate is optimal in terms of the sample size $n$ and the effective horizon $H$. Whereas existing worst-case theory predicts cubic scaling ($H^3$) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.

##### Rohan Alur

A Practical Audit for Human Expertise via Conditional Permutation TestingDespite recent advances in machine learning, many high-stakes prediction tasks (e.g. patient triage and diagnosis) are still handled by trained human experts. A common source of concern about automation in these settings is that humans can access substantive information (e.g. direct conversations with a patient) which is simply unavailable to a would-be predictive model. Furthermore, humans can seek out additional information adaptively, incorporating new ‘features’ which are specifically relevant to a given instance. In this work, we provide a practical algorithm to audit for this kind of human expertise in general prediction tasks. In particular, given a set of historical features, human predictions and realized outcomes, we develop a procedure which tests whether the forecaster is productively incorporating private information into their predictions. Our techniques are inspired by the Model-X Knockoffs framework (Candes et al., 2017) and the Conditional Permutation Test (Berrett et al., 2019), but we avoid requiring knowledge of the joint distribution on features and predictions. We demonstrate that our algorithm maintains a number of desirable theoretical properties, including in finite samples, while remaining exceptionally straightforward for practitioners to both implement and interpret.

##### Lelia Marie Hampton

Fair and Robust Deep Learning Under Heavy-Tailed UncertaintyAI systems for societal decision-making must be safe for real world deployment, especially in mission critical settings (e.g., healthcare, medical research, policy analysis and development). However, real world data is heavy-tailed, and this leads to inherent uncertainty in deep learning systems trained on imbalanced data. Unfortunately, this reality can lead to harm for various social groups. For instance, marginalized groups are often underrepresented in datasets (e.g., healthcare, genomics, census data). Consequently, heavy-tailed settings create difficulties for statistical learning models to capture a signal from this data, leading to compounded known unknowns and unknown unknowns. However, further data collection may not always be feasible. For example, rare diseases by their very nature create challenges in further data collection. Thus, practitioners must benchmark and develop algorithmic approaches in addition to ethical data development to enhance safety in settings with heavy-tailed uncertainty, especially as it pertains to historically marginalized groups. In high stakes settings, models should not only output predictions, but also confidence scores. This capability allows the adjustment of a model’s predictions based on their confidence score.

# Information and Networks

##### Chandler Squires

Linear Causal Disentanglement via InterventionsCausal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.

##### Sylvie Koziel

How much data is too much data?The transition towards smart grids requires the deployment of cyber technologies that can dynamically monitor and control the electrical power system. Research literature exposes plenty of ways to use various sensor data to improve power grid reliability and efficiency. But collecting and processing data has an economic and environmental cost, not to mention the technical difficulties associated with sensor failures, cyber threats and processing of huge amounts of data. How much data is really needed to keep the grid reliable, considering the deployment of electric vehicles and distributed generation? This is the central question of the talk. First, the reasons why more data in power systems might be needed will be clarified, as well as the drawbacks associated with more data generation. During the talk, we will also present the work done so far to assess the level of data granularity needed for power grids to be smarter, reliable and efficient.

##### Vishrant Tripathi

WiSwarm: Age-of-Information-based Wireless Networking for Collaborative Teams of UAVs BarriersThe Age-of-Information (AoI) metric has been widely studied in the theoretical communication networks and queuing systems literature. However, experimental evaluation of its applicability to complex real-world time-sensitive systems is largely lacking. In this work, we develop, implement, and evaluate an AoI-based application layer middleware that enables the customization of WiFi networks to the needs of time-sensitive applications. By controlling the storage and flow of information in the underlying WiFi network, our middleware can: (i) prevent packet collisions; (ii) discard stale packets that are no longer useful; and (iii) dynamically prioritize the transmission of the most relevant information. To demonstrate the benefits of our middleware, we implement a mobility tracking application using a swarm of UAVs communicating with a central controller via WiFi. Our experimental results show that, when compared to WiFi-UDP/WiFi-TCP, the middleware can improve information freshness by a factor of 109x/48x and tracking accuracy by a factor of 4x/6x, respectively. Most importantly, our results also show that the performance gains of our approach increase as the system scales and/or the traffic load increases.

##### Gioele Zardini

Co-design of Complex Systems: From Embodied Intelligence to Mobility SystemsWhen designing autonomous systems, we need to consider multiple trade-offs at various abstraction levels, and choices of single components need to be studied jointly. For instance, the design of future mobility solutions (autonomous vehicles, micromobility solutions, etc.) and the design of the mobility systems they enable are closely coupled. Indeed, knowledge about the intended service of novel mobility solutions would impact their design and deployment process, whilst insights about their technological development could significantly affect transportation management policies. In this talk, I will present a framework to co-design complex systems, instantiated in the purpose of co-designing future mobility systems all the way from the policies the city can design to the embodied intelligence in autonomous vehicles, leveraging a monotone theory of co-design and tools from game theory. Through various case studies, I will show how this approach allows one to solve heterogeneous problems, unifying different modeling techniques and promoting interdisciplinarity, modularity, and compositionality. Finally, I will discuss current and future challenges, and will provide some points for discussion.

##### Wentao Weng

Efficient decentralized multi-agent learning in asymmetric bipartite queuing systemsLearning in multi-agent systems often poses significant challenges due to interference between agents. In particular, unlike classical stochastic systems, the performance of an agent's action is not drawn i.i.d. from some distribution but is directly affected by the (unobserved) actions of the other agents. This is the reason why most collaborative multi-agent learning approaches aim to globally coordinate all agents' actions to evade this interference. In this talk, we focus on agents in a decentralized bipartite queuing system, where N agents request service from K servers. Prior decentralized approaches aim to globally identify a coordinated schedule or do not take advantage of the bipartite structure, which leads to significant shortcomings: performance that degrades exponentially in the number of servers, requirement of shared randomness and unique identifiers, and computationally demanding algorithms. In contrast, we provide a low-complexity algorithm that is run decentrally by each agent, avoids the shortcomings of "global coordination" and leads the queuing system to have efficient performance in asymmetric bipartite queuing systems while also having additional robustness properties.

##### Austin Saragih

Value of Information Analysis for Supply Chain Network Design Under UncertaintyA critical strategic decision-making problem in supply chain management is the design of supply chain networks, especially in the face of many uncertainties. Existing approaches consider uncertainties, but they do not consider the benefit of resolving them. In this paper, we integrate the cost of a supply chain network with resolved uncertainty and the cost of resolving the uncertainty. We present a framework and implementation of the value of information (VOI) analysis for supply chain network design (SCND) under uncertainty. We formulate an optimal information-gathering strategy for decision-makers to prioritize the most essential segments of their supply chain and develop a polynomial time solution algorithm. Based on stylized results and numerical experiments, we show a significant value of partial information gathering. VOI analysis yields a simple, quickly solvable, and beneficial information-gathering strategy for improved decision-making. Furthermore, VOI analysis is generalizable to many other operations management settings under uncertainty.