# Student Talk Abstracts

##### Ryan Cory-Wright

A Unified Approach to Mixed-Integer Optimization: Nonlinear Formulations and Scalable AlgorithmsWe propose a unified framework to address a family of classical mixed-integer optimization problems, including network design, facility location, unit commitment, sparse portfolio selection, binary quadratic optimization and sparse learning problems. These problems exhibit logical relationships between continuous and discrete variables, which are usually reformulated linearly using a big-M formulation. In this work, we challenge this longstanding modeling practice and express the logical constraints in a non-linear way. By imposing a regularization condition, we reformulate these problems as convex binary optimization problems, which are solvable using an outer-approximation procedure. In numerical experiments, we establish that a general-purpose numerical strategy, which combines cutting-plane, first-order and local search methods, solves these problems faster and at a larger scale than state-of-the-art mixed-integer linear or second-order cone methods. Our approach successfully solves network design problems with 100s of nodes and provides solutions up to 40% better than the state-of-the-art; sparse portfolio selection problems with up to 3200 securities compared with 400 securities for previous attempts; and sparse regression problems with up to 100000 covariates.

##### Igor Kadota

WiFresh: Age-of-Information from Theory to ImplementationEmerging applications, such as autonomous vehicles and smart factories, increasingly rely on sharing time-sensitive information for monitoring and control. In such application domains, it is essential to keep information fresh, as outdated information loses its value and can lead to system failures and safety risks. The Age-of-Information (AoI) is a recently proposed performance metric that captures the freshness of the information from the perspective of the destination. In this talk, we consider a wireless network with a base station receiving time-sensitive information from a number of nodes through unreliable channels. We formulate a discrete-time decision problem to find a transmission scheduling policy that optimizes the AoI in the network. First, we derive a lower bound on the achievable AoI performance. Then, we develop three low-complexity scheduling policies with performance guarantees: a randomized policy, a Max-Weight policy and a Whittle’s Index policy. Leveraging our theoretical results, we propose WiFresh: a simple yet unconventional architecture for wireless networks that achieves near optimal AoI. To demonstrate the impact of WiFresh in real operating scenarios, we deploy and validate our architecture using programmable radios and computing platforms.

##### Manxi Wu

Bayesian Learning with Adaptive Load Allocation StrategiesWe study a Bayesian learning dynamics induced by agents who repeatedly allocate loads on a set of resources based on their belief of an unknown parameter that affects the cost distributions of resources. In each step, belief update is performed according to Bayes' rule using the agents' current load and a realization of costs on resources that they utilized. Then, agents choose a new load using an adaptive strategy update rule that accounts for their preferred allocation based on the updated belief. We prove that beliefs and loads generated by this learning dynamics converge almost surely. The convergent belief accurately estimates cost distributions of resources that are utilized by the convergent load. We establish conditions on the initial load and strategy updates under which the cost estimation is accurate on all resources. These results apply to Bayesian learning in congestion games with unknown latency functions. Particularly, we provide conditions under which the load converges to an equilibrium or socially optimal load with complete information of cost parameter. We also design an adaptive tolling mechanism that eventually induces the socially optimal outcome.

##### Jean-Baptiste Seby

Multi-trek Separation in Linear Structural Equation ModelsIn this work we give a correspondence between the higher order cumulants and combinatorial structure in the corresponding graph for linear structural equation models whose error terms are not necessarily Gaussian. It has previously been shown that low rank of the covariance matrix corresponds to the trek separation criterion in the corresponding graph. Generalizing this criterion to multiple sets of vertices, we characterize when determinants of subtensors of the higher order cumulant tensors vanish. This criterion generalizes to the case of hidden variables as well. More precisely, we show that the determinant of the subtensor of the \(k\)-th order cumulants for \(k\) sets of vertices with equal cardinality vanishes if and only if there is no system of \(k\)-treks between these sets without sided intersection. For example, this criterion enables us to identify the presence of multi-way interaction in the case of hidden variables.

##### Shuvomoy Das Gupta

On seeking efficient Pareto optimal points in multi-player minimum cost flow problemsWe propose a multi-player extension of the minimum cost flow problem inspired by a transportation problem that arises in modern transportation industry. We associate one player with each arc of a directed network, each trying to minimize its cost function subject to the network flow constraints. In our model, the cost function can be any general nonlinear function, and the flow through each arc is an integer. We present algorithms to compute efficient Pareto optimal point(s), where the maximum possible number of players minimize their cost functions simultaneously. The computed Pareto optimal points are Nash equilibriums if the problem is transformed into a finite static game in normal form.

##### Anish Agarwal

On Robustness of Principal Component RegressionPrincipal Component Regression (PCR) is a simple, but powerful and ubiquitously utilized method. Its effectiveness is well established when the covariates exhibit low-rank structure. However, its ability to handle settings with noisy, missing, and mixed (discrete and continuous) valued covariates is not understood and remains an important open challenge. As the main contribution of this work, we establish the robustness of PCR (without any change) in this respect, and provide meaningful finite-sample analysis. In the process, we establish that PCR is equivalent to performing Linear Regression after pre-processing the covariate matrix via Hard Singular Value Thresholding (HSVT). That is, PCR is equivalent to the recently proposed robust variant of the Synthetic Control method in the context of counterfactual analysis using observational data. As an immediate consequence, we obtain finite-sample analysis of the Robust Synthetic Control (RSC) estimator that was previously absent. As an important contribution to the Synthetic Control literature, we establish that an (approximate linear) synthetic control always exists in the setting of a generalized factor model (or latent variable model) and need not be assumed as an axiom as is traditionally done in the literature. We further discuss a surprising implication of the robustness property of PCR with respect to noise, i.e., PCR can learn a good predictive model even if the covariates are tactfully transformed to preserve (differential) privacy. Finally, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the l2,∞-norm rather than the Frobenius norm as is commonly done in the matrix estimation literature, which may be of interest in its own right.

##### Agni Orfanoudaki

Personalized Treatment for Coronary Artery Disease Patients: A Machine Learning ApproachCurrent clinical practice guidelines for managing Coronary Artery Disease (CAD) account for general cardiovascular risk factors. However, they do not present a framework that considers personalized patient-specific characteristics. Using the electronic health records of 21460 patients, we created data-driven models for personalized CAD management that significantly improve health outcomes relative to the standard of care. We develop binary classifiers to detect whether a patient will experience an adverse event due to CAD within a 10-year time frame. Combining the patients' medical history and clinical examination results, we achieve 81.5% AUC. For each treatment, we also create a series of regression models that are based on different supervised machine learning algorithms. We are able to estimate with average \(R2\)= 0.801 the time from diagnosis to a potential adverse event (TAE) and gain accurate approximations of the counterfactual treatment effects. Leveraging combinations of these models, we present ML4CAD, a novel personalized prescriptive algorithm. Considering the recommendations of multiple predictive models at once, ML4CAD identifies for every patient the therapy with the best expected outcome using a voting mechanism. We evaluate its performance by measuring the prescription effectiveness and robustness under alternative ground truths. We show that our methodology improves the expected TAE upon the current baseline by 24.11%, increasing it from 4.56 to 5.66 years. The algorithm performs particularly well for the male (24.3% improvement) and Hispanic (58.41% improvement) subpopulations. Finally, we create an interactive interface, providing physicians with an intuitive, accurate, readily implementable, and effective tool.

##### Sitan Chen

Learning Mixtures of Linear Regressions in Subexponential TimeWe consider the problem of learning a mixture of linear regressions (MLRs). An MLR is specified by \(k\) nonnegative mixing weights \(p_1, ... , p_k \) summing to 1, and \(k\) unknown regressors \(w_1,...,w_k \in R^d\). A sample from the MLR is drawn by sampling \(i\) with probability \(p_i\), then outputting \((x,y)\) where \(y = (x, w_i) + \eta \), where \(\eta \sim N(0,\sigma^2)\) for noise rate \( \sigma \). Mixtures of linear regressions are a popular generative model, and have been studied extensively in machine learning and theoretical computer science. However, all previous algorithms for learning the parameters of an MLR require running time and sample complexity scaling exponentially with \(k\). In this work, we give the first algorithm for learning an MLR that runs in time which is subexponential in \(k\). Specifically, we give an algorithm which runs in time scaling only in \(\exp(\sqrt{k})\) and outputs the parameters of the MLR to high accuracy, even in the presence of nontrivial regression noise. We demonstrate a new method that we call "Fourier moment descent" which uses univariate density estimation and low-degree moments of the Fourier transform of suitable univariate projections of the MLR to iteratively refine our estimate of the parameters. To the best of our knowledge, these techniques have never been used in the context of high dimensional distribution learning, and may be of independent interest. We also show that our techniques can be used to give a subexponential time algorithm for a natural hard instance of the subspace clustering problem, which we call "learning mixtures of hyperplanes."

##### Arthur Delarue

The Price of InterpretabilityWhen quantitative models are used to support decision-making on complex and important topics, understanding a model's "reasoning" can increase trust in its predictions, expose hidden biases, or reduce vulnerability to adversarial attacks. However, the concept of interpretability remains loosely defined and application-specific. In this paper, we introduce a mathematical framework in which machine learning models are constructed in a sequence of interpretable steps. We show that for a variety of models, a natural choice of interpretable steps recovers standard interpretability proxies (e.g., sparsity in linear models). We then generalize these proxies to yield a parametrized family of consistent measures of model interpretability. This formal definition allows us to quantify the "price of interpretability, i.e., the tradeoff with predictive accuracy. We demonstrate practical algorithms to apply our framework on real and synthetic datasets.

##### Chulhee (Charlie) Yun

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacityThe focus of this talk will be about finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require N hidden nodes to memorize/interpolate arbitrary N data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with \(\Omega (\sqrt{N}) \) hidden nodes can perfectly memorize most datasets with N points. We also prove that width \( \Theta(\sqrt{N})\) is necessary and sufficient for memorizing \(N \) data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an \(L\)-layer network with \(W\) parameters in the hidden layers can memorize \(N\) data points if \(W=\Omega(N) \). Combined with a recent upper bound \(\mathrm{O}(WL\log W) \) on VC dimension, our construction is nearly tight for any fixed L. Time permitting, I will also briefly talk about memorization capacity of residual networks and the dynamics of stochastic gradient descent (SGD) initialized near a memorizing global minimum of the empirical risk. This is joint work with Prof. Suvrit Sra and Prof. Ali Jadbabaie.

##### Rahul Singh

Kernel Instrumental Variable RegressionInstrumental variable (IV) regression is a strategy for learning causal relationships in observational data. If measurements of input \(X \) and output \(Y\) are confounded, the causal relationship can nonetheless be identified if an instrumental variable \(Z\) is available that influences \(X\) directly, but is conditionally independent of \(Y\) given \(X\) and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among \(X\), \(Y\), and \(Z\) as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state of the art alternatives for nonparametric IV regression.

##### Michael (Lingzhi) Li

Fast Exact Matrix Completion: A Unifying Optimization FrameworkWe consider the problem of matrix completion of rank k on an n×m matrix. We show that both the general case and the case with side information can be formulated as a combinatorical problem of selecting k vectors from p column features. We demonstrate that it is equivalent to a separable optimization problem that is amenable to stochastic gradient descent. We design fastImpute, based on projected stochastic gradient descent, to enable efficient scaling of the algorithm of sizes of \(10^5 \times 10^5\). We report experiments on both synthetic and real-world datasets that show fastImpute significantly outperforms current state-of-the-art in both the accuracy of the matrix recovered and the time needed across all cases.

##### Alireza Fallah

On Theory of Model-Agnostic Meta-Learning AlgorithmsWe study the convergence of a class of gradient-based Model-Agnostic Meta- Learning (MAML) methods and characterize their overall computational complexity as well as their best achievable level of accuracy in terms of gradient norm for nonconvex loss functions. In addition, we answer some of the unanswered questions for the implementation of these algorithms including how to choose their learning rate (stepsize) and the batch size for both tasks and datasets corresponding to tasks. In particular, we show that MAML can find an \(\epsilon \)-first-order stationary point for any positive \(\epsilon \) after at most \(\mathrm{O}(1/\epsilon^2)\) iterations at the expense of requiring second-order information. We also show that the first order approximation of MAML (FO-MAML) which ignores the second-order information required in the update of MAML cannot achieve any small desired level of accuracy. We further propose a new variant of the MAML algorithm called Hessian-free MAML (HF-MAML) which preserves all theoretical guarantees of MAML, without requiring access to the second-order information of loss functions. Based on a joint work with Aryan Mokhtari (UT Austin) and Asu Ozdaglar (MIT).

##### Marwa El Halabi

Optimal approximation for unconstrained non-submodular minimizationSubmodular function minimization is a well studied problem; existing algorithms solve it exactly or up to arbitrary accuracy. However, in many applications, the objective function is not exactly submodular. No theoretical guarantees exist in this case. While submodular minimization algorithms rely on intricate connections between submodularity and convexity, we show that these relations can be extended sufficiently to obtain approximation guarantees for non-submodular minimization. In particular, we prove how a projected subgradient method can perform well even for certain nonsubmodular functions. This includes important examples, such as objectives for structured sparse learning and variance reduction in Bayesian optimization. We also extend this result to noisy function evaluations. Our algorithm works in the value oracle model. We prove that in this model, the approximation result we obtain is the best possible with a subexponential number of queries.

##### Matthew Brennan

Universality of Computational Lower Bounds for Submatrix DetectionIn submatrix detection, the task is to detect a \(k \times k \) submatrix with entries sampled from a distribution \(P\) in an \(n \times n\) matrix of samples from \(Q\). This generalizes well-studied problems such as biclustering and community detection. These problems seem to exhibit a universal phenomenon: there is a statistical-computational gap depending on \(P\) and \(Q\) between the minimum \(k\) at which this task can be solved and at which it can be solved efficiently. We characterize this gap as a tradeoff between \(k\) and the KL divergence between \(P\) and \(Q\) through average-case reductions, for a wide class of pairs \((P, Q)\).

##### Cihan Emre Kement

Real and Reactive Power Based Optimal Privacy Protection in Smart Grid Demand ResponseFrequent metering of consumption data is crucial for demand side management in smart grids. However, the metered data can easily be processed by using nonintrusive load monitoring techniques to infer appliance usage, which gives insight about the private lives of consumers. Existing privacy protection algorithms focus on hiding or altering metered real power, whereas smart meters also collect reactive power for various purposes. In this work, we optimize the consumer privacy in a demand response scheme considering both real and reactive power data. Also, we consider the user cost and comfort as objectives, and build the optimization problem in such a way that the effects of optimizing sub-objectives on the others can be observed. Results show that hiding only real or reactive power is not enough for ensuring privacy and they need to be altered simultaneously.

##### Jinglong Zhao

The Competitive Ratio of Threshold Policies for Online Unit-density Knapsack ProblemsWe study an online knapsack problem where the items arrive sequentially and must be either immediately packed into the knapsack or irrevocably discarded. Each item has a different size and the objective is to maximize the total size of items packed. We focus on the class of randomized algorithms which initially draw a threshold from some distribution, and then pack every fitting item whose size is at least that threshold. Threshold policies satisfy many desiderata including simplicity, fairness, and incentive-alignment. We derive two optimal threshold distributions, the first of which implies a competitive ratio of 0.432 relative to the optimal offline packing, and the second of which implies a competitive ratio of 0.428 relative to the optimal fractional packing. We also consider the generalization to multiple knapsacks, where an arriving item has a different size in each knapsack and must be placed in at most one. This is equivalent to the AdWords problem where item truncation is not allowed. We derive a randomized threshold algorithm for this problem which is 0.214-competitive. We also show that any randomized algorithm for this problem cannot be more than 0.461-competitive, providing the first upper bound which is strictly less than 0.5.

##### Jason Cheuk Nam Liang

Incentive-aware Contextual Pricing with Non-parametric Market NoiseWe consider a dynamic pricing problem for repeated contextual second-price auctions with strategic buyers whose goals are to maximize their long-term time discounted utility. The seller has very limited information about buyers' overall demand curves, which depends on \(d\)-dimensional context vectors characterizing auctioned items, and a non-parametric market noise distribution that captures buyers' idiosyncratic tastes. The noise distribution and the relationship between the context vectors and buyers' demand curves are both unknown to the seller. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize his regret for revenue. We first propose a pricing policy when buyers are truthful and show that it achieves a \(T\)-period regret bound of \(\tilde{\mathcal{O}}(\sqrt{dT}) \) against a clairvoyant policy that has full information of the buyers' demand. Next, under the setting where buyers bid strategically to maximize their long-term discounted utility, we develop a variant of our first policy that is robust to strategic (corrupted) bids. This policy incorporates randomized "isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. Because of this nice property, our robust policy enjoys a $T$-period regret of \(\tilde{\mathcal{O}}(\sqrt{dT}) \), matching that under the truthful setting up to a constant factor that depends on the utility discount factor.

##### Rajat Talak

A Theory of Uncertainty Variables for State Estimation and InferenceProbability theory forms an overarching framework for modeling uncertainty, and by extension, also in designing state estimation and inference algorithms. While it provides a good foundation to system modeling, analysis, and an understanding of the real world, its application to algorithm design suffers from computational intractability. In this work, we develop a new framework of uncertainty variables to model uncertainty. An uncertainty variable is characterized by an uncertainty set, in which its realization is bound to lie, while the conditional uncertainty is characterized by a set map, from a given realization of a variable to a set of possible realizations of another variable. We prove Bayes' law and the law of total probability equivalents for uncertainty variables. We define a notion of independence, conditional independence, and pairwise independence for a collection of uncertainty variables, and show that this new notion of independence preserves the properties of independence defined over random variables. We then develop a graphical model, namely Bayesian uncertainty network, a Bayesian network equivalent defined over a collection of uncertainty variables, and show that all the natural conditional independence properties, expected out of a Bayesian network, hold for the Bayesian uncertainty network. We also define the notion of point estimate, and show its relation with the maximum a posteriori estimate. Probability theory starts with a distribution function (equivalently a probability measure) as a primitive and builds all other useful concepts, such as law of total probability, Bayes' law, independence, graphical models, point estimate, on it. Our work shows that it is perfectly possible to start with a set, instead of a distribution function, and retain all the useful ideas needed for state estimation and inference.

##### Yunzong Xu

Phase Transitions and Cyclic Phenomena in Bandits with Switching ConstraintsWe consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

##### Yuzhou Gu

Strong data processing inequalities and reconstruction problemsWe show that strong data processing inequalities give strong impossibility results for a general class of reconstruction problems on trees. Our result apply to arbitrary trees, in contrast to previous impossibility results which often require analysis specific to regular trees or Galton-Watson trees. We achieved improved impossibility results for the Potts model on a tree. We also apply the method to the weak recovery problem of stochastic block model and improve the current best impossibility results in the assortative regime.

##### Soumya Sudhakar

Balancing Actuation Energy and Computing Energy in Motion PlanningWe study a novel class of motion planning problems, inspired by emerging low-energy robotic vehicles, such as insect-size flyers, high-endurance autonomous blimps, and chip-size satellites for which the energy consumed by computing hardware during planning a path can be as large as the energy consumed by actuation hardware during the execution of the same path. We propose a new algorithm, called Compute Energy Included Motion Planning (CEIMP). CEIMP operates similarly to any other anytime planning algorithm, except it stops when it estimates further computing will require more computing energy than potential savings in actuation energy. The algorithm relies on Bayesian inference to estimate future energy savings in order to evaluate the trade-off between computing energy required to continue sampling and the potential future actuation energy savings after such computation. We show that CEIMP has the same asymptotic computational complexity as existing sampling-based motion planning algorithms, such as PRM*. We also show that CEIMP outperforms the average baseline of using maximum computing resources, in realistic computational experiments involving 10 floor plans from MIT buildings. In one representative experiment, CEIMP outperforms the average baseline 90.6% of the time when energy to compute one more second is equal to the energy to move one more meter, and 99.7% of the time when energy to compute one more second is equal to or greater than the energy to move 3 more meters.

##### Sagar Indurkhya

Solving for Syntax: Inferring Minimalist Grammars with an SMT-SolverWe present an implemented novel procedure for inferring Minimalist Grammars (MG). Our procedure models an MG as a system of first-order logic formulae that is evaluated with the Z3 SMT solver. The input to the procedure is a sequence of sentences annotated with syntactic relations encoding predicate-argument structure and subject-verb agreement. The implementation outputs a set of MGs that can parse each input sentence, yielding the same syntactic relations as in the original input. We present and analyze how MGs inferred by this procedure align with contemporary theories of minimalist syntax.

##### Dongchan Lee

Convex Restriction and its ApplicationsIn this talk, a convex sufficient condition will be presented for solving optimization problems involving a system of nonlinear equations. Many physical and virtual systems, including dynamical systems, electric power grid, robotics, and neural networks, can be described with nonlinear equations. The problems involving nonlinear equality constraints pose challenges in finding optimal solutions and incorporating robustness against uncertain variables. Convex restriction derives a convex sufficient condition for solving a system of nonlinear equations subject to non-convex inequality constraints and provides a way to incorporate robustness. Using the proposed condition, a non-convex optimization problem can be solved as a sequence of convex optimization problems, with feasibility and robustness guarantees. We show its applications in control, energy systems and machine learning.