(I have stopped updating this page, as it felt like a waste of time; for the most recent drafts, please check my Google Scholar profile)
Conference papers

Qihan Wang, Chen Dun, Fangshuo Liao, Chris Jermaine, Anastasios Kyrillidis, “LOFT: Finding Lottery Tickets through Filter-wise Training”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS 2023), 2023.
Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist “winning tickets” in large neural networks. These tickets represent “sparse” versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to pretrain the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT i) preserves and finds good lottery tickets, while ii) it achieves non-trivial computation and communication savings, and maintains comparable or even better accuracy than other pretraining methods.
@inproceedings{loft2023wang, title = {LOFT: Finding Lottery Tickets through Filter-wise Training}, author = {Wang, Qihan and Dun, Chen and Liao, Fangshuo and Jermaine, Chris and Kyrillidis, Anastasios}, booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics}, pages = {6498--6526}, year = {2023}, editor = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem}, volume = {206}, series = {Proceedings of Machine Learning Research}, month = {25--27 Apr}, publisher = {PMLR} }
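To illustrate the filter-wise partitioning idea behind LoFT, here is a minimal NumPy sketch; the function name, shapes, and uniform random split are illustrative assumptions of this sketch, not the paper's implementation (which also handles synchronization and the filter distance metric).

```python
import numpy as np

def partition_conv_filters(weights, num_workers, rng):
    """Randomly split a conv layer's filters (output channels) across workers,
    so each worker trains a narrow sub-layer independently (LoFT-style idea).

    weights: array of shape (out_channels, in_channels, kH, kW).
    Returns a list of (sub_weights, filter_indices) pairs, one per worker.
    """
    perm = rng.permutation(weights.shape[0])
    return [(weights[idx], idx) for idx in np.array_split(perm, num_workers)]
```

After local training, each worker's filters would be written back into the full layer using the saved indices.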

Chen Dun, Mirian Hipolito, Chris Jermaine, Dimitrios Dimitriadis, Anastasios Kyrillidis, “Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS 2023), 2023.
Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, AsyncDrop achieves better performance compared to state-of-the-art asynchronous methodologies, while resulting in lower communication and training-time overheads. The key idea revolves around creating “submodels” out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.
@inproceedings{asyncdrop2023dun, title = {Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout}, author = {Dun, Chen and Hipolito, Mirian and Jermaine, Chris and Dimitriadis, Dimitrios and Kyrillidis, Anastasios}, booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics}, pages = {6630--6660}, year = {2023}, editor = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem}, volume = {206}, series = {Proceedings of Machine Learning Research}, month = {25--27 Apr}, publisher = {PMLR} }

Zheyang Xiong, Fangshuo Liao, Anastasios Kyrillidis, “Strong Lottery Ticket Hypothesis with ε-perturbation”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS 2023), 2023.
The strong Lottery Ticket Hypothesis (LTH) (Ramanujan et al., 2019; Zhou et al., 2019) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pretraining step to some perturbation around initialization. In particular, we focus on the following open questions: By allowing a perturbation on the random initial weights, can we reduce the overparameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbations? We answer the first question by first extending the theoretical result on the subset sum problem (Lueker, 1998) to allow perturbation on the candidates. Applying this result to the neural network setting, we show that by allowing scale perturbation, we can reduce the overparameterization requirement of the strong LTH. To answer the second question, we show via experiments that the perturbed weights achieved by projected SGD show better performance under strong-LTH pruning.
@inproceedings{slth2023xiong, title = {Strong Lottery Ticket Hypothesis with $\epsilon$-perturbation}, author = {Xiong, Zheyang and Liao, Fangshuo and Kyrillidis, Anastasios}, booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics}, pages = {6879--6902}, year = {2023}, editor = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem}, volume = {206}, series = {Proceedings of Machine Learning Research}, month = {25--27 Apr}, publisher = {PMLR} }
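The subset-sum connection can be illustrated with a tiny brute-force search: given enough random candidate weights, some subset sums very close to any target weight, and pruning amounts to selecting that subset. The toy code below is only an illustration of this mechanism, not the paper's construction.

```python
import itertools
import random

def best_subset_sum(candidates, target):
    """Find the subset of `candidates` whose sum is closest to `target`."""
    best, best_err = (), abs(target)      # empty subset as the baseline
    for r in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, r):
            err = abs(sum(subset) - target)
            if err < best_err:
                best, best_err = subset, err
    return best, best_err

# With only 12 uniform random candidates there are 2^12 subsets to choose
# from, so some subset lands very close to the target.
random.seed(0)
candidates = [random.uniform(-1, 1) for _ in range(12)]
subset, err = best_subset_sum(candidates, 0.3)
```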

Carlos Quintero-Peña, Zachary Kingston, Tianyang Pan, Rahul Shome, Anastasios Kyrillidis, and Lydia E. Kavraki, “Optimal Grasps and Placements for Task and Motion Planning in Clutter”, IEEE International Conference on Robotics and Automation (ICRA), 2023.
Many methods that solve robot planning problems, such as task and motion planners, employ discrete symbolic search to find sequences of valid symbolic actions that are grounded with motion planning. Much of the efficacy of these planners lies in this grounding: bad placement and grasp choices can lead to inefficient planning when a problem has many geometric constraints. Moreover, grounding methods such as naive sampling often fail to find appropriate values for these choices in the presence of clutter. Towards efficient task and motion planning, we present a novel optimization-based approach for grounding to solve cluttered problems that have many constraints that arise from geometry. Our approach finds an optimal grounding and can provide feedback to discrete search for more effective planning. We demonstrate our method against baseline methods in complex simulated environments.
@inproceedings{quinterooptimal, title={Optimal Grasps and Placements for Task and Motion Planning in Clutter}, author={Quintero-Pe{\~n}a, Carlos and Kingston, Zachary and Pan, Tianyang and Shome, Rahul and Kyrillidis, Anastasios and Kavraki, Lydia E}, booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)}, year={2023} }

Syed Rizvi, Chen Dun, Anastasios Kyrillidis, “PCRIST: Variance Reduction through Periodic Centralized Training in Distributed Subnetwork Training of Residual Networks”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), in proceedings of the “Timely and Private Machine Learning over Networks” workshop, 2023.
In this work, we propose Periodic Centralized ResIST (PCRIST), a distributed training protocol for residual convolutional neural networks that reduces variance during training in order to recover model performance in distributed settings. PCRIST introduces short periods of centralized training into the ResIST distributed training protocol, aiming to reduce the variance among the distributed workers, each of which trains subnets of the global ResNet. We evaluate PCRIST on two benchmark image classification tasks, and demonstrate superior performance to the ResIST training protocol while maintaining its communication-efficiency advantages over baseline distributed training protocols.
@article{rivzi2023pcrist, title={PCRIST: Variance Reduction through Periodic Centralized Training in Distributed Subnetwork Training of Residual Networks}, author={Rizvi, Syed and Dun, Chen and Kyrillidis, Anastasios}, journal={preprint}, year={2023} }
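The high-level round structure can be sketched in a few lines; the exact alternation pattern and the `period` parameter are illustrative assumptions about how centralized rounds are interleaved with distributed ones.

```python
def pcrist_schedule(num_rounds, period):
    """Plan PCRIST-style training rounds: distributed subnet-training rounds,
    punctuated by a centralized full-model round every `period`-th round to
    reduce variance across workers."""
    return ["centralized" if (r + 1) % period == 0 else "distributed"
            for r in range(num_rounds)]
```

For example, `pcrist_schedule(10, 5)` yields a centralized round after every five rounds of distributed subnet training.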

Junhyung Lyle Kim, Mohammad Taha Toghani, César A. Uribe, Anastasios Kyrillidis, “Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography”, IEEE Conference on Decision and Control (CDC), 2022.
We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterize the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sublinear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.
@inproceedings{kim2022local, title={Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography}, author={Kim, Junhyung Lyle and Toghani, Mohammad Taha and Uribe, C{\'e}sar A and Kyrillidis, Anastasios}, booktitle={IEEE Conference on Decision and Control (CDC)}, year={2022} }
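The factored update at the heart of SFGD can be sketched on a single machine as plain gradient descent on the factor U of rho = U U^T; the squared-Frobenius objective, step size, and initialization below are illustrative choices for a toy problem, not the paper's QST measurement model or its distributed averaging.

```python
import numpy as np

def factored_gd(M, rank, steps=500, lr=0.01, seed=0):
    """Estimate a low-rank PSD matrix rho = U U^T by gradient descent on the
    factor U, minimizing ||U U^T - M||_F^2 (the factored-gradient idea)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((M.shape[0], rank))
    for _ in range(steps):
        R = U @ U.T - M               # residual of the current estimate
        U = U - lr * 4.0 * (R @ U)    # gradient of the squared Frobenius loss
    return U
```

Working on the factor keeps every iterate PSD and rank-bounded by construction, which is the appeal of the factored formulation.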
Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin, “PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication”, International Conference on Learning Representations (ICLR), 2022.
Graph Convolutional Networks (GCNs) are the state-of-the-art method for learning graph-structured data, and training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs prohibitive overhead of communicating node features and feature gradients among partitions for every GCN layer in each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple-yet-effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline for efficient GCN training, as communicated node features/gradients will become stale and thus can harm the convergence, negating the pipeline benefit. Notably, little is known regarding the convergence rate of GCN training with both stale features and stale feature gradients. This work not only provides a theoretical convergence guarantee but also finds the convergence rate of PipeGCN to be close to that of vanilla distributed GCN training without staleness. Furthermore, we develop a smoothing method to further improve PipeGCN’s convergence. Extensive experiments show that PipeGCN can largely boost training throughput (up to 2.2x) while achieving the same accuracy as its vanilla counterpart and outperforming existing full-graph training methods.
@inproceedings{wan2021pipegcn, title={PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication}, author={Wan, Cheng and Li, Youjie and Wolfe, Cameron R and Kyrillidis, Anastasios and Kim, Nam Sung and Lin, Yingyan}, booktitle={International Conference on Learning Representations}, year={2022} }

Binhang Yuan, Cameron Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis and Chris Jermaine, “Distributed Learning of Deep Neural Networks Using Independent Subnet Training”, International Conference on Very Large Databases (VLDB), 2022.
Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally “model parallel” approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.
@article{yuan2022distributed, title={Distributed learning of fully connected neural networks using independent subnet training}, author={Yuan, Binhang and Wolfe, Cameron R and Dun, Chen and Tang, Yuxin and Kyrillidis, Anastasios and Jermaine, Chris}, journal={Proceedings of the VLDB Endowment}, volume={15}, number={8}, pages={1581--1590}, year={2022}, publisher={VLDB Endowment} }
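For a two-layer fully connected network, the subnet decomposition can be sketched as a partition of the hidden neurons; the function names and the omission of the local-training loop are simplifications of this sketch, not the paper's full protocol.

```python
import numpy as np

def partition_subnets(W1, W2, num_workers, rng):
    """Split a two-layer net (input -> hidden -> output) into narrow subnets
    by partitioning hidden neurons: each worker receives the matching columns
    of W1 and rows of W2, and trains them locally and independently."""
    perm = rng.permutation(W1.shape[1])
    return [(W1[:, idx], W2[idx, :], idx)
            for idx in np.array_split(perm, num_workers)]

def merge_subnets(W1, W2, subnets):
    """Write each worker's locally updated weights back into the full model."""
    for w1, w2, idx in subnets:
        W1[:, idx], W2[idx, :] = w1, w2
    return W1, W2
```

Each round would repeat partition, local training, and merge with a fresh random permutation.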

Ahmed Imtiaz Humayun, Randall Balestriero, Anastasios Kyrillidis, Richard Baraniuk, “No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), 2022.
Centroid-based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real-world datasets often contain inherent abnormalities, e.g., repeated samples and sampling bias, that manifest as imbalanced clustering. We propose to remedy such scenarios by introducing a maximal radius constraint r on the clusters formed by the centroids, i.e., samples from the same cluster should not be more than 2r apart in terms of ℓ2 distance. We achieve this constraint by solving a semidefinite program, followed by a linear assignment problem with quadratic constraints. Through qualitative results, we show that our proposed method is robust towards dataset imbalances and sampling artefacts. To the best of our knowledge, ours is the first constrained k-means clustering method with hard radius constraints.
@inproceedings{humayun2022no, title={No More Than 6ft Apart: Robust $k$-Means via Radius Upper Bounds}, author={Humayun, Ahmed Imtiaz and Balestriero, Randall and Kyrillidis, Anastasios and Baraniuk, Richard}, booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={4433--4437}, year={2022}, organization={IEEE} }
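The hard constraint itself is easy to state in code: for radius bound r, every cluster's diameter must be at most 2r. Below is a small checker for that condition (not the SDP-based solver, which is the paper's contribution):

```python
import numpy as np

def max_cluster_diameter(X, labels):
    """Return the largest pairwise l2 distance between points that share a
    cluster; a clustering satisfies the radius constraint iff this is <= 2r."""
    diam = 0.0
    for c in np.unique(labels):
        P = X[labels == c]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        diam = max(diam, float(D.max()))
    return diam
```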

John Chen, Cameron Wolfe, Zhao Li, Anastasios Kyrillidis, “Demon: Decaying momentum helps neural network training”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), 2022.
Momentum is a widely used technique for gradient-based optimizers in deep learning. Here, we propose a decaying momentum (DEMON) hyperparameter rule. We conduct a large-scale empirical analysis of momentum decay methods for modern neural network optimization and compare against the most popular learning rate decay schedules. Across 28 relevant combinations of models, epochs, datasets, and optimizers, DEMON achieves Top-1 and Top-3 performance in 39% and 85% of cases, respectively, almost doubling the second-placed cosine learning rate schedule at 17% and 60%, respectively. DEMON consistently outperforms other widely used schedulers including, but not limited to, the learning rate step schedule, linear schedule, OneCycle schedule, and exponential schedule. Compared with the widely used learning rate step schedule, DEMON is less sensitive to parameter tuning, which is critical to training neural networks in practice. Results are demonstrated across a variety of settings and architectures, including image classification models, generative models, and language models. DEMON is easy to implement, requires no additional tuning, and incurs almost no extra computational overhead compared to the vanilla counterparts. Code is readily available.
@inproceedings{chen2022demon, title={Demon: Improved Neural Network Training with Momentum Decay}, author={Chen, John and Wolfe, Cameron and Li, Zhao and Kyrillidis, Anastasios}, booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={3958--3962}, year={2022}, organization={IEEE} }
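The decay rule itself is one line; the sketch below follows the DEMON formula as I recall it, annealing the momentum coefficient from its initial value to zero over the training horizon (variable names are illustrative).

```python
def demon_momentum(t, T, beta_init=0.9):
    """DEMON momentum coefficient at step t of T total steps: starts at
    beta_init and decays smoothly to zero by the end of training."""
    z = 1.0 - t / T                # fraction of training remaining
    return beta_init * z / ((1.0 - beta_init) + beta_init * z)
```

An optimizer would simply query `demon_momentum(t, T)` each step in place of a fixed momentum value.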

Chen Dun, Cameron Wolfe, Chris Jermaine, Anastasios Kyrillidis, “ResIST: Layer-Wise Decomposition of ResNets for Distributed Training”, Conference on Uncertainty in Artificial Intelligence (UAI 2022), 2022.
We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
@inproceedings{dun2022resist, title={ResIST: Layer-Wise Decomposition of ResNets for Distributed Training}, author={Dun, Chen and Wolfe, Cameron R and Jermaine, Chris and Kyrillidis, Anastasios}, booktitle={The 38th Conference on Uncertainty in Artificial Intelligence}, year={2022} }
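One plausible sketch of the random layer-wise decomposition follows; sharing the first and last blocks across all workers is an assumption of this sketch, and the paper spells out the exact assignment rules.

```python
import random

def partition_blocks(num_blocks, num_workers, seed=0):
    """Assign residual blocks to workers, ResIST-style: the first and last
    blocks are shared by every worker, and each intermediate block goes to
    one randomly chosen worker, so each worker trains a shallow sub-ResNet."""
    rng = random.Random(seed)
    subnets = [{0, num_blocks - 1} for _ in range(num_workers)]
    for block in range(1, num_blocks - 1):
        subnets[rng.randrange(num_workers)].add(block)
    return [sorted(s) for s in subnets]
```

Every block lands on some worker, so synchronizing and re-partitioning each round updates the whole global model over time.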

John Chen, Samarth Sinha, Anastasios Kyrillidis, “StackMix: A complementary Mix algorithm”, Conference on Uncertainty in Artificial Intelligence (UAI 2022), 2022.
Techniques combining multiple images as input/output have proven to be effective data augmentations for training convolutional neural networks. In this paper, we present StackMix: each input is presented as a concatenation of two images, and the label is the mean of the two one-hot labels. On its own, StackMix rivals other widely used methods in the “Mix” line of work. More importantly, unlike previous work, significant gains across a variety of benchmarks are achieved by combining StackMix with existing Mix augmentations, effectively mixing more than two images. For example, by combining StackMix with CutMix, test error in the supervised setting is improved across a variety of settings over CutMix, including 0.8% on ImageNet, 3% on Tiny ImageNet, 2% on CIFAR-100, 0.5% on CIFAR-10, and 1.5% on STL-10. Similar results are achieved with Mixup. We further show that the gains hold for robustness to common input corruptions and perturbations at varying severities, with a 0.7% improvement on CIFAR-100-C by combining StackMix with AugMix over AugMix alone. On its own, improvements with StackMix hold across different numbers of labeled samples on CIFAR-100, maintaining approximately a 2% gap in test accuracy, down to using only 5% of the whole dataset, and StackMix is effective in the semi-supervised setting with a 2% improvement with the standard benchmark model. Finally, we perform an extensive ablation study to better understand the proposed methodology.
@inproceedings{chen2022stackmix, title={StackMix: A complementary {M}ix algorithm}, author={Chen, John and Sinha, Samarth and Kyrillidis, Anastasios}, booktitle={The 38th Conference on Uncertainty in Artificial Intelligence}, year={2022} }
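The augmentation is simple enough to write down directly; here is a NumPy sketch, where the concatenation axis and one-hot handling are illustrative choices.

```python
import numpy as np

def stackmix(x1, y1, x2, y2, num_classes, axis=0):
    """StackMix-style pair: concatenate two images along a spatial axis and
    average their one-hot labels."""
    x = np.concatenate([x1, x2], axis=axis)
    y = (np.eye(num_classes)[y1] + np.eye(num_classes)[y2]) / 2.0
    return x, y
```

For two 32x32x3 images the result is a 64x32x3 input whose soft label puts mass 0.5 on each of the two classes.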

John Chen, Cameron Wolfe, Anastasios Kyrillidis, “REX: Revisiting Budgeted Training with an Improved Schedule”, 5th Conference on Machine Learning and Systems (MLSys 2022), 2022.
Deep learning practitioners often operate on a computational and monetary budget. Thus, it is critical to design optimization algorithms that perform well under any budget. The linear learning rate schedule is considered the best budget-aware schedule, as it outperforms most other schedules in the low-budget regime. On the other hand, learning rate schedules – such as the 30-60-90 step schedule – are known to achieve high performance when the model can be trained for many epochs. Yet, it is often not known a priori whether one’s budget will be large or small; thus, the optimal choice of learning rate schedule is made on a case-by-case basis. In this paper, we frame the learning rate schedule selection problem as a combination of selecting a profile (i.e., the continuous function that models the learning rate schedule) and choosing a sampling rate (i.e., how frequently the learning rate is updated/sampled from this profile). We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule, which we evaluate across seven different experimental settings with both SGD and Adam optimizers. REX outperforms the linear schedule in the low-budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules (linear, step, exponential, cosine, step decay on plateau, and OneCycle) in both high- and low-budget regimes. Furthermore, REX requires no added computation, storage, or hyperparameters.
@article{chen2022rex, title={REX: Revisiting Budgeted Training with an Improved Schedule}, author={Chen, John and Wolfe, Cameron and Kyrillidis, Anastasios}, journal={Proceedings of Machine Learning and Systems}, volume={4}, pages={64--76}, year={2022} }
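The REX profile can be written in a few lines; the formula below is my rendering of the reflected-exponential profile, evaluated once per update (hyperparameter names are illustrative).

```python
def rex_lr(t, T, lr_init=0.1):
    """REX-style learning rate at step t of T: decays from lr_init to zero,
    staying above the linear schedule between the endpoints."""
    z = 1.0 - t / T                 # fraction of the budget remaining
    return lr_init * z / (0.5 + 0.5 * z)
```

Because the profile only depends on the fraction of budget consumed, the same code serves any training budget T.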

Cameron R. Wolfe, Anastasios Kyrillidis, “iSpaSP: Structured Neural Pruning via Sparse Signal Recovery”, 4th Annual Learning for Dynamics & Control Conference (L4DC 2022), 2022.
We propose a novel, structured pruning algorithm for neural networks: the iterative, Sparse Structured Pruning algorithm, dubbed iSpaSP. Inspired by ideas from sparse signal recovery, iSpaSP operates by iteratively identifying a larger set of important parameter groups (e.g., filters or neurons) within a network that contribute most to the residual between pruned and dense network output, then thresholding these groups based on a smaller, predefined pruning ratio. For both two-layer and multi-layer network architectures with ReLU activations, we show the error induced by pruning with iSpaSP decays polynomially, where the degree of this polynomial becomes arbitrarily large based on the sparsity of the dense network’s hidden representations. In our experiments, iSpaSP is evaluated across a variety of datasets (i.e., MNIST, ImageNet, and XNLI) and architectures (i.e., feed-forward networks, ResNet-34, MobileNetV2, and BERT), where it is shown to discover high-performing subnetworks and improve upon the pruning efficiency of provable baseline methodologies by several orders of magnitude. Put simply, iSpaSP is easy to implement with automatic differentiation, achieves strong empirical results, comes with theoretical convergence guarantees, and is efficient, thus distinguishing itself as one of the few computationally efficient, practical, and provable pruning algorithms.
@inproceedings{wolfe2022spasp, title={iSpaSP: Structured Neural Pruning via Sparse Signal Recovery}, author={Wolfe, Cameron R and Kyrillidis, Anastasios}, booktitle={Learning for Dynamics and Control Conference}, pages={248--262}, year={2022}, organization={PMLR} }
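The thresholding step shared by this family of methods is easy to sketch; the code below shows only the generic keep-the-top-groups step for some importance scores, not iSpaSP's residual-based identification itself.

```python
import numpy as np

def prune_groups(scores, prune_ratio):
    """Keep the highest-scoring parameter groups (e.g., filters or neurons)
    and prune the rest, per a predefined pruning ratio."""
    k = max(1, int(round(len(scores) * (1.0 - prune_ratio))))
    keep = np.argsort(scores)[::-1][:k]           # indices of top-k groups
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask
```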

Junhyung Lyle Kim, Panos Toulis, Anastasios Kyrillidis, “Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum”, 4th Annual Learning for Dynamics & Control Conference (L4DC 2022), 2022.
Stochastic gradient descent with momentum (SGDM) is the dominant algorithm in many optimization scenarios, including convex optimization instances and nonconvex neural network training. Yet, in the stochastic setting, momentum interferes with gradient noise, often leading to specific step size and momentum choices in order to guarantee convergence, let alone acceleration. Proximal point methods, on the other hand, have gained much attention due to their numerical stability and elasticity against imperfect tuning. Their stochastic accelerated variants though have received limited attention: how momentum interacts with the stability of (stochastic) proximal point methods remains largely unstudied. To address this, we focus on the convergence and stability of the stochastic proximal point algorithm with momentum (SPPAM), and show that SPPAM allows a faster linear convergence to a neighborhood compared to the stochastic proximal point algorithm (SPPA) with a better contraction factor, under proper hyperparameter tuning. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM, allowing a wider range of step sizes and momentum values that lead to convergence.
@inproceedings{kim2022convergence, title={Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum}, author={Kim, Junhyung Lyle and Toulis, Panos and Kyrillidis, Anastasios}, booktitle={Learning for Dynamics and Control Conference}, pages={1034--1047}, year={2022}, organization={PMLR} }
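On a least-squares toy problem the SPPAM iteration has a closed-form proximal step, which makes the method easy to sketch; the objective, step size, and momentum values here are illustrative choices, and the paper's analysis covers a more general class of losses.

```python
import numpy as np

def sppam_lsq(A, b, steps=2000, eta=1.0, beta=0.3, seed=0):
    """SPPAM sketch for least squares: extrapolate with momentum, then apply
    the exact proximal step of the single sampled row's loss."""
    rng = np.random.default_rng(seed)
    x = x_prev = np.zeros(A.shape[1])
    for _ in range(steps):
        i = rng.integers(len(b))
        a = A[i]
        z = x + beta * (x - x_prev)               # momentum extrapolation
        # prox of f_i(x) = 0.5 * (a @ x - b[i])**2 at z, with step size eta
        x_prev, x = x, z - eta * a * (a @ z - b[i]) / (1.0 + eta * (a @ a))
    return x
```

Note the implicit (proximal) step stays stable for large `eta`, which is the robustness-to-tuning property the abstract highlights.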
John Chen, Qihan Wang, Anastasios Kyrillidis, “Mitigating deep double descent by concatenating inputs”, ACM International Conference on Information and Knowledge Management (CIKM 2021), 2021.
The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomenon and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number of epochs.
@inproceedings{chen2021mitigating, title={Mitigating deep double descent by concatenating inputs}, author={Chen, John and Wang, Qihan and Kyrillidis, Anastasios}, booktitle={ACM International Conference on Information and Knowledge Management (CIKM)}, year={2021} }

Carlos Quintero-Peña, Anastasios Kyrillidis, Lydia E. Kavraki, “Robust Optimization-based Motion Planning for high-DOF Robots under Sensing Uncertainty”, IEEE International Conference on Robotics and Automation (ICRA 2021), 2021.
Motion planning for high degree-of-freedom (DOF) robots is challenging, especially when acting in complex environments under sensing uncertainty. While there is significant work on how to plan under state uncertainty for low-DOF robots, existing methods cannot be easily translated into the high-DOF case, due to the complex geometry of the robot’s body and its environment. In this paper, we present a method that enhances optimization-based motion planners to produce robust trajectories for high-DOF robots for convex obstacles. Our approach introduces robustness into planners that are based on sequential convex programming: we reformulate each convex subproblem as a robust optimization problem that “protects” the solution against deviations due to sensing uncertainty. The parameters of the robust problem are estimated by sampling from the distribution of noisy obstacles, and performing a first-order approximation of the signed distance function. The original merit function is updated to account for the new costs of the robust formulation at every step. The effectiveness of our approach is demonstrated on two simulated experiments that involve a full-body square robot that moves in randomly generated scenes, and a 7-DOF Fetch robot performing tabletop operations. The results show nearly zero probability of collision for a reasonable range of the noise parameters for Gaussian and uniform uncertainty.
@inproceedings{quintero2021robust, title={Robust Optimization-based Motion Planning for high-DOF Robots under Sensing Uncertainty}, author={Quintero-Pe{\~n}a, Carlos and Kyrillidis, Anastasios and Kavraki, Lydia E}, booktitle={2021 IEEE International Conference on Robotics and Automation (ICRA)}, year={2021} }

Jacky Y. Zhang, Rajiv Khanna, Anastasios Kyrillidis, and Oluwasanmi Koyejo, “Bayesian Coresets: Revisiting the Optimization Perspective”, Twenty-fourth International Conference on Artificial Intelligence and Statistics (AISTATS 2021), 2021.
@inproceedings{zhang2021bayesian, title={Bayesian Coresets: Revisiting the Optimization Perspective}, author={Zhang, Jacky and Khanna, Rajiv and Kyrillidis, Anastasios and Koyejo, Oluwasanmi}, booktitle={International Conference on Artificial Intelligence and Statistics}, year={2021} }

Anastasios Kyrillidis, Moshe Y. Vardi, and Zhiwei Zhang, “On Continuous Local BDD-Based Search for Hybrid SAT Solving”, AAAI Conference on Artificial Intelligence (AAAI 2021), 2021.
We explore the potential of continuous local search (CLS) in SAT solving by proposing a novel approach for finding a solution of a hybrid system of Boolean constraints. The algorithm is based on CLS combined with belief propagation on binary decision diagrams (BDDs). Our framework accepts all Boolean constraints that admit compact BDDs, including symmetric Boolean constraints and small-coefficient pseudo-Boolean constraints as interesting families. We propose a novel algorithm for efficiently computing the gradient needed by CLS. We study the capabilities and limitations of our versatile CLS solver, GradSAT, by applying it on many benchmark instances. The experimental results indicate that GradSAT can be a useful addition to the portfolio of existing SAT and MaxSAT solvers for solving Boolean satisfiability and optimization problems.
@inproceedings{kyrillidis2021continuous, title={On Continuous Local BDD-Based Search for Hybrid SAT Solving}, author={Kyrillidis, Anastasios and Vardi, Moshe Y and Zhang, Zhiwei}, booktitle={AAAI Conference on Artificial Intelligence}, year={2021} }

John Chen, Vatsal Shah and Anastasios Kyrillidis, ``Negative sampling in semi-supervised learning”, International Conference on Machine Learning (ICML-20), 2020.
We introduce Negative Sampling in Semi-Supervised Learning (NS3L), a simple, fast, easy-to-tune algorithm for semi-supervised learning (SSL). NS3L is motivated by the success of negative sampling/contrastive estimation. We demonstrate that adding the NS3L loss to state-of-the-art SSL algorithms, such as Virtual Adversarial Training (VAT), significantly improves upon vanilla VAT and its variant, VAT with Entropy Minimization. By adding the NS3L loss to MixMatch, the current state-of-the-art approach on semi-supervised tasks, we observe significant improvements over vanilla MixMatch. We conduct extensive experiments on the CIFAR-10, CIFAR-100, SVHN and STL-10 benchmark datasets. Finally, we perform an ablation study for NS3L regarding its hyperparameter tuning.
@inproceedings{chen2020negative, title={Negative sampling in semi-supervised learning}, author={Chen, John and Shah, Vatsal and Kyrillidis, Anastasios}, booktitle={International Conference on Machine Learning}, year={2020} }
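To make the negative-sampling idea behind NS3L concrete, here is a minimal numpy sketch of a negative-sampling-style loss on unlabeled predictions. The threshold rule and the exact functional form below are illustrative assumptions, not the paper's precise NS3L objective: classes predicted below a confidence threshold are treated as negatives, and the loss penalizes probability mass assigned to them.

```python
import numpy as np

def negative_sampling_loss(probs, threshold=0.1):
    # probs: (batch, classes) predicted class probabilities for unlabeled data.
    # Classes with predicted probability below `threshold` are treated as
    # negatives; the loss is -mean log(1 - p_c) over those negative classes.
    neg = probs < threshold           # negative-class mask (illustrative rule)
    eps = 1e-12                       # numerical floor inside the log
    return -np.sum(np.log(1.0 - probs[neg] + eps)) / max(int(neg.sum()), 1)
```

In an SSL pipeline, this term would be added (with a weight) to a base loss such as VAT's or MixMatch's; here it simply returns 0.0 when no class falls below the threshold.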

Kelly Geyer, Anastasios Kyrillidis, and Amir Kalev, ``Low-rank regularization and solution uniqueness in overparameterized matrix sensing”, Twenty-third International Conference on Artificial Intelligence and Statistics (AISTATS-20), 2020.
We consider the question whether algorithmic choices in overparameterized linear matrix factorization introduce implicit low-rank regularization. We focus on the noiseless matrix sensing scenario over low-rank positive semidefinite (PSD) matrices over the reals, with a sensing mechanism that satisfies restricted isometry properties. Surprisingly, it was recently argued that for recovery of PSD matrices, gradient descent over a squared, full-rank factorized space introduces implicit low-rank regularization. Thus, a clever choice of the recovery algorithm avoids the need for explicit low-rank regularization. In this contribution, we prove that in fact, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique low-rank matrix recovery, without explicit or implicit regularization. Therefore, under these conditions, the set of PSD matrices that are consistent with the observed data is a singleton, regardless of the algorithm used. Our numerical study indicates that this result is general and extends to cases beyond those covered by the proof.
@inproceedings{geyer2020low, title={Low-rank regularization and solution uniqueness in overparameterized matrix sensing}, author={Geyer, Kelly and Kyrillidis, Anastasios and Kalev, Amir}, booktitle={International Conference on Artificial Intelligence and Statistics}, pages={930--940}, year={2020} }

Anastasios Kyrillidis, Anshumali Shrivastava, Moshe Y. Vardi, Zhiwei Zhang, ``FourierSAT: A Fourier Expansion-Based Algebraic Framework for Solving Hybrid Boolean Constraints”, AAAI Conference on Artificial Intelligence (AAAI-20), 2020.
The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side, especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers, has been remarkable. Yet, while SAT solvers aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF) have become quite mature, SAT solvers that are effective on other types of constraints, e.g., cardinality constraints and XORs, are less well studied; a general approach to handling non-CNF constraints is still lacking. In addition, previous work indicated that for specific classes of benchmarks, the running time of extant SAT solvers depends heavily on properties of the formula and details of encoding, instead of the scale of the benchmarks, which adds uncertainty to expectations of running time. To address the issues above, we design FourierSAT, an incomplete SAT solver based on Fourier analysis of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. Empirical results demonstrate that FourierSAT is more robust than other solvers on certain classes of benchmarks.
@inproceedings{kyrillidis2019fouriersat, title={FourierSAT: A Fourier Expansion-Based Algebraic Framework for Solving Hybrid Boolean Constraints}, author={Kyrillidis, Anastasios and Shrivastava, Anshumali and Vardi, Moshe Y and Zhang, Zhiwei}, booktitle={AAAI Conference on Artificial Intelligence}, year={2020} }
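A toy sketch of the core reduction: a CNF clause becomes a multilinear polynomial over variables in [-1, 1] (here, -1 plays True and +1 plays False) that vanishes exactly at satisfying assignments, so satisfiability turns into continuous minimization. This is only an illustration of the idea; the gradient below is taken numerically, whereas the paper computes it analytically, and all names here are our own.

```python
import numpy as np

def clause_penalty(x, clause):
    # clause: signed ints, e.g. [1, -2] encodes (v1 OR NOT v2).
    # The product below is the multilinear indicator that every literal is False.
    p = 1.0
    for lit in clause:
        xi = x[abs(lit) - 1] * (1.0 if lit > 0 else -1.0)
        p *= (1.0 + xi) / 2.0
    return p

def cnf_objective(x, clauses):
    # zero exactly at (corner) satisfying assignments
    return sum(clause_penalty(x, c) for c in clauses)

def continuous_sat(clauses, n, steps=300, lr=0.5, seed=0):
    # projected gradient descent on the box [-1, 1]^n
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    eps = 1e-6
    for _ in range(steps):
        g = np.zeros(n)
        for i in range(n):  # central-difference gradient, for illustration only
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            g[i] = (cnf_objective(xp, clauses) - cnf_objective(xm, clauses)) / (2 * eps)
        x = np.clip(x - lr * g, -1.0, 1.0)
    return x < 0  # round: True where the relaxation leans negative
```

On the small satisfiable instance (v1 ∨ v2) ∧ (¬v1 ∨ v2) ∧ (v1 ∨ ¬v2), gradient descent drives both variables to the satisfying corner.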

Jacky Y. Zhang, Rajiv Khanna, Anastasios Kyrillidis, Oluwasanmi O. Koyejo, ``Learning Sparse Distributions using Iterative Hard Thresholding”, Advances in Neural Information Processing Systems (NeurIPS-19), 2019.
Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state-of-the-art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions, while satisfying the simplex constraint, and investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and practice, that IHT can achieve state-of-the-art results for learning sparse distributions.
@inproceedings{zhang2019learning, title={Learning Sparse Distributions using Iterative Hard Thresholding}, author={Zhang, Jacky Y and Khanna, Rajiv and Kyrillidis, Anastasios and Koyejo, Oluwasanmi O}, booktitle={Advances in Neural Information Processing Systems}, pages={6757--6766}, year={2019} }
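For readers unfamiliar with IHT, here is the classic template the paper builds on, shown on sparse linear regression rather than distribution learning (the paper replaces the hard-thresholding step with a greedy sparse-simplex projection). A minimal sketch, with illustrative problem sizes:

```python
import numpy as np

def hard_threshold(x, k):
    # keep the k largest-magnitude entries, zero out the rest
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(A, y, k, steps=2000, lr=None):
    # minimize ||Ax - y||^2 subject to ||x||_0 <= k, via projected gradient:
    # gradient step on the least-squares loss, then hard thresholding
    if lr is None:
        lr = 1.0 / np.linalg.norm(A, 2) ** 2   # conservative 1/L step size
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = hard_threshold(x - lr * A.T @ (A @ x - y), k)
    return x
```

On a random well-conditioned instance with noiseless measurements, the iterates recover the planted sparse vector.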

Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava, ``Compressing Gradient Optimizers via Count-Sketches”, International Conference on Machine Learning (ICML-19), 2019.
Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger, in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6 GB instead of 11.7 GB) by compressing the Adam optimizer in the Embedding and Softmax layers, with negligible accuracy and performance loss.
@inproceedings{spring2019compressing, title={Compressing Gradient Optimizers via Count-Sketches}, author={Spring, Ryan and Kyrillidis, Anastasios and Mohan, Vijai and Shrivastava, Anshumali}, booktitle={International Conference on Machine Learning}, year={2019} }
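The data structure at the heart of this approach is the count-sketch: auxiliary optimizer state (momentum, second moments) is stored in a small hashed table rather than a full-sized vector. Below is a generic count-sketch in numpy, not the paper's optimizer integration; table sizes and hashing details are illustrative.

```python
import numpy as np

class CountSketch:
    # d independent hash rows over a width-w table; each index i maps to
    # one bucket per row with a random sign, and reads take the median
    # of the d signed bucket values.
    def __init__(self, d, w, n, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((d, w))
        self.bucket = rng.integers(0, w, size=(d, n))   # hash of each index, per row
        self.sign = rng.choice([-1.0, 1.0], size=(d, n))

    def update(self, i, delta):
        # accumulate a signed contribution into one bucket per row
        rows = np.arange(len(self.table))
        self.table[rows, self.bucket[:, i]] += self.sign[:, i] * delta

    def query(self, i):
        # median across rows is robust to a few hash collisions
        rows = np.arange(len(self.table))
        return np.median(self.table[rows, self.bucket[:, i]] * self.sign[:, i])
```

An optimizer would call `update(i, g_i)` instead of writing to a dense auxiliary vector, and `query(i)` when applying the parameter update.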

Anastasios Kyrillidis, ``Simple and practical algorithms for $\ell_p$-norm low-rank approximation”, Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.
We propose practical algorithms for entrywise $\ell_p$-norm low-rank approximation, for $p = 1$ or $p = \infty$. The proposed framework, which is non-convex and gradient-based, is easy to implement and typically attains better approximations, faster, than the state of the art. From a theoretical standpoint, we show that the proposed scheme can attain $(1 + \varepsilon)$-OPT approximations. Our algorithms are not hyperparameter-free: they achieve the desiderata only assuming the algorithm's hyperparameters are known a priori, or are at least approximable. That is, our theory indicates what problem quantities need to be known in order to get a good solution within polynomial time, and does not contradict recent inapproximability results.
@inproceedings{kyrillidis2018simple, title={Simple and practical algorithms for $\ell_p$-norm low-rank approximation}, author={Kyrillidis, Anastasios}, booktitle={Conference on Uncertainty in Artificial Intelligence (UAI)}, year={2018} }

Rajiv Khanna and Anastasios Kyrillidis, ``IHT dies hard: Provable accelerated Iterative Hard Thresholding”, Twenty-first International Conference on Artificial Intelligence and Statistics (AISTATS-18), 2018.
We study, both in theory and practice, the use of momentum motions in classic iterative hard thresholding (IHT) methods. By simply modifying plain IHT, we investigate its convergence behavior on convex optimization criteria with non-convex constraints, under standard assumptions. In diverse scenarios, we observe that acceleration in IHT leads to significant improvements, compared to state-of-the-art projected gradient descent and Frank-Wolfe variants. As a byproduct of our inspection, we study the impact of selecting the momentum parameter: similar to convex settings, two modes of behavior are observed, "rippling" and linear, depending on the level of momentum.
@inproceedings{khanna2018IHT, title={IHT dies hard: Provable accelerated Iterative Hard Thresholding}, author={Khanna, Rajiv and Kyrillidis, Anastasios}, booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)}, year={2018} }

Tianyang Li, Liu Liu, Anastasios Kyrillidis, and Constantine Caramanis, ``Statistical inference using SGD”, Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
We present a novel method for frequentist statistical inference in $M$-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a practical perspective, our SGD-based inference procedure is a first-order method, and is well-suited for large-scale problems. To show its merits, we apply it to both synthetic and real datasets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.
@inproceedings{li2018statistical, title={Statistical inference using SGD}, author={Li, Tianyang and Liu, Liu and Kyrillidis, Anastasios and Caramanis, Constantine}, booktitle={Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)}, year={2018} }

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi, ``Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach”, AI & Statistics Conference (AISTATS-17), 2017.
We consider the non-square matrix sensing problem, under restricted isometry property (RIP) assumptions. We focus on the non-convex formulation, where any rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ is represented as $UV^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In this paper, we complement recent findings on the non-convex geometry of the analogous PSD setting [5], and show that matrix factorization does not introduce any spurious local minima, under RIP.
@inproceedings{park2017non, title={Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach}, author={Park, Dohyung and Kyrillidis, Anastasios and Caramanis, Constantine and Sanghavi, Sujay}, booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)}, year={2017} }

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi, ``Dropping convexity for faster semidefinite optimization”, Conference on Learning Theory (COLT-16), 2016.
We study the minimization of a convex function $f(X)$ over the set of $n \times n$ positive semidefinite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$, which we refer to as Factored Gradient Descent (FGD), under standard assumptions on the original function $f$. We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives, and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees. FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
@inproceedings{bhojanapalli2016dropping, title={Dropping convexity for faster semidefinite optimization}, author={Bhojanapalli, Srinadh and Kyrillidis, Anastasios and Sanghavi, Sujay}, booktitle={Conference on Learning Theory (COLT)}, year={2016} }
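The factored update is short enough to show directly: by the chain rule, $\nabla g(U) = (\nabla f(X) + \nabla f(X)^\top)U$ at $X = UU^\top$, so each FGD step only touches the $n \times r$ factor. A minimal numpy sketch (fixed step size and small random initialization are simplifying assumptions; the paper gives a principled step-size rule and initialization):

```python
import numpy as np

def factored_gd(grad_f, n, r, steps=2000, lr=0.01, seed=0):
    # minimize f(U U^T) over U in R^{n x r}; the chain rule gives
    # grad g(U) = (grad f(X) + grad f(X)^T) @ U  at  X = U U^T
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, r))   # small random init (illustrative)
    for _ in range(steps):
        G = grad_f(U @ U.T)
        U = U - lr * (G + G.T) @ U
    return U
```

For instance, with $f(X) = \tfrac{1}{2}\|X - M\|_F^2$ and a rank-1 PSD target $M = vv^\top$, the factored iterates recover $M$ without ever forming a projection onto the PSD cone.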

Megasthenis Asteris, Anastasios Kyrillidis, Oluwasanmi Koyejo, and Russell Poldrack, ``A simple and provable algorithm for sparse diagonal CCA”, International Conference on Machine Learning (ICML-16), 2016.
Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low-rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement, and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.
@inproceedings{asteris2016simple, title={A simple and provable algorithm for sparse diagonal CCA}, author={Asteris, Megasthenis and Kyrillidis, Anastasios and Koyejo, Oluwasanmi and Poldrack, Russell}, booktitle={International Conference on Machine Learning (ICML)}, year={2016} }

Anastasios Kyrillidis, Bubacarr Bah, Rouzbeh Hasheminezhad, Quoc Tran-Dinh, Luca Baldassarre, and Volkan Cevher, ``Convex block-sparse linear regression with expanders, provably”, AI & Statistics Conference (AISTATS-16), 2016.
Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and runtime. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have been studied only from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches in such a setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while assuring that the solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null space property. Our experimental findings on synthetic and real applications support our claims for faster recovery in the convex setting, as opposed to using dense sensing matrices, while showing competitive recovery performance.
@inproceedings{kyrillidis2016convex, title={Convex block-sparse linear regression with expanders, provably}, author={Kyrillidis, Anastasios and Bah, Bubacarr and Hasheminezhad, Rouzbeh and Tran-Dinh, Quoc and Baldassarre, Luca and Cevher, Volkan}, booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)}, year={2016} }

Hemant Tyagi, Anastasios Kyrillidis, Bernd Gärtner, and Andreas Krause, ``Learning sparse additive models with interactions in high dimensions”, AI & Statistics Conference (AISTATS-16), 2016.
A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is referred to as a Sparse Additive Model (SPAM) if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming the $\phi_l$'s and $\mathcal{S}$ to be unknown, the problem of estimating $f$ from its samples has been studied extensively. In this work, we consider a generalized SPAM, allowing for *second order* interaction terms. For some $\mathcal{S}_1 \subset [d]$ and $\mathcal{S}_2 \subset {[d] \choose 2}$, the function $f$ is assumed to be of the form: $f(\mathbf{x}) = \sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l_1, l_2) \in \mathcal{S}_2}\phi_{(l_1, l_2)} (\mathbf{x}_{(l_1, l_2)}).$ Assuming $\phi_{p}, \phi_{(l_1, l_2)}$, $\mathcal{S}_1$ and $\mathcal{S}_2$ to be unknown, we provide a randomized algorithm that queries $f$ and *exactly recovers* $\mathcal{S}_1, \mathcal{S}_2$. Consequently, this also enables us to estimate the underlying $\phi_p, \phi_{(l_1, l_2)}$. We derive sample complexity bounds for our scheme, and also extend our analysis to include the situation where the queries are corrupted with noise, either stochastic, or arbitrary but bounded. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
@inproceedings{tyagi2016learning, title={Learning Sparse Additive Models with Interactions in High Dimensions}, author={Tyagi, Hemant and Kyrillidis, Anastasios and G{\"a}rtner, Bernd and Krause, Andreas}, booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)}, year={2016} }

Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, and Alex Dimakis, ``Bipartite correlation clustering: Maximizing agreements”, AI & Statistics Conference (AISTATS-16), 2016.
In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph G with `+' and `−' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `−' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for k-BCC, a variant of BCC with an upper bound k on the number of clusters. Our algorithm outputs a k-clustering that provably achieves a number of agreements within a multiplicative (1−δ)-factor from the optimal, for any desired accuracy δ. It relies on solving a combinatorially constrained bilinear maximization on the biadjacency matrix of G. It runs in time exponential in k and 1/δ, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, a (1−δ)-approximation can be achieved with O(1/δ) clusters, regardless of the size of the graph. In turn, our k-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.
@inproceedings{asteris2016bipartite, title={Bipartite Correlation Clustering: Maximizing Agreements}, author={Asteris, Megasthenis and Kyrillidis, Anastasios and Papailiopoulos, Dimitris and Dimakis, Alexandros G}, booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)}, year={2016} }

Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, and Alex Dimakis, ``Sparse PCA via bipartite matchings”, Neural Information Processing Systems (NIPS-15), 2015.
We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to $1$ from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.
@inproceedings{asteris2015sparse, title={Sparse PCA via bipartite matchings}, author={Asteris, Megasthenis and Papailiopoulos, Dimitris and Kyrillidis, Anastasios and Dimakis, Alexandros G}, booktitle={Advances in Neural Information Processing Systems}, pages={766--774}, year={2015} }

Megasthenis Asteris, Anastasios Kyrillidis, Alex Dimakis, Han-Gyol Yi and Bharath Chandrasekaran, ``Stay on path: PCA along graph paths”, International Conference on Machine Learning (ICML-15), 2015.
We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph $G$ on $p$ vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in $G$. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.
@inproceedings{asteris2015stay, title={Stay on path: PCA along graph paths}, author={Asteris, Megasthenis and Kyrillidis, Anastasios and Dimakis, Alexandros G and Yi, Han-Gyol and Chandrasekaran, Bharath}, booktitle={Proceedings of the 32nd International Conference on Machine Learning (ICML-15)}, volume={37}, year={2015} }

Michail Vlachos, Francesco Fusco, Harry Mavroforakis, Anastasios Kyrillidis and Vassilis Vasileiadis, ``Improving Co-Cluster Quality with Application to Product Recommendations”, ACM CIKM International Conference on Information and Knowledge Management, 2014.
Businesses store an ever-increasing amount of historical customer sales data. Given the availability of such information, it is advantageous to analyze past sales, both for revealing dominant buying patterns, and for providing more targeted recommendations to clients. In this context, co-clustering has proved to be an important data-modeling primitive for revealing latent connections between two sets of entities, such as customers and products. In this work, we introduce a new algorithm for co-clustering that is both scalable and highly resilient to noise. Our method is inspired by $k$-Means and agglomerative hierarchical clustering approaches: $(i)$ first it searches for elementary co-clustering structures and $(ii)$ then combines them into a better, more compact, solution. The algorithm is flexible as it does not require an explicit number of co-clusters as input, and is directly applicable on large data graphs. We apply our methodology on real sales data to analyze and visualize the connections between clients and products. We showcase a real deployment of the system, and how it has been used for driving a recommendation engine. Finally, we demonstrate that the new methodology can discover co-clusters of better quality and relevance than state-of-the-art co-clustering techniques.
@inproceedings{vlachos2014improving, title={Improving co-cluster quality with application to product recommendations}, author={Vlachos, Michail and Fusco, Francesco and Mavroforakis, Charalambos and Kyrillidis, Anastasios and Vassiliadis, Vassilios G}, booktitle={Proceedings of the 23rd ACM International Conference on Information and Knowledge Management}, pages={679--688}, year={2014}, organization={ACM} }

Dimitris Papailiopoulos, Anastasios Kyrillidis and Christos Boutsidis, ``Provable deterministic leverage scores sampling”, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-14), 2014.
We explain theoretically a curious empirical phenomenon: “Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate”. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms state-of-the-art techniques.
@inproceedings{papailiopoulos2014provable, title={Provable deterministic leverage score sampling}, author={Papailiopoulos, Dimitris and Kyrillidis, Anastasios and Boutsidis, Christos}, booktitle={Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, pages={997--1006}, year={2014}, organization={ACM} }
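The deterministic selection rule studied here is easy to state in code: compute rank-$k$ leverage scores from the SVD and keep the columns with the largest scores (no randomness involved). A minimal numpy sketch, with illustrative function names:

```python
import numpy as np

def leverage_scores(A, k):
    # rank-k leverage score of column j: squared norm of the j-th column
    # of V_k^T, where A = U S V^T is the (thin) SVD
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0)

def deterministic_column_select(A, k, c):
    # deterministically keep the c columns with the largest rank-k leverage scores
    scores = leverage_scores(A, k)
    return np.argsort(scores)[::-1][:c]
```

When a few columns dominate the top-$k$ subspace, they are exactly the ones selected, which is the regime (steep leverage-score decay) where the paper proves deterministic selection is as accurate as randomized sampling.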

Anastasios Kyrillidis, Rabeeh Karimi Mahabadi, Quoc Tran-Dinh and Volkan Cevher, ``Scalable sparse covariance estimation via self-concordance”, AAAI Conference on Artificial Intelligence (AAAI-14), 2014.
@inproceedings{kyrillidis2014scalable, title={Scalable Sparse Covariance Estimation via Self-Concordance}, author={Kyrillidis, Anastasios and Mahabadi, Rabeeh Karimi and Tran-Dinh, Quoc and Cevher, Volkan}, booktitle={Twenty-Eighth AAAI Conference on Artificial Intelligence}, year={2014} }

Anastasios Kyrillidis, Michail Vlachos and Anastasios Zouzias, ``Approximate matrix multiplication with application to linear embeddings”, IEEE International Symposium on Information Theory (ISIT), 2014.
In this paper, we study the problem of approximately computing the product of two real matrices. In particular, we analyze a dimensionality-reduction-based approximation algorithm due to Sarlos, introducing the notion of nuclear rank as the ratio of the nuclear norm over the spectral norm. The presented bound has improved dependence on the approximation error (as compared to previous approaches), whereas the subspace on which we project the input matrices has dimensions proportional to the maximum of their nuclear rank, and is independent of the input dimensions. In addition, we provide an application of this result to linear low-dimensional embeddings. Namely, we show that any Euclidean point set with bounded nuclear rank is amenable to projection onto a number of dimensions that is independent of the input dimensionality, while achieving additive error guarantees.
@inproceedings{kyrillidis2014approximate, title={Approximate matrix multiplication with application to linear embeddings}, author={Kyrillidis, Anastasios and Vlachos, Michail and Zouzias, Anastasios}, booktitle={2014 IEEE International Symposium on Information Theory}, pages={2182--2186}, year={2014}, organization={IEEE} }
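The sketch-based scheme analyzed here has a very compact template: project both factors onto a random low-dimensional subspace and multiply the small sketches. A minimal numpy illustration of that template (a plain Gaussian sketch; sketch size and names are illustrative, and this does not reproduce the paper's nuclear-rank bound):

```python
import numpy as np

def sketched_product(A, B, k, seed=0):
    # A: m x n, B: m x p. Approximate A^T B by first applying a common
    # Gaussian sketch S (k x m, scaled by 1/sqrt(k)) to both factors.
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((k, A.shape[0])) / np.sqrt(k)
    return (S @ A).T @ (S @ B)
```

The approximation is accurate when the factors are effectively low-dimensional (small nuclear rank), since then a sketch dimension far below $m$ preserves the relevant inner products.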

Anastasios Kyrillidis and Anastasios Zouzias, ``Non-uniform feature sampling in decision tree ensembles”, IEEE ICASSP, 2014.
We study the effectiveness of non-uniform randomized feature selection in decision tree classification. We experimentally evaluate two feature selection methodologies, based on information extracted from the provided dataset: $(i)$ leverage score-based and $(ii)$ norm-based feature selection. Experimental evaluation of the proposed feature selection techniques indicates that such approaches can be more effective than naive uniform feature selection, while achieving performance comparable to the random forest algorithm.
@inproceedings{kyrillidis2014non, title={Non-uniform feature sampling for decision tree ensembles}, author={Kyrillidis, Anastasios and Zouzias, Anastasios}, booktitle={2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={4548--4552}, year={2014}, organization={IEEE} }
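The norm-based variant is simple to sketch: instead of choosing the candidate features at each split uniformly, draw them with probability proportional to (squared) column norms of the data matrix. A minimal illustration of that sampling step, decoupled from any particular tree implementation (the weighting choice here is an assumption matching the "norm-based" scheme; the leverage-score variant would swap in rank-$k$ leverage scores):

```python
import numpy as np

def sample_features(X, m, seed=0):
    # draw m distinct feature indices, with probability proportional
    # to squared column norms of the data matrix X
    rng = np.random.default_rng(seed)
    w = np.sum(X ** 2, axis=0)
    return rng.choice(X.shape[1], size=m, replace=False, p=w / w.sum())
```

A tree-growing loop would call this once per split to obtain the candidate feature subset, biasing the search toward "energetic" features.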

George Skoumas, Dieter Pfoser and Anastasios Kyrillidis, ``On quantifying qualitative geospatial data: A probabilistic approach”, ACM GEOCROWD, 2013.
Living in the era of data deluge, we have witnessed a web content explosion, largely due to the massive availability of User-Generated Content (UGC). In this work, we specifically consider the problem of geospatial information extraction and representation, where one can exploit diverse sources of information (such as image, audio, and text data), going beyond traditional volunteered geographic information. Our ambition is to include available narrative information in an effort to better explain geospatial relationships: with spatial reasoning being a basic form of human cognition, narratives expressing such experiences typically contain qualitative spatial data, i.e., spatial objects and spatial relationships. To this end, we formulate a quantitative approach for the representation of qualitative spatial relations extracted from UGC in the form of texts. The proposed method quantifies such relations based on multiple text observations. Such observations provide distance and orientation features which are utilized by a greedy Expectation Maximization-based (EM) algorithm to infer a probability distribution over predefined spatial relationships; the latter represent the quantified relationships under user-defined probabilistic assumptions. We evaluate the applicability and quality of the proposed approach using real UGC data originating from an actual travel blog text corpus. To verify the result quality, we generate grid-based “maps” visualizing the spatial extent of the various relations.
@inproceedings{skoumas2013quantifying, title={On quantifying qualitative geospatial data: a probabilistic approach}, author={Skoumas, Georgios and Pfoser, Dieter and Kyrillidis, Anastasios}, booktitle={Proceedings of the Second ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information}, pages={7178}, year={2013}, organization={ACM} }

Stephen Becker, Volkan Cevher and Anastasios Kyrillidis, ``Randomized low-memory singular value projection”, 10th International Conference on Sampling Theory and Applications (SampTA), 2013. (Authors listed in alphabetical order.)
Affine rank minimization algorithms typically rely on calculating the gradient of a data error followed by a singular value decomposition at every iteration. Because these two steps are expensive, heuristic approximations are often used to reduce computational burden. To this end, we propose a recovery scheme that merges the two steps with randomized approximations, and as a result, operates on space proportional to the degrees of freedom in the problem. We theoretically establish the estimation guarantees of the algorithm as a function of approximation tolerance. While the theoretical approximation requirements are overly pessimistic, we demonstrate that in practice the algorithm performs well on the quantum tomography recovery problem.
@inproceedings{becker2013randomized, title={Randomized Low-Memory Singular Value Projection}, author={Becker, Stephen and Cevher, Volkan and Kyrillidis, Anastasios}, booktitle={10th International Conference on Sampling Theory and Applications (SampTA)}, year={2013} }

Anastasios Kyrillidis, Stephen Becker, Volkan Cevher and Christoph Koch, ``Sparse projections onto the simplex”, International Conference on Machine Learning (ICML-13), 2013.
Most learning methods with rank or sparsity constraints use convex relaxations, which lead to optimization with the nuclear norm or the $\ell_1$-norm. However, several important learning applications cannot benefit from this approach as they feature these convex norms as constraints in addition to the nonconvex rank and sparsity constraints. In this setting, we derive efficient sparse projections onto the simplex and its extension, and illustrate how to use them to solve high-dimensional learning problems in quantum tomography, sparse density estimation and portfolio selection with nonconvex constraints.
@inproceedings{kyrillidis2013sparse, title={Sparse projections onto the simplex}, author={Kyrillidis, Anastasios and Becker, Stephen and Cevher, Volkan and Koch, Christoph}, booktitle={Proceedings of The 30th International Conference on Machine Learning}, pages={235--243}, year={2013} }
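A sparse projection onto the simplex can be sketched as a two-step procedure: keep a few largest entries, then project them onto the simplex. A minimal illustration, assuming a greedy top-$k$ selection rule (helper names are hypothetical, and this simplifies the paper's operators):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard
    sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0)

def sparse_simplex_projection(v, k):
    """Greedy sketch: keep the k largest entries of v, project them onto
    the simplex, and zero out the rest, yielding a k-sparse point on the
    simplex."""
    idx = np.argsort(v)[::-1][:k]
    w = np.zeros_like(v, dtype=float)
    w[idx] = project_simplex(v[idx])
    return w
```

The result is simultaneously $k$-sparse and on the simplex, the kind of combined constraint the abstract describes.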

Quoc Tran Dinh, Anastasios Kyrillidis and Volkan Cevher, ``A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions”, International Conference on Machine Learning (ICML-13), 2013.
We propose an algorithmic framework for convex minimization problems of composite functions with two terms: a self-concordant part and a possibly nonsmooth regularization part. Our method is a new proximal Newton algorithm with a local quadratic convergence rate. As a specific problem instance, we consider sparse precision matrix estimation problems in graph learning. Via a careful dual formulation and a novel analytic step-size selection, we instantiate an algorithm within our framework for graph learning that avoids Cholesky decompositions and matrix inversions, making it attractive for parallel and distributed implementations.
@inproceedings{dinh2013proximal, title={A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions}, author={Dinh, Quoc T and Kyrillidis, Anastasios and Cevher, Volkan}, booktitle={Proceedings of the 30th International Conference on Machine Learning (ICML-13)}, pages={271--279}, year={2013} }

Anastasios Kyrillidis and Volkan Cevher, ``Fast proximal algorithms for self-concordant minimization with application to sparse graph selection”, IEEE ICASSP, 2013.
The convex $\ell_1$-regularized log det divergence criterion has been shown to produce theoretically consistent graph learning. However, this objective function is challenging: the $\ell_1$-regularization is nonsmooth, the log det objective does not have a globally Lipschitz-continuous gradient, and the problem is high-dimensional. Using the self-concordant property of the objective, we propose a new adaptive step-size selection and present the (F)PS ((F)ast Proximal algorithms for Self-concordant functions) algorithmic framework, which has linear convergence and exhibits superior empirical results as compared to state-of-the-art first-order methods.
@inproceedings{kyrillidis2013fast, title={Fast proximal algorithms for self-concordant function minimization with application to sparse graph selection}, author={Kyrillidis, Anastasios and Cevher, Volkan}, booktitle={2013 IEEE International Conference on Acoustics, Speech and Signal Processing}, pages={6585--6589}, year={2013}, organization={IEEE} }

Anastasios Kyrillidis and Volkan Cevher, ``Matrix ALPS: Accelerated low rank and sparse matrix reconstruction”, IEEE SSP, 2012.
We propose Matrix ALPS for recovering a sparse plus low-rank decomposition of a matrix given its corrupted and incomplete linear measurements. Our approach is a first-order projected gradient method over nonconvex sets, and it exploits a well-known memory-based acceleration technique. We theoretically characterize the convergence properties of Matrix ALPS using the stable embedding properties of the linear measurement operator. We then numerically illustrate that our algorithm outperforms the existing convex as well as nonconvex state-of-the-art algorithms in computational efficiency without sacrificing stability.
@inproceedings{kyrillidis2012matrix, title={Matrix {ALPS}: {A}ccelerated low rank and sparse matrix reconstruction}, author={Kyrillidis, Anastasios and Cevher, Volkan}, booktitle={2012 IEEE Statistical Signal Processing Workshop (SSP)}, pages={185--188}, year={2012}, organization={IEEE} }

Anastasios Kyrillidis and Volkan Cevher, ``Combinatorial selection and least absolute shrinkage via the CLASH algorithm”, IEEE ISIT, 2012.
The least absolute shrinkage and selection operator (LASSO) for linear regression exploits the geometric interplay of the $\ell_2$ data-error objective and the $\ell_1$-norm constraint to arbitrarily select sparse models. Guiding this uninformed selection process with sparsity models has been precisely the center of attention over the last decade, in order to improve learning performance. To this end, we alter the selection process of LASSO to explicitly leverage combinatorial sparsity models (CSMs) via the combinatorial selection and least absolute shrinkage (CLASH) operator. We provide concrete guidelines on how to leverage combinatorial constraints within CLASH, and characterize CLASH’s guarantees as a function of the set-restricted isometry constants of the sensing matrix. Finally, our experimental results show that CLASH can outperform both LASSO and model-based compressive sensing in sparse estimation.
@inproceedings{kyrillidis2012combinatorial, title={Combinatorial selection and least absolute shrinkage via the {CLASH} algorithm}, author={Kyrillidis, Anastasios and Cevher, Volkan}, booktitle={2012 IEEE International Symposium on Information Theory (ISIT)}, pages={2216--2220}, year={2012}, organization={IEEE} }

Anastasios Kyrillidis, Gilles Puy and Volkan Cevher, ``Hard thresholding with norm constraints”, IEEE ICASSP, 2012.
We introduce a new sparse recovery paradigm, called Normed Pursuits, where efficient algorithms from combinatorial and convex optimization interface for interpretable and model-based solutions. Synthetic and real data experiments illustrate that Normed Pursuits can significantly enhance the performance of both hard thresholding methods and convex solvers in sparse recovery.
@inproceedings{kyrillidis2012hard, title={Hard thresholding with norm constraints}, author={Kyrillidis, Anastasios and Puy, Gilles and Cevher, Volkan}, booktitle={2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={3645--3648}, year={2012}, organization={IEEE} }

Anastasios Kyrillidis and Volkan Cevher, ``Recipes on hard thresholding methods”, 4th IEEE CAMSAP, 2011.
Compressive sensing (CS) is a data acquisition and recovery technique for finding sparse solutions to linear inverse problems from sub-Nyquist measurements. CS features a wide range of computationally efficient and robust signal recovery methods, based on sparsity-seeking optimization. In this paper, we present and analyze a class of sparse recovery algorithms, known as hard thresholding methods. We provide optimal strategies on how to set up these algorithms via basic “ingredients” for different configurations to achieve complexity vs. accuracy trade-offs. Simulation results demonstrate notable performance improvements compared to state-of-the-art algorithms both in terms of data reconstruction and computational complexity.
@inproceedings{kyrillidis2011recipes, title={Recipes on hard thresholding methods}, author={Kyrillidis, Anastasios and Cevher, Volkan}, booktitle={4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP)}, year={2011} }
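The basic hard thresholding recursion that this line of work builds on (a gradient step on the least-squares objective, followed by projection onto $k$-sparse vectors) can be sketched as follows; the step-size choice and iteration count are illustrative assumptions, not the paper's tuned “recipes”:

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[::-1][:k]
    out[idx] = x[idx]
    return out

def iht(A, y, k, steps=500, step_size=None):
    """Minimal iterative hard thresholding (IHT) sketch for y = A x with
    x assumed k-sparse: gradient descent on 0.5*||y - A x||^2 interleaved
    with projection onto the set of k-sparse vectors."""
    _, n = A.shape
    if step_size is None:
        # Conservative step size from the spectral norm of A.
        step_size = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(steps):
        x = hard_threshold(x + step_size * A.T @ (y - A @ x), k)
    return x
```

The different “ingredients” the abstract refers to (step-size rules, memory/acceleration, projection variants) all slot into this template.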

Anastasios Kyrillidis and George N. Karystinos, ``Rank-deficient quadratic-form maximization over M-phase alphabet: Polynomial-complexity solvability and algorithmic developments”, IEEE ICASSP, 2011.
The maximization of a positive (semi)definite complex quadratic form over a finite alphabet is NP-hard and achieved through exhaustive search when the form has full rank. However, if the form is rank-deficient, the optimal solution can be computed with only polynomial complexity in the length $N$ of the maximizing vector. In this work, we consider the general case of a rank-$D$ positive (semi)definite complex quadratic form and develop a method that maximizes the form with respect to an $M$-phase vector with polynomial complexity. The proposed method efficiently reduces the size of the feasible set from exponential to polynomial. We also develop an algorithm that constructs the polynomial-size candidate set in polynomial time and observe that it is fully parallelizable and rank-scalable.
@inproceedings{kyrillidis2011rank, title={Rank-deficient quadratic-form maximization over $M$-phase alphabet: Polynomial-complexity solvability and algorithmic developments}, author={Kyrillidis, Anastasios and Karystinos, George}, booktitle={2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={3856--3859}, year={2011}, organization={IEEE} }
Journals

Junhyung Lyle Kim, George Kollias, Amir Kalev, Ken X Wei, Anastasios Kyrillidis, ``Fast quantum state reconstruction via accelerated nonconvex programming”, Photonics, 10(2), MDPI, 2023.
We propose a new quantum state reconstruction method that combines ideas from compressed sensing, nonconvex optimization, and acceleration methods. The algorithm, called Momentum-Inspired Factored Gradient Descent (MiFGD), extends the applicability of quantum tomography for larger systems. Despite being a nonconvex method, MiFGD converges provably close to the true density matrix at an accelerated linear rate asymptotically, in the absence of experimental and statistical noise, under common assumptions. With this manuscript, we present the method, prove its convergence property, and provide Frobenius-norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm’s performance against other existing methods, in both synthetic and real (noisy) experiments performed on IBM’s quantum processing unit. We find that the proposed algorithm performs orders of magnitude faster than state-of-the-art approaches, with similar or better accuracy. In both synthetic and real experiments, we observed accurate and robust reconstruction, despite the presence of experimental and statistical noise in the tomographic data. Finally, we provide a ready-to-use code for state tomography of multi-qubit systems.
@article{kim2023fast, title={Fast quantum state reconstruction via accelerated nonconvex programming}, author={Kim, Junhyung Lyle and Kollias, George and Kalev, Amir and Wei, Ken X and Kyrillidis, Anastasios}, journal={Photonics}, volume={10}, number={2}, pages={116}, year={2023}, publisher={MDPI} }

Cameron R Wolfe, Jingkang Yang, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, Anastasios Kyrillidis, ``GIST: Distributed Training for Large-Scale Graph Convolutional Networks”, Journal of Applied and Computational Topology, Special issue on Data Science on Graphs, Springer, 2023.
The graph convolutional network (GCN) is a go-to solution for machine learning on graphs, but its training is notoriously difficult to scale, both in terms of graph size and the number of model parameters. Although some work has explored training on large-scale graphs (e.g., GraphSAGE, ClusterGCN, etc.), we pioneer efficient training of large-scale GCN models (i.e., ultra-wide, overparameterized models) with the proposal of a novel, distributed training framework. Our proposed training methodology, called GIST, disjointly partitions the parameters of a GCN model into several, smaller sub-GCNs that are trained independently and in parallel. In addition to being compatible with all GCN architectures and existing sampling techniques for efficient GCN training, GIST i) improves model performance, ii) scales to training on arbitrarily large graphs, iii) decreases wall-clock training time, and iv) enables the training of markedly overparameterized GCN models. Remarkably, with GIST, we train an astonishingly wide 32,768-dimensional GraphSAGE model, which exceeds the capacity of a single GPU by a factor of 8x, to SOTA performance on the Amazon2M dataset.
@article{wolfe2021gist, title={GIST: Distributed training for large-scale graph convolutional networks}, author={Wolfe, Cameron R and Yang, Jingkang and Chowdhury, Arindam and Dun, Chen and Bayer, Artun and Segarra, Santiago and Kyrillidis, Anastasios}, journal={Journal of Applied and Computational Topology}, year={2023}, publisher={Springer} }

Fangshuo Liao, Anastasios Kyrillidis, ``On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons”, Transactions on Machine Learning Research (TMLR), 2022.
With the motive of training all the parameters of a neural network, we study why and when one can achieve this by iteratively creating, training, and combining randomly selected subnetworks. Such scenarios have either implicitly or explicitly emerged in the recent literature: see e.g., the Dropout family of regularization techniques, or some distributed ML training protocols that reduce communication/computation complexities, such as the Independent Subnet Training protocol. While these methods are studied empirically and utilized in practice, they often enjoy partial or no theoretical support, especially when applied on neural network-based objectives. In this manuscript, our focus is on overparameterized single hidden layer neural networks with ReLU activations in the lazy training regime. By carefully analyzing i) the subnetworks’ neural tangent kernel, ii) the surrogate functions’ gradient, and iii) how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error (up to a neighborhood around the optimal point) for an overparameterized single-hidden-layer perceptron with a regression loss. Our analysis reveals a dependency of the size of the neighborhood around the optimal point on the number of surrogate models and the number of local training steps for each selected subnetwork. Moreover, the considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide convergence results as corollaries of our main theorem.
@article{liao2021convergence, title={On the convergence of shallow neural network training with randomly masked neurons}, author={Liao, Fangshuo and Kyrillidis, Anastasios}, journal={Transactions on Machine Learning Research}, year={2022} }

Junhyung Lyle Kim, Mohammad Taha Toghani, Cesar A. Uribe, Anastasios Kyrillidis, ``Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography”, IEEE Control Systems Letters (L-CSS), 2022.
We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterize the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sublinear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.
@article{kim2022local, title={Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography}, author={Kim, Junhyung Lyle and Toghani, Mohammad Taha and Uribe, C{\'e}sar A and Kyrillidis, Anastasios}, journal={IEEE Control Systems Letters}, year={2022}, publisher={IEEE} }

Anastasios Kyrillidis, Anshumali Shrivastava, Moshe Y Vardi, Zhiwei Zhang, ``Solving hybrid Boolean constraints in continuous space via multilinear Fourier expansions”, Artificial Intelligence, Elsevier, 2021.
The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side—especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers—has been remarkable. Yet, while SAT solvers, aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF), have become quite mature, SAT solvers that are effective on other types of constraints (e.g., cardinality constraints and XORs) are less well-studied; a general approach to handling non-CNF constraints is still lacking. To address the issue above, we design FourierSAT, an incomplete SAT solver based on Fourier Analysis (also known as the Walsh-Fourier Transform) of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. We show this reduction enjoys interesting theoretical properties. Empirical results demonstrate that FourierSAT can be a useful complement to other solvers on certain classes of benchmarks.
@article{kyrillidis2021solving, title={Solving hybrid Boolean constraints in continuous space via multilinear Fourier expansions}, author={Kyrillidis, Anastasios and Shrivastava, Anshumali and Vardi, Moshe Y and Zhang, Zhiwei}, journal={Artificial Intelligence}, volume={299}, pages={103559}, year={2021}, publisher={Elsevier} }
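To illustrate the reduction just described, here is a toy sketch that encodes CNF clauses as multilinear penalties over $[-1, 1]^n$ and minimizes their sum by gradient descent; the numerical gradients, initialization, and rounding are simplifications for brevity, not FourierSAT's actual implementation:

```python
import numpy as np

def clause_penalty(x, clause):
    """Multilinear penalty for a CNF clause over x in [-1, 1]^n.

    clause is a list of signed 1-based literals (e.g., [1, -3] means
    x1 OR NOT x3). The product equals 1 exactly when every literal
    evaluates to False (-1 after the sign flip) and 0 at any
    satisfying Boolean point.
    """
    terms = [(1 - np.sign(l) * x[abs(l) - 1]) / 2 for l in clause]
    return np.prod(terms)

def solve(clauses, n, steps=2000, lr=0.1, seed=0):
    """Toy continuous-relaxation solver in the spirit of the paper:
    minimize the summed clause penalties inside the cube [-1, 1]^n,
    then round to a Boolean (+/-1) assignment."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.5, 0.5, n)
    obj = lambda z: sum(clause_penalty(z, c) for c in clauses)
    eps = 1e-6
    for _ in range(steps):
        # Central finite differences stand in for the analytic gradient.
        g = np.array([(obj(x + eps * e) - obj(x - eps * e)) / (2 * eps)
                      for e in np.eye(n)])
        x = np.clip(x - lr * g, -1, 1)
    return np.sign(x + 1e-12)  # break ties toward +1
```

The objective is zero exactly on satisfying assignments, so any zero-penalty rounding is a model of the formula.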

Amir Kalev, Anastasios Kyrillidis, and Norbert M Linke, ``Validating and certifying stabilizer states”, Physical Review A, 2019.
We propose a measurement scheme that validates the preparation of an $n$-qubit stabilizer state. The scheme involves a measurement of $n$ Pauli observables, determined a priori from the stabilizer state, which can be realized using single-qubit gates. Based on the proposed validation scheme, we derive an explicit expression for the worst-case fidelity, i.e., the minimum fidelity between the stabilizer state and any other state consistent with the measured data. We also show that the worst-case fidelity can be certified, with high probability, using $O(n^2)$ copies of the state.
@article{PhysRevA.99.042337, title = {Validating and certifying stabilizer states}, author = {Kalev, Amir and Kyrillidis, Anastasios and Linke, Norbert M.}, journal = {Phys. Rev. A}, volume = {99}, issue = {4}, pages = {042337}, numpages = {6}, year = {2019}, month = {Apr}, publisher = {American Physical Society} }

Ya-Ping Hsieh, Yu-Chun Kao, Rabeeh Karimi Mahabadi, Alp Yurtsever, Anastasios Kyrillidis, Volkan Cevher, ``A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization”, IEEE Transactions on Signal Processing, 2018.
We study convex optimization problems that feature low-rank matrix solutions. In such scenarios, nonconvex methods offer significant advantages over convex methods due to their lower space complexity, as well as faster convergence in practice. Under mild assumptions, these methods feature global convergence guarantees. In this paper, we extend the results on this matter by following a different path. We derive a non-Euclidean optimization framework in the nonconvex setting that takes nonlinear gradient steps on the factors. Our framework enables the possibility to further exploit the underlying problem structures, such as sparsity or low-rankness on the factorized domain, or better dimensional dependence of the smoothness parameters of the objectives. We prove that the non-Euclidean methods enjoy the same rigorous guarantees as their Euclidean counterparts under appropriate assumptions. Numerical evidence with Fourier ptychography and FastText applications, using real data, shows that our approach can enhance solution quality, as well as convergence speed, over the standard nonconvex approaches.
@article{hsieh2018non, title={A non-Euclidean gradient descent framework for nonconvex matrix factorization}, author={Hsieh, Ya-Ping and Kao, Yu-Chun and Mahabadi, Rabeeh Karimi and Yurtsever, Alp and Kyrillidis, Anastasios and Cevher, Volkan}, journal={IEEE Transactions on Signal Processing}, volume={66}, number={22}, pages={5917--5926}, year={2018}, publisher={IEEE} }

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi, ``Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably”, SIAM Journal on Imaging Sciences, 2018.
A rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ can be written as a product $U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. One could exploit this observation in optimization: e.g., consider the minimization of a convex function $f(X)$ over rank-$r$ matrices, where the set of low-rank matrices is modeled via $UV^\top$. Though such parameterization reduces the number of variables and is more computationally efficient (of particular interest is the case $r \ll \min\{m, n\}$), it comes at a cost: $f(U V^\top)$ becomes a nonconvex function w.r.t. $U$ and $V$. We study such parameterization on generic convex objectives $f$ and focus on first-order, gradient descent algorithms. We propose the bi-factored gradient descent (BFGD) algorithm, an efficient first-order method that operates directly on the $U, V$ factors. We show that when $f$ is (restricted) smooth, BFGD has local sublinear convergence; when $f$ is both (restricted) smooth and (restricted) strongly convex, it has local linear convergence. For several applications, we provide simple and efficient initialization schemes that provide initial conditions good enough for the above convergence results to hold, globally. Extensive experimental results support our arguments that BFGD is an efficient and accurate nonconvex method, compared to state-of-the-art approaches.
@article{park2018finding, title={Finding lowrank solutions via nonconvex matrix factorization, efficiently and provably}, author={Park, Dohyung and Kyrillidis, Anastasios and Caramanis, Constantine and Sanghavi, Sujay}, journal={SIAM Journal on Imaging Sciences}, volume={11}, number={4}, pages={2165--2204}, year={2018}, publisher={SIAM} }
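The two-factor gradient recursion described above (gradient steps taken directly on the $U, V$ factors) can be sketched for the simple smooth objective $f(X) = \tfrac{1}{2}\|X - M\|_F^2$; the initialization and step size below are illustrative assumptions, not BFGD's prescribed choices:

```python
import numpy as np

def factored_gd_lowrank(M, r, steps=2000, lr=0.05, seed=0):
    """Sketch of gradient descent on the factors U, V for
    f(X) = 0.5 * ||X - M||_F^2 restricted to rank-r X = U V^T.

    The updates follow the generic two-factor recursion
        U <- U - lr * grad_f(U V^T) @ V,
        V <- V - lr * grad_f(U V^T)^T @ U;
    the balancing regularizer and tuned step size of the actual BFGD
    method are simplified away.
    """
    m, n = M.shape
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    for _ in range(steps):
        G = U @ V.T - M          # gradient of f at X = U V^T
        U, V = U - lr * G @ V, V - lr * G.T @ U
    return U @ V.T
```

For a generic convex $f$, only the line computing `G` changes; the factor updates stay the same, which is what makes the parameterization attractive.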

Anastasios Kyrillidis, Amir Kalev, Dohyung Park, Srinadh Bhojanapalli, Constantine Caramanis and Sujay Sanghavi, ``Provable compressed sensing quantum state tomography via nonconvex methods”, npj Quantum Information, 2018.
With quantum processors steadily growing in size, it is necessary to develop new quantum tomography tools that are tailored for high-dimensional systems. In this work, we describe such a computational tool, based on recent ideas from nonconvex optimization. The algorithm excels in the compressed sensing setting, where only a few data points are measured from a low-rank or highly pure quantum state of a high-dimensional system. We show that the algorithm can practically be used in quantum tomography problems that are beyond the reach of convex solvers, and, moreover, is faster and more accurate than other state-of-the-art nonconvex approaches. Crucially, we prove that, despite being a nonconvex program, under mild conditions, the algorithm is guaranteed to converge to the global minimum of the quantum state tomography problem; thus, it constitutes a provable quantum state tomography protocol.
@article{kyrillidis2018provable, title={Provable compressed sensing quantum state tomography via nonconvex methods}, author={Kyrillidis, Anastasios and Kalev, Amir and Park, Dohyung and Bhojanapalli, Srinadh and Caramanis, Constantine and Sanghavi, Sujay}, journal={npj Quantum Information}, volume={4}, number={1}, pages={36}, year={2018}, publisher={Nature Publishing Group} }

Quoc Tran Dinh, Anastasios Kyrillidis, and Volkan Cevher, ``A single-phase, proximal path-following framework”, Mathematics of Operations Research, 2017.
We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear—and possibly nonsmooth—objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we reparameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. We prove that our scheme has the same $O(\sqrt{\nu} \log(1/\varepsilon))$ worst-case iteration complexity as standard approaches [34, 38], without requiring an initial phase, where $\nu$ is the barrier parameter and $\varepsilon$ is a desired accuracy. Our framework allows errors in the calculation of proximal Newton directions, without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.
@article{tran2016single, title={A single-phase, proximal path-following framework}, author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan}, journal={Mathematics of Operations Research}, year={2017} }

Hemant Tyagi, Anastasios Kyrillidis, Bernd Gärtner, and Andreas Krause, ``Algorithms for Learning Sparse Additive Models with Interactions in High Dimensions”, Information and Inference: A Journal of the IMA, 2017.
A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a Sparse Additive Model (SPAM) if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming the $\phi_l$'s and $\mathcal{S}$ to be unknown, there exists extensive work for estimating $f$ from its samples. In this work, we consider a generalized version of SPAMs that also allows for the presence of a sparse number of second order interaction terms. For some $\mathcal{S}_1 \subset [d], \mathcal{S}_2 \subset {[d] \choose 2}$, with $|\mathcal{S}_1| \ll d, |\mathcal{S}_2| \ll d^2$, the function $f$ is now assumed to be of the form: $\sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in \mathcal{S}_2}\phi_{(l,l^{\prime})} (x_l,x_{l^{\prime}})$. Assuming we have the freedom to query $f$ anywhere in its domain, we derive efficient algorithms that provably recover $\mathcal{S}_1,\mathcal{S}_2$ with finite sample bounds. Our analysis covers the noiseless setting, where exact samples of $f$ are obtained, and also extends to the noisy setting, where the queries are corrupted with noise. For the noisy setting in particular, we consider two noise models, namely: i.i.d. Gaussian noise and arbitrary but bounded noise. Our main methods for identification of $\mathcal{S}_2$ essentially rely on estimation of sparse Hessian matrices, for which we provide two novel compressed sensing-based schemes. Once $\mathcal{S}_1, \mathcal{S}_2$ are known, we show how the individual components $\phi_p$, $\phi_{(l,l^{\prime})}$ can be estimated via additional queries of $f$, with uniform error bounds. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
@article{tyagi2017algorithms, title={Algorithms for Learning Sparse Additive Models with Interactions in High Dimensions}, author={Tyagi, Hemant and Kyrillidis, Anastasios and G{\"a}rtner, Bernd and Krause, Andreas}, journal={Information and Inference: A Journal of the IMA}, year={2017} }

Luca Baldassarre, Nirav Bhan, Volkan Cevher, Anastasios Kyrillidis and Siddhartha Satpathi, ``Groupsparse model selection: Hardness and relaxations”, IEEE Trans. on Information Theory, 2016. (Authors listed in alphabetical order.)
Group-based sparsity models are instrumental in linear and nonlinear regression problems. The main premise of these models is the recovery of “interpretable” signals through the identification of their constituent groups, which can also provably translate into substantial savings in the number of measurements for linear models in compressive sensing. In this paper, we establish a combinatorial framework for group-model selection problems and highlight the underlying tractability issues. In particular, we show that the group-model selection problem is equivalent to the well-known NP-hard weighted maximum coverage problem (WMC). Leveraging a graph-based understanding of group models, we describe group structures which enable correct model selection in polynomial time via dynamic programming. Furthermore, we show that popular group structures can be explained by linear inequalities involving totally unimodular matrices, which afford other polynomial-time algorithms based on relaxations. We also present a generalization of the group model that allows for within-group sparsity, which can be used to model hierarchical sparsity. Finally, we study the Pareto frontier between approximation error and sparsity budget of group-sparse approximations for two tractable models, including the tree-sparsity model, and illustrate selection and computation trade-offs between our framework and the existing convex relaxations.
@article{baldassarre2013group, title={Group-sparse model selection: Hardness and relaxations}, author={Baldassarre, Luca and Bhan, Nirav and Cevher, Volkan and Kyrillidis, Anastasios and Satpathi, Siddhartha}, journal={IEEE Transactions on Information Theory}, year={2016}, publisher={IEEE} }

Georgios Skoumas, Dieter Pfoser, Anastasios Kyrillidis and Timos Sellis, ``Location estimation using crowdsourced spatial relations”, ACM Transactions on Spatial Algorithms and Systems, vol. 2, issue 2, 2016.
The “crowd” has become a very important geospatial data provider. Specifically, nonexpert users have been providing a wealth of quantitative geospatial data (e.g., geotagged tweets or photos, online). With spatial reasoning being a basic form of human cognition, textual narratives expressing user travel experiences (e.g., travel blogs) would provide an even bigger source of geospatial data. Narratives typically contain qualitative geospatial data in the form of objects and spatial relations (e.g., “St. John’s church is to the North of the Acropolis museum.” The scope of this work is (i) to extract these spatial relations from textual narratives, (ii) to quantify (model) them, and (iii) to reason about object locations based only on the quantified spatial relations. We use information extraction methods to identify toponyms and spatial relations, and we formulate a quantitative approach based on distance and orientation features to represent the latter. Probability density functions (PDFs) for spatial relations are determined by means of a greedy expectation maximization (EM)based algorithm. These PDFs are then used to estimate unknown object locations. Experiments using a text corpus harvested from travel blog sites establish the considerable location estimation accuracy of the proposed approach on synthetic and realworld scenarios.
@article{skoumas2016location, author = {Skoumas, Georgios and Pfoser, Dieter and Kyrillidis, Anastasios and Sellis, Timos}, title = {Location Estimation Using Crowdsourced Spatial Relations}, journal = {ACM Trans. Spatial Algorithms Syst.}, volume = {2}, number = {2}, year = {2016}, pages = {5:1--5:23}, publisher = {ACM}, }

Quoc Tran-Dinh, Anastasios Kyrillidis and Volkan Cevher, ``Composite self-concordant minimization”, Journal of Machine Learning Research, 16(Mar):371–416, 2015.
We propose a variable metric framework for minimizing the sum of a self-concordant function and a possibly nonsmooth convex function, endowed with an easily computable proximal operator. We theoretically establish the convergence of our framework without relying on the usual Lipschitz gradient assumption on the smooth part. An important highlight of our work is a new set of analytic step-size selection and correction procedures based on the structure of the problem. We describe concrete algorithmic instances of our framework for several interesting applications and demonstrate them numerically on both synthetic and real data.
@article{tran2015composite, title={Composite Self-Concordant Minimization}, author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan}, journal={Journal of Machine Learning Research}, volume={16}, pages={371--416}, year={2015} }

Michail Vlachos, Nikolaos Freris and Anastasios Kyrillidis, ``Compressive mining: fast and optimal data mining in the compressed domain”, Very Large Data Bases (VLDB) Journal, Volume 24 Issue 1, February 2015.
Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier and wavelets). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the Euclidean norm or the correlation. We show that typical data analysis operations, such as nearest-neighbor search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
@article{vlachos2015compressive, title={Compressive mining: fast and optimal data mining in the compressed domain}, author={Vlachos, Michail and Freris, Nikolaos M and Kyrillidis, Anastasios}, journal={The VLDB Journal—The International Journal on Very Large Data Bases}, volume={24}, number={1}, pages={1--24}, year={2015}, publisher={Springer-Verlag New York, Inc.} }
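For intuition on why orthonormal coefficients yield distance bounds, here is a minimal sketch of the simple lower bound obtained when both sequences keep the same coefficient set. The paper's optimization-based estimator is stronger (it handles a different coefficient set per object); this toy version only illustrates the Parseval argument underneath.

```python
import numpy as np

def lower_bound_distance(x, y, k):
    """Lower-bound ||x - y||_2 from k shared orthonormal DFT coefficients.

    With norm="ortho" the DFT is unitary, so Euclidean distances are
    preserved (Parseval); dropping coordinates can only shrink the norm,
    hence the result never exceeds the true distance.
    """
    X = np.fft.fft(x, norm="ortho")
    Y = np.fft.fft(y, norm="ortho")
    # Keep the k coefficients with the largest combined energy.
    idx = np.argsort(np.abs(X) ** 2 + np.abs(Y) ** 2)[-k:]
    return np.linalg.norm(X[idx] - Y[idx])
```

When k equals the signal length, the bound is tight and recovers the exact distance; the interesting regime studied in the paper is k much smaller than the signal length, with per-object coefficient choices.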

Quoc Tran-Dinh, Anastasios Kyrillidis and Volkan Cevher, ``An inexact proximal path-following algorithm for constrained convex minimization”, SIAM Journal on Optimization (SIOPT), vol. 24, num. 4, pp. 1718–1745, 2014.
Many scientific and engineering applications feature nonsmooth convex minimization problems over convex sets. In this paper, we address an important instance of this broad class where we assume that the nonsmooth objective is equipped with a tractable proximity operator and that the convex constraint set affords a self-concordant barrier. We provide a new joint treatment of proximal and self-concordant barrier concepts and illustrate that such problems can be efficiently solved, without the need of lifting the problem dimensions, as in the disciplined convex optimization approach. We propose an inexact path-following algorithmic framework and theoretically characterize the worst-case analytical complexity of this framework when the proximal subproblems are solved inexactly. To show the merits of our framework, we apply its instances to both synthetic and real-world applications, where it shows advantages over standard interior point methods. As a byproduct, we describe how our framework can obtain points on the Pareto frontier of regularized problems with self-concordant objectives in a tuning-free fashion.
@article{tran2014inexact, title={An inexact proximal path-following algorithm for constrained convex minimization}, author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan}, journal={SIAM Journal on Optimization}, volume={24}, number={4}, pages={1718--1745}, year={2014}, publisher={SIAM} }

Anastasios Kyrillidis and George N. Karystinos, ``Fixed-rank Rayleigh quotient maximization by an M-PSK sequence”, IEEE Trans. on Communications, vol. 62, issue 3, pages 961–975, 2014.
Certain optimization problems in communication systems, such as limited-feedback constant-envelope beamforming or noncoherent M-ary phase-shift keying (M-PSK) sequence detection, result in the maximization of a fixed-rank positive semidefinite quadratic form over the M-PSK alphabet. This form is a special case of the Rayleigh quotient of a matrix and, in general, its maximization by an M-PSK sequence is NP-hard. However, if the rank of the matrix is not a function of its size, then the optimal solution can be computed with polynomial complexity in the matrix size. In this work, we develop a new technique to efficiently solve this problem by utilizing auxiliary continuous-valued angles and partitioning the resulting continuous space of solutions into a polynomial-size set of regions, each of which corresponds to a distinct M-PSK sequence. The sequence that maximizes the Rayleigh quotient is shown to belong to this polynomial-size set of sequences, thus efficiently reducing the size of the feasible set from exponential to polynomial. Based on this analysis, we also develop an algorithm that constructs this set in polynomial time and show that it is fully parallelizable, memory efficient, and rank scalable. The proposed algorithm compares favorably with other solvers for this problem that have appeared recently in the literature.
@article{kyrillidis2014fixed, title={Fixed-rank Rayleigh quotient maximization by an {M-PSK} sequence}, author={Kyrillidis, Anastasios and Karystinos, George N}, journal={Communications, IEEE Transactions on}, volume={62}, number={3}, pages={961--975}, year={2014}, publisher={IEEE} }
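To make the problem concrete, here is an exhaustive baseline for the quadratic-form maximization over the M-PSK alphabet. It is exponential in the sequence length, so it is only a reference point for tiny instances; it is emphatically not the polynomial-time algorithm developed in the paper.

```python
import itertools
import numpy as np

def mpsk_rayleigh_bruteforce(A, M):
    """Exhaustively maximize Re(s^H A s) over length-n M-PSK sequences.

    A: (n, n) Hermitian positive semidefinite matrix.
    M: alphabet size; entries of s are M-th roots of unity.
    Complexity is O(M^n) -- a baseline for very small n only.
    """
    n = A.shape[0]
    alphabet = np.exp(2j * np.pi * np.arange(M) / M)
    best_val, best_s = -np.inf, None
    for combo in itertools.product(alphabet, repeat=n):
        s = np.array(combo)
        val = np.real(s.conj() @ A @ s)
        if val > best_val:
            best_val, best_s = val, s
    return best_s, best_val
```

For a rank-1 form A = v v^H this reduces to maximizing |s^H v|^2, the case that arises in noncoherent M-PSK detection.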

Anastasios Kyrillidis and Volkan Cevher, ``Matrix recipes for hard thresholding methods”, Journal of Mathematical Imaging and Vision (JMIV), April 2013, Springer.
In this paper, we present and analyze a new set of low-rank recovery algorithms for linear inverse problems within the class of hard thresholding methods. We provide strategies on how to set up these algorithms via basic ingredients for different configurations to achieve complexity vs. accuracy tradeoffs. Moreover, we study acceleration schemes via memory-based techniques and randomized, ϵ-approximate matrix projections to decrease the computational costs in the recovery process. For most of the configurations, we present theoretical analysis that guarantees convergence under mild problem conditions. Simulation results demonstrate notable performance improvements as compared to state-of-the-art algorithms both in terms of reconstruction accuracy and computational complexity.
@article{kyrillidis2014matrix, title={Matrix recipes for hard thresholding methods}, author={Kyrillidis, Anastasios and Cevher, Volkan}, journal={Journal of mathematical imaging and vision}, volume={48}, number={2}, pages={235--265}, year={2014}, publisher={Springer} }
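A bare-bones sketch of one such hard-thresholding recursion: a plain matrix IHT step, i.e., a gradient step on the least-squares objective followed by an SVD projection onto the rank-r set. The paper's recipes add memory, approximate projections, and adaptive step-size rules on top of this skeleton; the sensing-operator representation below (a matrix acting on vec(X)) is an assumption for illustration.

```python
import numpy as np

def svd_hard_threshold(X, r):
    """Project X onto the set of matrices of rank at most r (best rank-r SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def matrix_iht(A, b, shape, r, mu=None, iters=200):
    """Recover a rank-r matrix X from measurements b = A @ vec(X).

    A: (m, n1*n2) sensing matrix; shape: (n1, n2); mu: step size.
    """
    n1, n2 = shape
    if mu is None:
        # Conservative step size from the spectral norm of A.
        mu = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((n1, n2))
    for _ in range(iters):
        grad = (A.T @ (A @ X.ravel() - b)).reshape(n1, n2)
        X = svd_hard_threshold(X - mu * grad, r)
    return X
```

Each iterate is exactly rank-r by construction; under standard restricted-isometry-type conditions on A, recursions of this form converge linearly to the true low-rank matrix.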

Nikolaos D. Sidiropoulos and Anastasios Kyrillidis, ``Multiway compressed sensing for sparse low-rank tensors”, IEEE Signal Processing Letters, 19(11):757–760, Oct. 2012.
For linear models, compressed sensing theory and methods enable recovery of sparse signals of interest from few measurements: on the order of the number of nonzero entries, as opposed to the length of the signal of interest. Results of similar flavor have more recently emerged for bilinear models, but no results are available for multilinear models of tensor data. In this contribution, we consider compressed sensing for sparse and low-rank tensors. More specifically, we consider low-rank tensors synthesized as sums of outer products of sparse loading vectors, and a special class of linear dimensionality-reducing transformations that reduce each mode individually. We prove interesting “oracle” properties showing that it is possible to identify the uncompressed sparse loadings directly from the compressed tensor data. The proofs naturally suggest a two-step recovery process: fitting a low-rank model in the compressed domain, followed by per-mode $\ell_0$ / $\ell_1$ decompression. This two-step process is also appealing from a computational complexity and memory capacity point of view, especially for big tensor datasets.
@article{sidiropoulos2012multi, title={Multi-way compressed sensing for sparse low-rank tensors}, author={Sidiropoulos, Nicholas D and Kyrillidis, Anastasios}, journal={Signal Processing Letters, IEEE}, volume={19}, number={11}, pages={757--760}, year={2012}, publisher={IEEE} }
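The multilinear measurement structure (one dimensionality-reducing matrix per mode) can be sketched as a Tucker-style mode product. For a rank-1 tensor, the compressed tensor is again rank-1 with compressed loadings, which is the structure the identifiability results exploit. This is a toy illustration of the compression step only, not the recovery algorithm.

```python
import numpy as np

def mode_compress(T, mats):
    """Compress a tensor mode-by-mode: apply mats[k] along mode k.

    T: ndarray with T.ndim == len(mats); mats[k] has shape (m_k, n_k)
    where n_k = T.shape[k]. Returns an (m_0, ..., m_{K-1}) tensor.
    """
    for mode, M in enumerate(mats):
        # Contract M's columns against the current mode, then restore order.
        T = np.tensordot(M, T, axes=(1, mode))
        T = np.moveaxis(T, 0, mode)
    return T
```

For a rank-1 tensor T = a ∘ b ∘ c, compressing with (U, V, W) yields exactly (Ua) ∘ (Vb) ∘ (Wc), so the sparse loadings survive compression up to the per-mode linear maps and can be decompressed mode by mode.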
Book chapters

Volkan Cevher, Sina Jafarpour and Anastasios Kyrillidis, ``Linear inverse problems with norm and sparsity constraints”, in Practical Applications of Sparse Modeling, Sept. 2014, MIT Press, (Authors listed in alphabetical order).
We describe two non-conventional algorithms for linear regression, called GAME and CLASH. The salient characteristic of these approaches is that they exploit the convex $\ell_1$-ball and non-convex $\ell_0$-sparsity constraints jointly in sparse recovery. To establish the theoretical approximation guarantees of GAME and CLASH, we cover an interesting range of topics from game theory and convex and combinatorial optimization. We illustrate that these approaches lead to improved theoretical guarantees and empirical performance beyond convex and non-convex solvers alone.
@incollection{cevher2014linear, title={Linear inverse problems with norm and sparsity constraints}, author={Cevher, Volkan and Jafarpour, Sina and Kyrillidis, Anastasios}, booktitle={Practical Applications of Sparse Modeling}, pages={179}, year={2014}, publisher={MIT Press} }
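A small sketch of the joint constraint this chapter revolves around: keep the k largest-magnitude entries (the $\ell_0$ side), then shrink them into the $\ell_1$-ball via the standard sorting-based projection (Duchi et al. style). This heuristic composition is for illustration only; it is not the CLASH combinatorial projection itself.

```python
import numpy as np

def project_l1(v, tau):
    """Euclidean projection of v onto the l1-ball of radius tau."""
    if np.abs(v).sum() <= tau:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    # Largest index where the soft-threshold level stays below u.
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def sparse_l1_project(v, k, tau):
    """Heuristic map into {x : ||x||_0 <= k, ||x||_1 <= tau}:
    hard-threshold to the k largest entries, then l1-project them."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = project_l1(v[idx], tau)
    return out
```

The point of GAME and CLASH is precisely that handling the two constraints jointly, with guarantees, beats such naive compositions; the sketch only shows what the feasible set looks like.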

Anastasios Kyrillidis, Luca Baldassarre, Marwa El Halabi, Quoc Tran-Dinh and Volkan Cevher, ``Structured sparsity: discrete and convex approaches”, in Compressed Sensing and its Applications, Springer, 2014.
Compressive sensing (CS) exploits sparsity to recover sparse or compressible signals from dimensionality-reducing, non-adaptive sensing mechanisms. Sparsity is also used to enhance interpretability in machine learning and statistics applications: While the ambient dimension is vast in modern data analysis problems, the relevant information therein typically resides in a much lower dimensional space. However, many solutions proposed nowadays do not leverage the true underlying structure. Recent results in CS extend the simple sparsity idea to more sophisticated structured sparsity models, which describe the interdependency between the nonzero components of a signal, increasing the interpretability of the results and leading to better recovery performance. In order to better understand the impact of structured sparsity, in this chapter we analyze the connections between the discrete models and their convex relaxations, highlighting their relative advantages. We start with the general group sparse model and then elaborate on two important special cases: the dispersive and the hierarchical models. For each, we present the models in their discrete nature, discuss how to solve the ensuing discrete problems and then describe convex relaxations. We also consider more general structures as defined by set functions and present their convex proxies. Further, we discuss efficient optimization solutions for structured sparsity problems and illustrate structured sparsity in action via three applications.
@incollection{kyrillidis2015structured, title={Structured sparsity: Discrete and convex approaches}, author={Kyrillidis, Anastasios and Baldassarre, Luca and El Halabi, Marwa and Tran-Dinh, Quoc and Cevher, Volkan}, booktitle={Compressed Sensing and its Applications}, pages={341--387}, year={2015}, publisher={Springer} }
Theses
 Anastasios Kyrillidis, ``Rigorous optimization recipes for sparse and low-rank inverse problems with applications in data sciences”, Ph.D. Thesis, School of Computer and Communication Sciences, EPFL, September 2014.
 Anastasios Kyrillidis, ``Polynomial-complexity computation of the M-phase vector that maximizes a rank-deficient quadratic form”, M.Sc. Thesis, Dept. of Electronic and Computer Engineering, Technical University of Crete, July 2010.