Conference papers

Fangshuo Liao, Anastasios Kyrillidis, ``Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts”, Twenty-ninth International Conference on Artificial Intelligence and Statistics (AISTATS-26), 2026.

Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router's learning process is "guided" by the experts, that recovers the teacher's parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights in understanding the optimization landscape of the MoE architecture.
```
  @inproceedings{liao2026guided,
title = {Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts},
author = {Liao, Fangshuo and Kyrillidis, Anastasios},
booktitle = {Proceedings of The 29th International Conference on Artificial Intelligence and Statistics},
year = {2026},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR}
}
```
Junhyung Lyle Kim, Nai-Hui Chia, Anastasios Kyrillidis, ``A Catalyst Framework for the Quantum Linear System Problem via the Proximal Point Algorithm”, Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), 2026.

Solving systems of linear equations is a fundamental problem, but it can be computationally intensive for classical algorithms in high dimensions. Existing quantum algorithms can achieve exponential speedups for the quantum linear system problem (QLSP) in terms of the problem dimension, but the advantage is bottlenecked by condition number of the coefficient matrix. In this work, we propose a new quantum algorithm for QLSP inspired by the classical proximal point algorithm (PPA). Our proposed method can be viewed as a meta-algorithm that allows inverting a modified matrix via an existing QLSP solver, thereby directly approximating the solution vector instead of approximating the inverse of the coefficient matrix. By carefully choosing the step size, the proposed algorithm can effectively precondition the linear system to mitigate the dependence on condition numbers that hindered the applicability of previous approaches. Importantly, this is the first iterative framework for QLSP where a tunable parameter and initialization allows controlling the trade-off between the runtime and approximation error.
```
  @inproceedings{kim2026catalyst,
title = {A Catalyst Framework for the Quantum Linear System Problem via the Proximal Point Algorithm},
author = {Kim, Junhyung Lyle and Chia, Nai-Hui and Kyrillidis, Anastasios},
booktitle = {Fortieth {AAAI} Conference on Artificial Intelligence},
pages = {22582--22590},
year = {2026},
publisher = {AAAI} Press},
doi = {10.1609/aaai.v40i27.39418}
}
```
Yehya Farhat, Hamza ElMokhtar Shili, Fangshuo Liao, Chen Dun, Mirian Hipolito Garcia, Guoqing Zheng, Ahmed Awadallah, Robert Sim, Dimitrios Dimitriadis, Anastasios Kyrillidis, ``Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings”, Advances in Neural Information Processing Systems (NeurIPS-25), 2025.

Mixture-of-Experts (MoEs) achieve scalability by dynamically activating subsets of their components. Yet, understanding how expertise emerges through joint training of gating mechanisms and experts remains incomplete, especially in scenarios without clear task partitions. Motivated by inference costs and data heterogeneity, we study how joint training of gating functions and experts can dynamically allocate domain-specific expertise across multiple underlying data distributions. As an outcome of our framework, we develop an instance tailored specifically to decentralized training scenarios, introducing Dynamically Decentralized Orchestration of MoEs or DDOME. DDOME leverages heterogeneity emerging from distributional shifts across decentralized data sources to specialize experts dynamically. By integrating a pretrained common expert to inform a gating function, DDOME achieves personalized expert subset selection on-the-fly, facilitating just-in-time personalization. We empirically validate DDOME within a Federated Learning (FL) context: DDOME attains from 4% up to an 24% accuracy improvement over state-of-the-art FL baselines in image and text classification tasks, while maintaining competitive zero-shot generalization capabilities. Furthermore, we provide theoretical insights confirming that the joint gating-experts training is critical for achieving meaningful expert specialization.
```
  @inproceedings{farhat2025learning,
title = {Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings},
author = {Farhat, Yehya and Shili, Hamza ElMokhtar and Liao, Fangshuo and Dun, Chen and Garcia, Mirian Hipolito and Zheng, Guoqing and Awadallah, Ahmed and Sim, Robert and Dimitriadis, Dimitrios and Kyrillidis, Anastasios},
booktitle = {Advances in Neural Information Processing Systems 38 (NeurIPS)},
year = {2025}
}
```
Chen Dun, Mirian del Carmen Hipolito Garcia, Guoqing Zheng, Ahmed Hassan Awadallah, Robert Sim, Anastasios Kyrillidis, ``Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task Adaptation”, Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), 2025.

Large Language Models demonstrate versatility across diverse tasks like summarization and mathematical problem-solving without task-specific training. However, current approaches rely on prompt instruction tuning for individual downstream tasks. We address how to scale prompt tuning across multiple heterogeneous tasks and data distributions simultaneously. We introduce Mixture of Prompts (MoPs) with smart gating that identifies relevant skills in different prompt groups and dynamically combines experts based on target tasks. We show that MoPs work regardless of model compression techniques, data sources, or task composition. The approach substantially reduces perplexity -- from 9% up to 70% in non-i.i.d. distributed cases and from 3% up to 30% in centralized cases compared to baselines.
```
  @inproceedings{dun2025sweeping,
title = {Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for {LLM} Task Adaptation},
author = {Dun, Chen and Garcia, Mirian del Carmen Hipolito and Zheng, Guoqing and Awadallah, Ahmed Hassan and Sim, Robert and Kyrillidis, Anastasios},
booktitle = {Thirty-Ninth {AAAI} Conference on Artificial Intelligence},
pages = {16426--16434},
year = {2025},
publisher = {AAAI} Press},
doi = {10.1609/aaai.v39i16.33804}
}
```
Fangshuo Liao, Wenyi Su, Anastasios Kyrillidis, ``Provable Model-Parallel Distributed Principal Component Analysis with Parallel Deflation”, Conference on Parsimony and Learning (CPAL-25), 2025.

We study a distributed Principal Component Analysis (PCA) framework where each worker targets a distinct eigenvector and refines its solution by updating from intermediate solutions provided by peers deemed as "superior". Drawing intuition from the deflation method in centralized eigenvalue problems, our approach breaks the sequential dependency in the deflation steps and allows asynchronous updates of workers, while incurring only a small communication cost. To our knowledge, a gap in the literature -- the theoretical underpinning of such distributed, dynamic interactions among workers -- has remained unaddressed. This paper offers a theoretical analysis explaining why, how, and when these intermediate, hierarchical updates lead to practical and provable convergence in distributed environments. Despite being a theoretical work, our prototype implementation demonstrates that such a distributed PCA algorithm converges effectively and in scalable way: through experiments, our proposed framework offers comparable performance to EigenGame-mu, the state-of-the-art model-parallel PCA solver.
```
  @inproceedings{liao2025provable,
title = {Provable Model-Parallel Distributed Principal Component Analysis with Parallel Deflation},
author = {Liao, Fangshuo and Su, Wenyi and Kyrillidis, Anastasios},
booktitle = {Conference on Parsimony and Learning (CPAL)},
series = {Proceedings of Machine Learning Research},
volume = {280},
pages = {392--416},
year = {2025},
publisher = {PMLR}
}
```
David A. Quiroga, Jason Han, Anastasios Kyrillidis, ``Quantum EigenGame for excited state calculation”, Conference on Parsimony and Learning (CPAL-25), 2025.

Computing the excited states of a given Hamiltonian is computationally hard for large systems, but methods that do so using quantum computers scale tractably. This problem is equivalent to the PCA problem where we are interested in decomposing a matrix into a collection of principal components. Classically, PCA is a well-studied problem setting, for which both centralized and distributed approaches have been developed. On the distributed side, one recent approach is that of EigenGame, a game-theoretic approach to finding eigenvectors where each eigenvector reaches a Nash equilibrium either sequentially or in parallel. With this work, we extend the EigenGame algorithm for both a 0th-order approach and for quantum computers, and harness the framework that quantum computing provides in computing excited states. Results show that using the Quantum EigenGame allows us to converge to excited states of a given Hamiltonian without the need of a deflation step. We also develop theory on error accumulation for finite-differences and parameterized approaches.
```
  @inproceedings{quiroga2025quantum,
title = {Quantum EigenGame for excited state calculation},
author = {Quiroga, David A. and Han, Jason and Kyrillidis, Anastasios},
booktitle = {Conference on Parsimony and Learning (CPAL)},
series = {Proceedings of Machine Learning Research},
volume = {280},
pages = {837--864},
year = {2025},
publisher = {PMLR}
}
```
Tom Pan, Evan Dramko, Mitchell D. Miller, George N. Phillips Jr., Anastasios Kyrillidis, ``RecCrysFormer: Refined Protein Structural Prediction from 3D Patterson Maps via Recycling Training Runs”, Conference on Parsimony and Learning (CPAL-25), 2025.

Determining protein structures at an atomic level remains a significant challenge in structural biology. We introduce RecCrysFormer, a hybrid model that exploits the strengths of transformers with the aim of integrating experimental and ML approaches to protein structure determination from crystallographic data. RecCrysFormer leverages Patterson maps and incorporates known standardized partial structures of amino acid residues to directly predict electron density maps, which are essential for constructing detailed atomic models through crystallographic refinement processes. RecCrysFormer benefits from a "recycling" training regimen that iteratively incorporates results from crystallographic refinements and previous training runs as additional inputs in the form of template maps. Using a preliminary dataset of synthetic peptide fragments based on Protein Data Bank, RecCrysFormer achieves good accuracy in structural predictions and shows robustness against variations in crystal parameters, such as unit cell dimensions and angles.
```
  @inproceedings{pan2025reccrysformer,
title = {RecCrysFormer: Refined Protein Structural Prediction from 3D Patterson Maps via Recycling Training Runs},
author = {Pan, Tom and Dramko, Evan and Miller, Mitchell D. and Phillips Jr., George N. and Kyrillidis, Anastasios},
booktitle = {Conference on Parsimony and Learning (CPAL)},
series = {Proceedings of Machine Learning Research},
volume = {280},
pages = {897--912},
year = {2025},
publisher = {PMLR}
}
```
Junhyung Lyle Kim, Mohammad Taha Toghani, Cesar A. Uribe, Anastasios Kyrillidis, ``Adaptive Federated Learning with Auto-Tuned Clients”, Twelfth International Conference on Learning Representations (ICLR-24), 2024.

Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose Delta-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios.
```
  @inproceedings{kim2024adaptive,
title = {Adaptive Federated Learning with Auto-Tuned Clients},
author = {Kim, Junhyung Lyle and Toghani, Mohammad Taha and Uribe, C{\'e}sar A. and Kyrillidis, Anastasios},
booktitle = {The Twelfth International Conference on Learning Representations (ICLR)},
year = {2024},
publisher = {OpenReview.net}
}
```
Fangshuo Liao, Junhyung Lyle Kim, Cruz Barnum, Anastasios Kyrillidis, ``On the Error-Propagation of Inexact Hotelling’s Deflation for Principal Component Analysis”, Forty-first International Conference on Machine Learning (ICML-24), 2024.

Principal Component Analysis (PCA) aims to find subspaces spanned by the so-called principal components that best represent the variance in the dataset. The deflation method is a popular meta-algorithm that sequentially finds individual principal components, starting from the most important ones and working towards the less important ones. However, as deflation proceeds, numerical errors from the imprecise estimation of principal components propagate due to its sequential nature. This paper mathematically characterizes the error propagation of the inexact Hotelling's deflation method. We consider two scenarios: i) when the sub-routine for finding the leading eigenvector is abstract and can represent various algorithms; and ii) when power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than the sub-routine agnostic case. For both scenarios, we explicitly characterize how the errors progress and affect subsequent principal component estimations.
```
  @inproceedings{liao2024error,
title = {On the Error-Propagation of Inexact Hotelling's Deflation for Principal Component Analysis},
author = {Liao, Fangshuo and Kim, Junhyung Lyle and Barnum, Cruz and Kyrillidis, Anastasios},
booktitle = {Forty-first International Conference on Machine Learning (ICML)},
pages = {29720--29747},
year = {2024},
publisher = {PMLR}
}
```
Fangshuo Liao, Anastasios Kyrillidis, ``Provable Accelerated Convergence of Nesterov’s Momentum for Deep ReLU Neural Networks”, International Conference on Algorithmic Learning Theory (ALT-24), 2024.

Current state-of-the-art analyses on the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Lojaciewicz (PL) condition and the restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show Nesterov's momentum achieves acceleration in theory for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, which -- to the best of our knowledge -- constitutes this work the first that proves accelerated convergence rate for non-trivial neural network architectures.
```
  @inproceedings{liao2024provable,
title = {Provable Accelerated Convergence of Nesterov's Momentum for Deep ReLU Neural Networks},
author = {Liao, Fangshuo and Kyrillidis, Anastasios},
booktitle = {International Conference on Algorithmic Learning Theory (ALT)},
pages = {732--784},
year = {2024},
publisher = {PMLR}
}
```
Emmanouil Kariotakis, Grigorios Tsagkatakis, Panagiotis Tsakalides, Anastasios Kyrillidis, ``Leveraging Sparse Input and Sparse Models: Efficient Distributed Learning in Resource-Constrained Environments”, Conference on Parsimony and Learning (CPAL-24), 2024.

Optimizing for reduced computational and bandwidth resources enables model training in less-than-ideal environments and paves the way for practical and accessible AI solutions. We develop a system leveraging sparsity across neural network input and intermediate layers for distributed image classification. Using transfer learning with a pre-trained feature extractor and the Independent Subnetwork Training algorithm, the approach subsamples input and intermediate features during training. Testing on CIFAR-10, NWPU-RESISC45, and aerial imagery datasets demonstrates that the model achieves comparable or even superior accuracy with only a fraction (50% or less) of the original image, particularly valuable for bandwidth-constrained deployments. The findings underscore the robustness of Vision Transformer features for sparse, distributed learning scenarios.
```
  @inproceedings{kariotakis2024leveraging,
title = {Leveraging Sparse Input and Sparse Models: Efficient Distributed Learning in Resource-Constrained Environments},
author = {Kariotakis, Emmanouil and Tsagkatakis, Grigorios and Tsakalides, Panagiotis and Kyrillidis, Anastasios},
booktitle = {Conference on Parsimony and Learning (CPAL)},
series = {Proceedings of Machine Learning Research},
volume = {234},
pages = {554--569},
year = {2024},
publisher = {PMLR}
}
```
Carlos Quintero-Pena, Wil Thomason, Zachary Kingston, Anastasios Kyrillidis, Lydia E. Kavraki, ``Stochastic Implicit Neural Signed Distance Functions for Safe Motion Planning under Sensing Uncertainty”, IEEE International Conference on Robotics and Automation (ICRA-24), 2024.

Motion planning under sensing uncertainty is critical for robots in unstructured environments to guarantee safety for both the robot and any nearby humans. Most work on planning under uncertainty does not scale to high-dimensional robots such as manipulators, assumes simplified geometry of the robot or environment, or requires per-object knowledge of noise. Instead, we propose a method that directly models sensor-specific aleatoric uncertainty to find safe motions for high-dimensional systems in complex environments, without exact knowledge of environment geometry. We combine a novel implicit neural model of stochastic signed distance functions with a hierarchical optimization-based motion planner to plan low-risk motions without sacrificing path quality. Our method also explicitly bounds the risk of the path, offering trustworthiness. We empirically validate that our method produces safe motions and accurate risk bounds and is safer than baseline approaches.
```
  @inproceedings{quintero2024stochastic,
title = {Stochastic Implicit Neural Signed Distance Functions for Safe Motion Planning under Sensing Uncertainty},
author = {Quintero-Pe{\~{n}a, Carlos and Thomason, Wil and Kingston, Zachary and Kyrillidis, Anastasios and Kavraki, Lydia E.},
booktitle = {IEEE} International Conference on Robotics and Automation (ICRA)},
pages = {2360--2367},
year = {2024},
publisher = {IEEE},
doi = {10.1109/ICRA57147.2024.10610773}
}
```
John Chen, Chen Dun, Anastasios Kyrillidis, ``Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size”, IEEE International Symposium on Information Theory (ISIT-24), 2024.

Advances in Semi-Supervised Learning have substantially reduced the performance gap with supervised learning while using fewer labels. However, these improvements have come with significantly increased computational costs during training. We propose Curriculum Batch Size (CBS), an unlabeled batch size curriculum which exploits the natural training dynamics of deep neural networks. This approach uses smaller unlabeled batches initially, then gradually increases them throughout training. Fast FixMatch achieves 2.1x-3.4x reduced training computations on CIFAR-10 compared to vanilla FixMatch while attaining the same error rate, with similar improvements on CIFAR-100, SVHN, and STL-10, and 2.6x-3.3x reduced training computations in federated and online learning scenarios.
```
  @inproceedings{chen2024fast,
title = {Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size},
author = {Chen, John and Dun, Chen and Kyrillidis, Anastasios},
booktitle = {IEEE} International Symposium on Information Theory (ISIT)},
pages = {1836--1841},
year = {2024},
publisher = {IEEE},
doi = {10.1109/ISIT57864.2024.10619518}
}
```
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava, ``Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time”, Advances in Neural Information Processing Systems (NeurIPS-23), 2023.

Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for the deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the size of key-value embedding stored during the generation process (KV cache) can easily surpass the model size. The enormous size of the KV cache puts constraints on the inference batch size, which is crucial for high throughput inference workload. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression.
```
  @inproceedings{liu2023scissorhands,
title = {Scissorhands: Exploiting the Persistence of Importance Hypothesis for {LLM} {KV} Cache Compression at Test Time},
author = {Liu, Zichang and Desai, Aditya and Liao, Fangshuo and Wang, Weitao and Xie, Victor and Xu, Zhaozhuo and Kyrillidis, Anastasios and Shrivastava, Anshumali},
booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS)},
year = {2023}
}
```
Erdong Hu, Yuxin Tang, Anastasios Kyrillidis, Chris Jermaine, ``Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat”, IEEE/CVF International Conference on Computer Vision (ICCV-23), 2023.

We carefully evaluate a number of algorithms for learning in a federated environment, and test their utility for a variety of image classification tasks. We consider many issues that have not been adequately considered before: whether learning over data sets that do not have diverse sets of images affects the results; whether to use a pre-trained feature extraction "backbone"; how to evaluate learner performance (we argue that classification accuracy is not enough), among others. Overall, across a wide variety of settings, we find that vertically decomposing a neural network seems to give the best results, and outperforms more standard reconciliation-based methods.
```
  @inproceedings{hu2023federated,
title = {Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat},
author = {Hu, Erdong and Tang, Yuxin and Kyrillidis, Anastasios and Jermaine, Chris},
booktitle = {IEEE/CVF} International Conference on Computer Vision (ICCV)},
pages = {19328--19339},
year = {2023},
publisher = {IEEE}
}
```
Cameron R. Wolfe, Anastasios Kyrillidis, ``Better Schedules for Low Precision Training of Deep Neural Networks”, Asian Conference on Machine Learning (ACML-23), 2023.

Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature. From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules.
```
  @inproceedings{wolfe2023better,
title = {Better Schedules for Low Precision Training of Deep Neural Networks},
author = {Wolfe, Cameron R. and Kyrillidis, Anastasios},
booktitle = {Asian Conference on Machine Learning (ACML)},
series = {Proceedings of Machine Learning Research},
volume = {222},
year = {2023},
publisher = {PMLR}
}
```
David A. Quiroga, Anastasios Kyrillidis, ``Using non-convex optimization in quantum process tomography: Factored gradient descent is tough to beat”, IEEE International Conference on Rebooting Computing (ICRC-23), 2023. (Best Poster Award)

We propose a non-convex optimization algorithm, based on the Burer-Monteiro (BM) factorization, for the quantum process tomography problem, in order to estimate a low-rank process matrix for near-unitary quantum gates. We compare our approach against state of the art convex optimization approaches based on gradient descent. We use a reduced set of initial states and measurement operators that require 2*8^n circuit settings, as well as O(4^n) measurements for an underdetermined setting. We find our algorithm converges faster and achieves higher fidelities than state of the art, both in terms of measurement settings, as well as in terms of noise tolerance, in the cases of depolarizing and Gaussian noise models.
```
  @inproceedings{quiroga2023nonconvex,
title = {Using non-convex optimization in quantum process tomography: Factored gradient descent is tough to beat},
author = {Quiroga, David A. and Kyrillidis, Anastasios},
booktitle = {IEEE} International Conference on Rebooting Computing (ICRC)},
pages = {1--10},
year = {2023},
publisher = {IEEE},
doi = {10.1109/ICRC60800.2023.10386455}
}
```
Qihan Wang, Chen Dun, Fangshuo Liao, Chris Jermaine, Anastasios Kyrillidis, ``LOFT: Finding Lottery Tickets through Filter-wise Training”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS-23), 2023.

Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist “winning tickets” in large neural networks. These tickets represent “sparse” versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to pretrain the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well-represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed as LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT i) preserves and finds good lottery tickets, while ii) it achieves non-trivial computation and communication savings, and maintains comparable or even better accuracy than other pretraining methods.
```
  		
  @article{loft2023wang,
title = 	 {LOFT: Finding Lottery Tickets through Filter-wise Training},
author =       {Wang, Qihan and Dun, Chen and Liao, Fangshuo and Jermaine, Chris and Kyrillidis, Anastasios},
booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
pages = 	 {6498--6526},
year = 	 {2023},
editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
volume = 	 {206},
series = 	 {Proceedings of Machine Learning Research},
month = 	 {25--27 Apr},
publisher =    {PMLR}
}
```
Chen Dun, Mirian Hipolito, Chris Jermaine, Dimitrios Dimitriadis, Anastasios Kyrillidis, ``Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS-23), 2023.

Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, AsyncDrop achieves better performance compared to state of the art asynchronous methodologies, while resulting in less communication and training time overheads. The key idea revolves around creating “submodels” out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.
```
  		
  @article{asyncdrop2023dun,
title = 	 {Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout},
author =       {Dun, Chen and Hipolito, Mirian and Jermaine, Chris and Dimitriadis, Dimitrios and Kyrillidis, Anastasios},
booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
pages = 	 {6630--6660},
year = 	 {2023},
editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
volume = 	 {206},
series = 	 {Proceedings of Machine Learning Research},
month = 	 {25--27 Apr},
publisher =    {PMLR}
 }
```
Zheyang Xiong, Fangshuo Liao, Anastasios Kyrillidis, ``Strong Lottery Ticket Hypothesis with perturbation”, Twenty-sixth International Conference on Artificial Intelligence and Statistics (AISTATS-23), 2023.

The strong Lottery Ticket Hypothesis (LTH) (Ramanujan et al., 2019; Zhou et al., 2019) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization. In particular, we focus on the following open questions: By allowing a perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbation? We answer the first question by first extending the theoretical result on subset sum problem (Lueker, 1998) to allow perturbation on the candidates. Applying this result to the neural network setting, we show that by allowing scale perturbation, we can reduce the over-parameterization requirement of the strong LTH. To answer the second question, we show via experiments that the perturbed weight achieved by the projected SGD shows better performance under the strong LTH pruning.
```
  		
  @article{slth2023xiong,
title = 	 {Strong Lottery Ticket Hypothesis with epsilon–perturbation},
author =       {Xiong, Zheyang and Liao, Fangshuo and Kyrillidis, Anastasios},
booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
pages = 	 {6879--6902},
year = 	 {2023},
editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
volume = 	 {206},
series = 	 {Proceedings of Machine Learning Research},
month = 	 {25--27 Apr},
publisher =    {PMLR}
}
```
Carlos Quintero-Pena, Zachary Kingston, Tianyang Pan, Rahul Shome, Anastasios Kyrillidis, and Lydia E. Kavraki, ``Optimal Grasps and Placements for Task and Motion Planning in Clutter”, IEEE International Conference on Robotics and Automation (ICRA), 2023.

Many methods that solve robot planning problems, such as task and motion planners, employ discrete symbolic search to find sequences of valid symbolic actions that are grounded with motion planning. Much of the efficacy of these planners lies in this grounding—bad placement and grasp choices can lead to inefficient planning when a problem has many geometric constraints. Moreover, grounding methods such as naive sampling often fail to find appropriate values for these choices in the presence of clutter. Towards efficient task and motion planning, we present a novel optimization-based approach for grounding to solve cluttered problems that have many constraints that arise from geometry. Our approach finds an optimal grounding and can provide feedback to discrete search for more effective planning. We demonstrate our method against baseline methods in complex simulated environments.
```
		@article{quinterooptimal,
title={Optimal Grasps and Placements for Task and Motion Planning in Clutter},
author={Quintero-Pena, Carlos and Kingston, Zachary and Pan, Tianyang and Shome, Rahul and Kyrillidis, Anastasios and Kavraki, Lydia E}
}
```
Syed Rizvi, Chen Dun, Anastasios Kyrillidis, ``PCRIST: Variance Reduction through Periodic Centralized Training in Distributed Subnetwork Training of Residual Networks”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), in proceedings of “Timely and Private Machine Learning over Networks” workshop, 2023.

In this work, we propose Periodic Centralized ResIST (PCRIST), a distributed training protocol for residual Convolutional Neural Networks which reduces variance during training in order to recover model performance in distributed settings. PCRIST introduces short periods of centralized training into the ResIST distributed training protocol, which aims to reduce the variance among distributed workers each training subnets of the global Resnet. PCRIST improves on the performance of ResIST while maintaining many of its advantages in terms of communication efficiency compared to baseline distributed training protocols. We evaluate PCRIST on two benchmark image classification tasks, and demonstrate superior performance to the ResIST training protocol while maintaining the benefits of communication efficiency seen with the ResIST protocol.
```
		@article{rivzi2023pcrist,
title={PCRIST: Variance Reduction through Periodic Centralized Training in Distributed Subnetwork Training of Residual Networks},
author={Syed Rizvi, Chen Dun, Anastasios Kyrillidis},
journal={preprint},
year={2023}
}
```
Junhyung Lyle Kim, Mohammad Taha Toghani, Cesar A. Uribe, Anastasios Kyrillidis, ``Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography”, IEEE Conference on Decision and Control (CDC), 2022.

We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterizing the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sub-linear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.


  		@article{kim2022local,
  title={Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography},
  author={Kim, Junhyung Lyle and Toghani, Mohammad Taha and Uribe, C{\'e}sar A and Kyrillidis, Anastasios},
  journal={arXiv preprint arXiv:2203.11579},
  year={2022}
}

Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin, ``PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication”, International Conference on Learning Representations (ICLR), 2022.

Graph Convolutional Networks (GCNs) is the state-of-the-art method for learning graph-structured data, and training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs prohibitive overhead of communicating node features and feature gradients among partitions for every GCN layer in each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple-yet-effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline for efficient GCN training, as communicated node features/gradients will become stale and thus can harm the convergence, negating the pipeline benefit. Notably, little is known regarding the convergence rate of GCN training with both stale features and stale feature gradients. This work not only provides a theoretical convergence guarantee but also finds the convergence rate of PipeGCN to be close to that of the vanilla distributed GCN training without staleness. Furthermore, we develop a smoothing method to further improve PipeGCN’s convergence. Extensive experiments show that PipeGCN can largely boost training throughput (up to 2.2x) while achieving the same accuracy as its vanilla counterpart and outperforming existing full-graph training methods. All code will be released publicly upon acceptance.


  		@inproceedings{wan2021pipegcn,
  title={PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication},
  author={Wan, Cheng and Li, Youjie and Wolfe, Cameron R and Kyrillidis, Anastasios and Kim, Nam Sung and Lin, Yingyan},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

Binhang Yuan, Cameron Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis and Chris Jermaine, ``Distributed Learning of Deep Neural Networks Using Independent Subnet Training”, International Conference on Very Large Databases (VLDB), 2022.

Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally łmodel parallelž approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.
```
		@article{yuan2022distributed,
title={Distributed learning of fully connected neural networks using independent subnet training},
author={Yuan, Binhang and Wolfe, Cameron R and Dun, Chen and Tang, Yuxin and Kyrillidis, Anastasios and Jermaine, Chris},
journal={Proceedings of the VLDB Endowment},
volume={15},
number={8},
pages={1581--1590},
year={2022},
publisher={VLDB Endowment}
}
```
Ahmed Imtiaz Humayun, Randall Balestriero, Anastasios Kyrillidis, Richard Baraniuk, ``No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-22), 2022.

Centroid based clustering methods such as k-means, kmedoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real world datasets often contain inherent abnormalities e.g. repeated samples and sampling bias, that manifest imbalanced clustering. We propose to remedy such scenario by introducing a maximal radius constraint r on the clusters formed by the centroids i.e. samples from a same cluster should not be more than 2r apart in term of `2 distance. We achieve this constraint by solving a semi-definite program, followed by a linear assignment problem with quadratic constraints. Through qualitative results, we show that our proposed method is robust towards dataset imbalances and sampling artefacts. To the best of our knowledge, ours is the first constrained k-means clustering method with hard radius constraints
```
		@inproceedings{humayun2022no,
title={No More Than 6ft Apart: Robust $k$-Means via Radius Upper Bounds},
author={Humayun, Ahmed Imtiaz and Balestriero, Randall and Kyrillidis, Anastasios and Baraniuk, Richard},
booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={4433--4437},
year={2022},
organization={IEEE}
}
```
John Chen, Cameron Wolfe, Zhao Li Anastasios Kyrillidis, ``Demon: Decaying momentum helps neural network training”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-22), 2022.

Momentum is a widely used technique for gradient-based optimizers in deep learning. Here, we propose a decaying momentum (DEMON) hyperparameter rule. We conduct large-scale empirical analysis of momentum decay methods for modern neural network optimization and compare to the most popular learning rate decay schedules. Across 28 relevant combinations of models, epochs, datasets, and optimizers, DEMON achieves Top-1 and Top-3 performance in 39\% and 85\% of cases, respectively, almost doubling the second-placed cosine learning rate schedule at 17\% and 60\%, respectively. DEMON consistently outperforms other widely-used schedulers including, but not limited to, the learning rate step schedule, linear schedule, OneCycle schedule, and exponential schedule. Compared with the widely-used learning rate step schedule, DEMON is less sensitive to parameter tuning, which is critical to training neural networks in practice. Results are demonstrated across a variety of settings and architectures, ncluding image classification models, generative models, and language models. DEMON is easy to implement, requires no additional tuning, and incurs almost no extra computational overhead compared to the vanilla counterparts. Code is readily available.
```
		@inproceedings{chen2022demon,
title={Demon: Improved Neural Network Training with Momentum Decay},
author={Chen, John and Wolfe, Cameron and Li, Zhao and Kyrillidis, Anastasios},
booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={3958--3962},
year={2022},
organization={IEEE}
}
```
Chen Dun, Cameron Wolfe, Anastasios Kyrillidis, ``ResIST: Layer-Wise Decomposition of ResNets for Distributed Training”, Conference on Uncertainty in Artificial Intelligence (UAI-22), 2022.

We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
```
		@inproceedings{dun2022resist,
title={ResIST: Layer-Wise Decomposition of ResNets for Distributed Training},
author={Dun, Chen and Wolfe, Cameron R and Jermaine, Chris and Kyrillidis, Anastasios},
booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
year={2022}
}
```
John Chen, Samarth Sinha, Anastasios Kyrillidis, ``StackMix: A complementary Mix algorithm”, Conference on Uncertainty in Artificial Intelligence (UAI-22), 2022.

Techniques combining multiple images as input/output have proven to be effective data augmentations for training convolutional neural networks. In this paper, we present StackMix: each input is presented as a concatenation of two images, and the label is the mean of the two one-hot labels. On its own, StackMix rivals other widely used methods in the Mix’’ line of work. More importantly, unlike previous work, significant gains across a variety of benchmarks are achieved by combining StackMix with existing Mix augmentation, effectively mixing more than two images. E.g., by combining StackMix with CutMix, test error in the supervised setting is improved across a variety of settings over CutMix, including 0.8\% on ImageNet, 3\% on Tiny ImageNet, 2\% on CIFAR-100, 0.5\% on CIFAR-10, and 1.5\% on STL-10. Similar results are achieved with Mixup. We further show that gains hold for robustness to common input corruptions and perturbations at varying severities with a 0.7\% improvement on CIFAR-100-C, by combining StackMix with AugMix over AugMix. On its own, improvements with StackMix hold across different number of labeled samples on CIFAR-100, maintaining approximately a 2\% gap in test accuracy –down to using only 5\% of the whole dataset– and is effective in the semi-supervised setting with a 2\% improvement with the standard benchmark -model. Finally, we perform an extensive ablation study to better understand the proposed methodology.
```
		@inproceedings{chen2022stackmix,
title={StackMix: A complementary {M}ix algorithm},
author={Chen, John and Sinha, Samarth and Kyrillidis, Anastasios},
booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
year={2022}
}
```
John Chen, Cameron Wolfe, Anastasios Kyrillidis, ``REX: Revisiting Budgeted Training with an Improved Schedule”, 5th Conference on Machine Learning and Systems (MLSys-22), 2022.

Deep learning practitioners often operate on a computational and monetary budget. Thus, it is critical to design optimization algorithms that perform well under any budget. The linear learning rate schedule is considered the best budget-aware schedule, as it outperforms most other schedules in the low budget regime. On the other hand, learning rate schedules – such as the 30-60-90 step schedule – are known to achieve high performance when the model can be trained for many epochs. Yet, it is often not known a priori whether one’s budget will be large or small; thus, the optimal choice of learning rate schedule is made on a case-by-case basis. In this paper, we frame the learning rate schedule selection problem as a combination of selecting a profile (i.e., the continuous function that models the learning rate schedule), and choosing a sampling rate (i.e., how frequently the learning rate is updated/sampled from this profile). We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule, which we evaluate across seven different experimental settings with both SGD and Adam optimizers. REX outperforms the linear schedule in the low budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules (linear, step, exponential, cosine, step decay on plateau, and OneCycle) in both high and low budget regimes. Furthermore, REX requires no added computation, storage, or hyperparameters.


  		@article{chen2022rex,
  title={REX: Revisiting Budgeted Training with an Improved Schedule},
  author={Chen, John and Wolfe, Cameron and Kyrillidis, Tasos},
  journal={Proceedings of Machine Learning and Systems},
  volume={4},
  pages={64--76},
  year={2022}
}

Cameron R Wolfe, Anastasios Kyrillidis, ``i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery”, 4th Annual Learning for Dynamics & Control Conference (L4DC-22), 2022.

We propose a novel, structured pruning algorithm for neural networks—the iterative, Sparse Structured Pruning algorithm, dubbed as i-SpaSP. Inspired by ideas from sparse signal recovery, i-SpaSP operates by iteratively identifying a larger set of important parameter groups (e.g., filters or neurons) within a network that contribute most to the residual between pruned and dense network output, then thresholding these groups based on a smaller, pre-defined pruning ratio. For both two-layer and multi-layer network architectures with ReLU activations, we show the error induced by pruning with i-SpaSP decays polynomially, where the degree of this polynomial becomes arbitrarily large based on the sparsity of the dense network’s hidden representations. In our experiments, i-SpaSP is evaluated across a variety of datasets (i.e., MNIST, ImageNet, and XNLI) and architectures (i.e., feed forward networks, ResNet34, MobileNetV2, and BERT), where it is shown to discover high-performing sub-networks and improve upon the pruning efficiency of provable baseline methodologies by several orders of magnitude. Put simply, i-SpaSP is easy to implement with automatic differentiation, achieves strong empirical results, comes with theoretical convergence guarantees, and is efficient, thus distinguishing itself as one of the few computationally efficient, practical, and provable pruning algorithms.
```
		@inproceedings{wolfe2022spasp,
title={i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery},
author={Wolfe, Cameron R and Kyrillidis, Anastasios},
booktitle={Learning for Dynamics and Control Conference},
pages={248--262},
year={2022},
organization={PMLR}
}
```
Junhyung Lyle Kim, Panos Toulis, Anastasios Kyrillidis, ``Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum”, 4th Annual Learning for Dynamics & Control Conference (L4DC-22), 2022.

Stochastic gradient descent with momentum (SGDM) is the dominant algorithm in many optimization scenarios, including convex optimization instances and non-convex neural network training. Yet, in the stochastic setting, momentum interferes with gradient noise, often leading to specific step size and momentum choices in order to guarantee convergence, set aside acceleration. Proximal point methods, on the other hand, have gained much attention due to their numerical stability and elasticity against imperfect tuning. Their stochastic accelerated variants though have received limited attention: how momentum interacts with the stability of (stochastic) proximal point methods remains largely unstudied. To address this, we focus on the convergence and stability of the stochastic proximal point algorithm with momentum (SPPAM), and show that SPPAM allows a faster linear convergence to a neighborhood compared to stochastic proximal point algorithm (SPPA) with a better contraction factor, under proper hyperparameter tuning. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM, allowing a wider range of step size and momentum that lead to convergence.
```
		@inproceedings{kim2022convergence,
title={Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum},
author={Kim, Junhyung Lyle and Toulis, Panos and Kyrillidis, Anastasios},
booktitle={Learning for Dynamics and Control Conference},
pages={1034--1047},
year={2022},
organization={PMLR}
}
```
- John Chen, Qihan Wang, Anastasios Kyrillidis, ``Mitigating deep double descent by concatenating inputs”, ACM International Conference on Information and Knowledge Management (CIKM-21), 2021.
The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomena and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number epochs.
```
		@article{chen2021mitigating,
title={Mitigating deep double descent by concatenating inputs},
author={Chen, John and Wang, Qihan and Kyrillidis, Anastasios},
journal={arXiv preprint arXiv:2107.00797},
year={2021}
}
```
Carlos Quintero-Pena, Anastasios Kyrillidis, Lydia E Kavraki, ``Robust Optimization-based Motion Planning for high-DOF Robots under Sensing Uncertainty”, IEEE International Conference on Robotics and Automation (ICRA-21), 2021.

Motion planning for high degree-of-freedom (DOF) robots is challenging, especially when acting in complex environments under sensing uncertainty. While there is significant work on how to plan under state uncertainty for low-DOF robots, existing methods cannot be easily translated into the high-DOF case, due to the complex geometry of the robot’s body and its environment. In this paper, we present a method that enhances optimization-based motion planners to produce robust trajectories for high-DOF robots for convex obstacles. Our approach introduces robustness into planners that are based on sequential convex programming: We reformulate each convex subproblem as a robust optimization problem that “protects” the solution against deviations due to sensing uncertainty. The parameters of the robust problem are estimated by sampling from the distribution of noisy obstacles, and performing a first-order approximation of the signed distance function. The original merit function is updated to account for the new costs of the robust formulation at every step. The effectiveness of our approach is demonstrated on two simulated experiments that involve a full body square robot, that moves in randomly generated scenes, and a 7-DOF Fetch robot, performing tabletop operations. The results show nearly zero probability of collision for a reasonable range of the noise parameters for Gaussian and Uniform uncertainty.
```
		@inproceedings{quintero2021robust,
title={Robust Optimization-based Motion Planning for high-DOF Robots under Sensing Uncertainty},
author={Quintero-Pena, Carlos and Kyrillidis, Anastasios and Kavraki, Lydia E},
booktitle={2021 IEEE International Conference on Robotics and Automation (ICRA)},
year={2021}
}
```
Jacky Y. Zhang, Rajiv Khanna, Anastasios Kyrillidis, and Oluwasanmi Koyejo, ``Bayesian Coresets: Revisiting the Optimization Perspective”, Twenty-fourth International Conference on Artificial Intelligence and Statistics (AISTATS-21), 2021.

We explore the potential of continuous local search (CLS) in SAT solving by proposing a novel approach for finding a solution of a hybrid system of Boolean constraints. The algorithm is based on CLS combined with belief propagation on binary decision diagrams (BDDs). Our framework accepts all Boolean constraints that admit compact BDDs, including symmetric Boolean constraints and small-coefficient pseudo-Boolean constraints as interesting families. We propose a novel algorithm for efficiently computing the gradient needed by CLS. We study the capabilities and limitations of our versatile CLS solver, GradSAT, by applying it on many benchmark instances. The experimental results indicate that GradSAT can be a useful addition to the portfolio of existing SAT and MaxSAT solvers for solving Boolean satisfiability and optimization problems.
```
		@inproceedings{zhang2021bayesian,
title={Bayesian Coresets: Revisiting the Optimization Perspective},
author={Zhang, Jacky and Khanna, Rajiv and Kyrillidis, Anastasios and Koyejo, Oluwasanmi},
booktitle={International Conference on Artificial Intelligence and Statistics},
year={2021}
}
```
Anastasios Kyrillidis, Moshe Y. Vardi, and Zhiwei Zhang, ``On Continuous Local BDD-Based Search for Hybrid SAT Solving”, AAAI Conference on Artificial Intelligence (AAAI-21), 2021.

We explore the potential of continuous local search (CLS) in SAT solving by proposing a novel approach for finding a solution of a hybrid system of Boolean constraints. The algorithm is based on CLS combined with belief propagation on binary decision diagrams (BDDs). Our framework accepts all Boolean constraints that admit compact BDDs, including symmetric Boolean constraints and small-coefficient pseudo-Boolean constraints as interesting families. We propose a novel algorithm for efficiently computing the gradient needed by CLS. We study the capabilities and limitations of our versatile CLS solver, GradSAT, by applying it on many benchmark instances. The experimental results indicate that GradSAT can be a useful addition to the portfolio of existing SAT and MaxSAT solvers for solving Boolean satisfiability and optimization problems.
```
		@inproceedings{kyrillidis2021continuous,
title={On Continuous Local BDD-Based Search for Hybrid SAT Solving},
author={Kyrillidis, Anastasios and Vardi, Moshe and Zhang, Zhiwei},
booktitle={AAAI Conference on Artificial Intelligence},
year={2021}
}
```
John Chen, Vatsal Shah and Anastasios Kyrillidis, ``Negative sampling in semi-supervised learning”, International Conference on Machine Learning (ICML-20), 2020.

We introduce Negative Sampling in Semi-Supervised Learning (NS3L), a simple, fast, easy to tune algorithm for semi-supervised learning (SSL). NS3L is motivated by the success of negative sampling/contrastive estimation. We demonstrate that adding the NS3L loss to state-of-the-art SSL algorithms, such as the Virtual Adversarial Training (VAT), significantly improves upon vanilla VAT and its variant, VAT with Entropy Minimization. By adding the NS3L loss to MixMatch, the current state-of-the-art approach on semi-supervised tasks, we observe significant improvements over vanilla MixMatch. We conduct extensive experiments on the CIFAR10, CIFAR100, SVHN and STL10 benchmark datasets. Finally, we perform an ablation study for NS3L regarding its hyperparameter tuning.
```
		@inproceedings{chen2020negative,
title={Negative sampling in semi-supervised learning},
author={Chen, John and Shah, Vatsal and Kyrillidis, Anastasios},
booktitle={International Conference on Machine Learning},
year={2020}
}
```
Kelly Geyer, Anastasios Kyrillidis, and Amir Kalev, ``Low-rank regularization and solution uniqueness in over- parameterized matrix sensing”, Twenty-third International Conference on Artificial Intelligence and Statistics (AISTATS-20), 2020.

We consider the question whether algorithmic choices in over-parameterized linear matrix factorization introduce implicit low-rank regularization.We focus on the noiseless matrix sensing scenario over low-rank positive semi-definite (PSD) matrices over the reals, with a sensing mechanism that satisfies restricted isometry properties.Surprisingly, it was recently argued that for recovery of PSD matrices, gradient descent over a squared, full-rank factorized space introduces implicit low-rank regularization.Thus, a clever choice of the recovery algorithm avoids the need for explicit low-rank regularization. In this contribution, we prove that in fact, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique low-rank matrix recovery, without explicit or implicit regularization.Therefore, under these conditions, the set of PSD matrices that are consistent with the observed data, is a singleton, regardless of the algorithm used. Our numerical study indicates that this result is general and extends to cases beyond the those covered by the proof.
```
		@inproceedings{geyer2020low,
title={Low-rank regularization and solution uniqueness in over-parameterized matrix sensing},
author={Geyer, Kelly and Kyrillidis, Anastasios and Kalev, Amir},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={930--940},
year={2020}
}
```
Anastasios Kyrillidis, Anshumali Shrivastava, Moshe Y. Vardi, Zhiwei Zhang, ``FourierSAT: A Fourier Expansion-Based Algebraic Framework for Solving Hybrid Boolean Constraints”, AAAI Conference on Artificial Intelligence (AAAI-20), 2020.

The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side, especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers, has been remarkable. Yet, while SAT solvers aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF) have become quite mature, SAT solvers that are effective on other types of constraints, e.g., cardinality constraints and XORs, are less well studied; a general approach to handling non-CNF constraints is still lacking. In addition, previous work indicated that for specific classes of benchmarks, the running time of extant SAT solvers depends heavily on properties of the formula and details of encoding, instead of the scale of the benchmarks, which adds uncertainty to expectations of running time. To address the issues above, we design FourierSAT, an incomplete SAT solver based on Fourier analysis of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. Empirical results demonstrate that FourierSAT is more robust than other solvers on certain classes of benchmarks
```
		@article{kyrillidis2019fouriersat,
title={FourierSAT: A Fourier Expansion-Based Algebraic Framework for Solving Hybrid Boolean Constraints},
author={Kyrillidis, Anastasios and Shrivastava, Anshumali and Vardi, Moshe Y and Zhang, Zhiwei},
journal={arXiv preprint arXiv:1912.01032},
year={2019}
}
```
Jacky Y. Zhang, Rajiv Khanna, Anastasios Kyrillidis, Oluwasanmi O. Koyejo, ``Learning Sparse Distributions using Iterative Hard Thresholding”, Advances in Neural Information Processing Systems (NIPS-2019), 2019.

Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state of the art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions, while satisfying the simplex constraint, and investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and practice, that IHT can achieve state of the art results for learning sparse distributions.
```
		@inproceedings{zhang2019learning,
title={Learning Sparse Distributions using Iterative Hard Thresholding},
author={Zhang, Jacky Y and Khanna, Rajiv and Kyrillidis, Anastasios and Koyejo, Oluwasanmi O},
booktitle={Advances in Neural Information Processing Systems},
pages={6757--6766},
year={2019}
}
```
Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava, ``Compressing Gradient Optimizers via Count-Sketches”, International Conference on Machine Learning (ICML-19), 2019.

Many popular first-order optimization methods (eg, Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large-models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6 GB instead of 11.7 GB) by compressing the Adam optimizer in the Embedding and Softmax layers with negligible accuracy and performance loss.
```
		@article{spring2019compressing,
title={Compressing Gradient Optimizers via Count-Sketches},
author={Spring, Ryan and Kyrillidis, Anastasios and Mohan, Vijai and Shrivastava, Anshumali},
journal={arXiv preprint arXiv:1902.00179},
year={2019}
}
```
Anastasios Kyrillidis, ``Simple and practical algorithms for $\ell_p$-norm low-rank approximation”, Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.

We propose practical algorithms for entrywise $\ell_p$-norm low-rank approximation, for $p = 1$ or $p = \infty$. The proposed framework, which is non-convex and gradient-based, is easy to implement and typically attains better approximations, faster, than state of the art. From a theoretical standpoint, we show that the proposed scheme can attain $(1 + \varepsilon)$-OPT approximations. Our algorithms are not hyperparameter-free: they achieve the desiderata only assuming algorithm's hyperparameters are known a priori---or are at least approximable. I.e., our theory indicates what problem quantities need to be known, in order to get a good solution within polynomial time, and does not contradict to recent inapproximabilty results.
```
		@article{kyrillidis2018simple,
				title={Simple and practical algorithms for $\ell_p$-norm low-rank approximation},
              author={Anastasios Kyrillidis},
              journal={Conference on Uncertainty in Artificial Intelligence (UAI-18)},
          	year={2018}
              }
```
Rajiv Khanna and Anastasios Kyrillidis, ``IHT dies hard: Provable accelerated Iterative Hard Thresholding”, Twenty-first International Conference on Artificial Intelligence and Statistics (AISTATS-18), 2018.

We study --both in theory and practice-- the use of momentum motions in classic iterative hard thresholding (IHT) methods. By simply modifying plain IHT, we investigate its convergence behavior on convex optimization criteria with non-convex constraints, under standard assumptions. In diverse scenaria, we observe that acceleration in IHT leads to significant improvements, compared to state of the art projected gradient descent and Frank-Wolfe variants. As a byproduct of our inspection, we study the impact of selecting the momentum parameter: similar to convex settings, two modes of behavior are observed --"rippling" and linear-- depending on the level of momentum.
```
		@article{khanna2018IHT,
				title={IHT dies hard: Provable accelerated Iterative Hard Thresholding},
              author={Rajiv Khanna and Anastasios Kyrillidis},
              journal={Twenty-first International Conference on Artificial Intelligence and Statistics (AISTATS-18)},
          	year={2018}
              }
```
Tianyang Li, Liu Liu, Anastasios Kyrillidis, and Constantine Caramanis, ``Statistical inference using SGD”, Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

We present a novel method for frequentist statistical inference in $M$-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a practical perspective, our SGD-based inference procedure is a first order method, and is well-suited for large scale problems. To show its merits, we apply it to both synthetic and real datasets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.
```
		@article{li2018statistical,
				title={Statistical inference using SGD},
              author={Tianyang Li, Liu Liu, Anastasios Kyrillidis, and Constantine Caramanis},
              journal={Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)},
          	year={2018}
              }
```
Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi, ``Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach”, AI & Statistics Conference (AISTATS-17), 2017.

We consider the non-square matrix sensing problem, under restricted isometry property (RIP) assumptions. We focus on the non-convex formulation, where any rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ is represented as $UV^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In this paper, we complement recent findings on the non-convex geometry of the analogous PSD setting [5], and show that matrix factorization does not introduce any spurious local minima, under RIP.
```
		@article{park2017non,
				title={Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach},
              author={Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi},
              journal={AI & Statistics Conference (AISTATS)},
          	year={2017}
              }
```
Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi, ``Dropping convexity for faster semi-definite optimization”, Conference on Learning Theory (COLT-16), 2016.

We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r\leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$. We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees. FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
```
		@article{bhojanapalli2016dropping,
				title={Dropping convexity for faster semi-definite optimization},
              author={Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi},
              journal={Conference on Learning Theory (COLT)},
          	year={2016}
              }
```
Megasthenis Asteris, Anastasios Kyrillidis, Oluwasanmi Koyejo, and Russell Poldrack, ``A simple and provable algorithm for sparse diagonal CCA”, International Conference on Machine Learning (ICML-16), 2016.

Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement, and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.
```
		@article{asterix2016simple,
			title={A simple and provable algorithm for sparse diagonal CCA},
          author={Asterix, Megasthenis and Kyrillidis, Anastasios and Koyejo, Oluwasanmi and Poldrack, Russell},
          journal={arXiv preprint arXiv:1605.08961},
          year={2016}
      }
```
Anastasios Kyrillidis, Bubacarr Bah, Rouzbeh Hasheminezhad, Quoc Tran-Dinh, Luca Baldassarre, and Volkan Cevher, ``Convex block-sparse linear regression with expanders, provably”, AI & Statistics Conference (AISTATS-16), 2016.

Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used, in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and run-time. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have been only studied from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches under such setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while assuring that solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null-space property. Our experimental findings on synthetic and real applications support our claims for faster recovery in the convex setting -- as opposed to using dense sensing matrices, while showing a competitive recovery performance.
```
		@article{kyrillidis2016convex,
			title={Convex block-sparse linear regression with expanders--provably},
          author={Kyrillidis, Anastasios and Bah, Bubacarr and Hasheminezhad, Rouzbeh and Tran-Dinh, Quoc and Baldassarre, Luca and Cevher, Volkan},
          journal={arXiv preprint arXiv:1603.06313},
          year={2016}
      }
```
Hemant Tyagi, Anastasios Kyrillidis, Bernd Gartner, and Andreas Krause, ``Learning sparse additive models with interactions in high dimensions”, AI & Statistics Conference (AISTATS-16), 2016.

A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is referred to as a Sparse Additive Model (SPAM), if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming $\phi_l$'s and $\mathcal{S}$ to be unknown, the problem of estimating $f$ from its samples has been studied extensively. In this work, we consider a generalized SPAM, allowing for *second order* interaction terms. For some $\mathcal{S}_1 \subset [d], \mathcal{S}_2 \subset {[d] \choose 2}$, the function $f$ is assumed to be of the form: $f(\mathbf{x}) = \sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l_1, l_2) \in \mathcal{S}_2}\phi_{(l_1, l_2)} \mathbf{x}_{(l_1, l_2)}.$ Assuming $\phi_{p},\phi_{(l_1, l_2)}$, $\mathcal{S}_1$ and, $\mathcal{S}_2$ to be unknown, we provide a randomized algorithm that queries $f$ and *exactly recovers* $\mathcal{S}_1,\mathcal{S}_2$. Consequently, this also enables us to estimate the underlying $\phi_p, \phi_{(l_1, l_2)}$. We derive sample complexity bounds for our scheme and also extend our analysis to include the situation where the queries are corrupted with noise -- either stochastic, or arbitrary but bounded. Lastly, we provide simulation results on synthetic data, that validate our theoretical findings.
```
		@article{kyrillidis2016convex,
			title={Learning Sparse Additive Models with Interactions in High Dimensions},
          author={Tyagi, Hemant and Kyrillidis, Anastasios and Gartner, Bernd and Krause, Andreas},
          journal={arXiv preprint arXiv:1604.05307},
          year={2016}
      }
```
Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, and Alex Dimakis, ``Bipartite correlation clustering: Maximizing agreements”, AI & Statistics Conference (AISTATS-16), 2016.

In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph G with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for k-BCC, a variant of BCC with an upper bound k on the number of clusters. Our algorithm outputs a k-clustering that provably achieves a number of agreements within a multiplicative (1−δ)-factor from the optimal, for any desired accuracy δ. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of G. It runs in time exponential in k and δ−1, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an (1−δ)-approximation can be achieved by O(δ−1) clusters regardless of the size of the graph. In turn, our k-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.
```
		@article{asteris2016bipartite,
			title={Bipartite Correlation Clustering--Maximizing Agreements},
			author={Asteris, Megasthenis and Kyrillidis, Anastasios and Papailiopoulos, Dimitris and Dimakis, Alexandros G},
			journal={arXiv preprint arXiv:1603.02782},
			year={2016}
      }
```
Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, and Alex Dimakis, ``Space PCA via bipartite matchings”, Neural Information Processing Systems (NIPS-15), 2015.

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to $1$ from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.
```
		@inproceedings{asteris2015sparse,
			title={Sparse pca via bipartite matchings},
			author={Asteris, Megasthenis and Papailiopoulos, Dimitris and Kyrillidis, Anastasios and Dimakis, Alexandros G},
			booktitle={Advances in Neural Information Processing Systems},
			pages={766--774},
			year={2015}
      }
```
Megasthenis Asteris, Anastasios Kyrillidis, Alex Dimakis, Han-Gyol Yi and Bharath Chandrasekaran, ``Stay on path: PCA along graph paths”, International Conference on Machine Learning (ICML-15), 2015.

We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph~$G$ on $p$ vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in $G$. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.
```
		@inproceedings{asteris2015stay,
			title={Stay on path: PCA along graph paths},
			author={Asteris, Megasthenis and Kyrillidis, Anastasios and Dimakis, EDU Alexandros G and Yi, Han-Gyol and Chandrasekaran, Bharath},
			booktitle={Proceedings of the 32nd International Conference on Machine Learning (ICML-15)},
			volume={37},
			year={2015}
      }	
```
Michail Vlachos, Francesco Fusco, Harry Mavroforakis, Anastasios Kyrillidis and Vassilis Vasileiadis, ``Improving Co-Cluster Quality with Application to Product Recommendations”, ACM CIKM International Conference on Information and Knowledge Management, 2014.

Businesses store an ever increasing amount of historical customer sales data. Given the availability of such information, it is advantageous to analyze past sales, both for revealing dominant buying patterns, and for providing more targeted recommendations to clients. In this context, co-clustering has proved to be an important datamodeling primitive for revealing latent connections between two sets of entities, such as customers and products. In this work, we introduce a new algorithm for co-clustering that is both scalable and highly resilient to noise. Our method is inspired by $k$-Means and agglomerative hierarchical clustering approaches: $(i)$ first it searches for elementary co-clustering structures and $(ii)$ then combines them into a better, more compact, solution. The algorithm is flexible as it does not require an explicit number of co-clusters as input, and is directly applicable on large data graphs. We apply our methodology on real sales data to analyze and visualize the connections between clients and products. We showcase a real deployment of the system, and how it has been used for driving a recommendation engine. Finally, we demonstrate that the new methodology can discover co-clusters of better quality and relevance than state-of-the-art co-clustering techniques.
```
		@inproceedings{vlachos2014improving,
			title={Improving co-cluster quality with application to product recommendations},
          author={Vlachos, Michail and Fusco, Francesco and Mavroforakis, Charalambos and Kyrillidis, Anastasios and Vassiliadis, Vassilios G},
			booktitle={Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management},
			pages={679--688},
			year={2014},
			organization={ACM}
      }	
```
Dimitris Papailiopoulos, Anastasios Kyrillidis and Christos Boutsidis, ``Provable deterministic leverage scores sampling”, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-14), 2014.

We explain theoretically a curious empirical phenomenon: “Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate”. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which many times matches or outperforms the state-of-the-art techniques.
```
		@inproceedings{papailiopoulos2014provable,
			title={Provable deterministic leverage score sampling},
			author={Papailiopoulos, Dimitris and Kyrillidis, Anastasios and Boutsidis, Christos},
			booktitle={Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining},
			pages={997--1006},
			year={2014},
			organization={ACM}
      }
```
Anastasios Kyrillidis, Rabeeh Karimi Mahabadi, Quoc Tran-Dinh and Volkan Cevher, ``Scalable sparse covariance estimation via self-concordance”, AAAI Conference on Artificial Intelligence (AAAI-14), 2014.

We explain theoretically a curious empirical phenomenon: “Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate”. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which many times matches or outperforms the state-of-the-art techniques.
```
		@inproceedings{kyrillidis2014scalable,
			title={Scalable Sparse Covariance Estimation via Self-Concordance},
			author={Kyrillidis, Anastasios and Mahabadi, Rabeeh Karimi and Dinh, Quoc Tran and Cevher, Volkan},
			booktitle={Twenty-Eighth AAAI Conference on Artificial Intelligence},
			year={2014}
      }
```
Anastasios Kyrillidis, Michail Vlachos and Anastasios Zouzias, ``Approximate matrix multiplication with application to linear embeddings”, IEEE ISIT Symposium, 2014.

In this paper, we study the problem of approximately computing the product of two real matrices. In particular, we analyze a dimensionality-reduction-based approximation algorithm due to Sarlos, introducing the notion of nuclear rank as the ratio of the nuclear norm over the spectral norm. The presented bound has improved dependence with respect to the approximation error (as compared to previous approaches), whereas the subspace – on which we project the input matrices – has dimensions proportional to the maximum of their nuclear rank and it is independent of the input dimensions. In addition, we provide an application of this result to linear low-dimensional embeddings. Namely, we show that any Euclidean point-set with bounded nuclear rank is amenable to projection onto number of dimensions that is independent of the input dimensionality, while achieving additive error guarantees.
```
		@inproceedings{kyrillidis2014approximate,
			title={Approximate matrix multiplication with application to linear embeddings},
			author={Kyrillidis, Anastasios and Vlachos, Michail and Zouzias, Anastasios},
			booktitle={2014 IEEE International Symposium on Information Theory},
			pages={2182--2186},
			year={2014},
			organization={IEEE}
      }
```
Anastasios Kyrillidis and Anastasios Zouzias, ``Non-uniform feature sampling in decision tree ensembles”, IEEE ICASSP, 2014.

We study the effectiveness of non-uniform randomized feature selection in decision tree classification. We experimentally evaluate two feature selection methodologies, based on information extracted from the provided dataset: $(i)$ leverage scores-based and $(ii)$ norm-based feature selection. Experimental evaluation of the proposed feature selection techniques indicate that such approaches might be more effective compared to naive uniform feature selection and moreover having comparable performance to the random forest algorithm.
```
		@inproceedings{kyrillidis2014non,
			title={Non-uniform feature sampling for decision tree ensembles},
			author={Kyrillidis, Anastasios and Zouzias, Anastasios},
			booktitle={2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
			pages={4548--4552},
			year={2014},
			organization={IEEE}
      }
```
George Skoumas, Dieter Pfoser and Anastasios Kyrillidis, ``On quantifying qualitative geospatial data: A probabilistic approach”, ACM GEOCROWD, 2013.

Living in the era of data deluge, we have witnessed a web content explosion, largely due to the massive availability of User-Generated Content (UGC). In this work, we specifically consider the problem of geospatial information extraction and representation, where one can exploit diverse sources of information (such as image and audio data, text data, etc), going beyond traditional volunteered geographic information. Our ambition is to include available narrative information in an effort to better explain geospatial relationships: with spatial reasoning being a basic form of human cognition, narratives expressing such experiences typically contain qualitative spatial data, i.e., spatial objects and spatial relationships. To this end, we formulate a quantitative approach for the representation of qualitative spatial relations extracted from UGC in the form of texts. The proposed method quantifies such relations based on multiple text observations. Such observations provide distance and orientation features which are utilized by a greedy Expectation Maximization-based (EM) algorithm to infer a probability distribution over predefined spatial relationships; the latter represent the quantified relationships under user-defined probabilistic assumptions. We evaluate the applicability and quality of the proposed approach using real UGC data originating from an actual travel blog text corpus. To verify the result quality, we generate grid-based “maps” visualizing the spatial extent of the various relations.
```
		@inproceedings{skoumas2013quantifying,
			title={On quantifying qualitative geospatial data: a probabilistic approach},
			author={Skoumas, Georgios and Pfoser, Dieter and Kyrillidis, Anastasios},
			booktitle={Proceedings of the Second ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information},
			pages={71--78},
			year={2013},
			organization={ACM}
      }
```
Stephen Becker, Volkan Cevher and Anastasios Kyrillidis, ``Randomized low-memory singular value projection”, 10th International Conference on Sampling Theory and Applications (SampTA), 2013. (Authors listed in alphabetical order.)

Affine rank minimization algorithms typically rely on calculating the gradient of a data error followed by a singular value decomposition at every iteration. Because these two steps are expensive, heuristic approximations are often used to reduce computational burden. To this end, we propose a recovery scheme that merges the two steps with randomized approximations, and as a result, operates on space proportional to the degrees of freedom in the problem. We theoretically establish the estimation guarantees of the algorithm as a function of approximation tolerance. While the theoretical approximation requirements are overly pessimistic, we demonstrate that in practice the algorithm performs well on the quantum tomography recovery problem.
```
		@inproceedings{becker2013randomized,
			title={Randomized Low-Memory Singular Value Projection},
			author={Becker, Stephen and Cevher, Volkan and Kyrillidis, Anastasios},
			booktitle={10th International Conference on Sampling Theory and Applications (Sampta)},
			year={2013}
      }
```
Anastasios Kyrillidis, Stephen Becker, Volkan Cevher and Christoph Koch, ``Sparse projections onto the simplex”, International Conference on Machine Learning (ICML-13), 2013.

Most learning methods with rank or sparsity constraints use convex relaxations, which lead to optimization with the nuclear norm or the $\ell_1$-norm. However, several important learning applications cannot benefit from this approach as they feature these convex norms as constraints in addition to the non-convex rank and sparsity constraints. In this setting, we derive efficient sparse projections onto the simplex and its extension, and illustrate how to use them to solve high-dimensional learning problems in quantum tomography, sparse density estimation and portfolio selection with non-convex constraints.
```
		@inproceedings{kyrillidis2013sparse,
			title={Sparse projections onto the simplex},
			author={Kyrillidis, Anastasios and Becker, Stephen and Cevher, Volkan and Koch, Christoph},
			booktitle={Proceedings of The 30th International Conference on Machine Learning},
			pages={235--243},
			year={2013}
          }
```
Quoc Tran Dinh, Anastasios Kyrillidis and Volkan Cevher, ``A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions”, International Conference on Machine Learning (ICML-13), 2013.

We propose an algorithmic framework for convex minimization problems of composite functions with two terms: a self-concordant part and a possibly nonsmooth regularization part. Our method is a new proximal Newton algorithm with local quadratic convergence rate. As a specific problem instance, we consider sparse precision matrix estimation problems in graph learning. Via a careful dual formulation and a novel analytic stepsize selection, we instantiate an algorithm within our framework for graph learning that avoids Cholesky decompositions and matrix inversions, making it attractive for parallel and distributed implementations.
```
		@inproceedings{dinh2013proximal,
			title={A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions},
			author={Dinh, Quoc T and Kyrillidis, Anastasios and Cevher, Volkan},
			booktitle={Proceedings of the 30th International Conference on Machine Learning (ICML-13)},
			pages={271--279},
			year={2013}
              }
```
Anastasios Kyrillidis and Volkan Cevher, ``Fast proximal algorithms for self-concordant minimization with application to sparse graph selection”, IEEE ICASSP, 2013.

The convex $\ell_1$-regularized log det divergence criterion has been shown to produce theoretically consistent graph learning. However, this objective function is challenging since the $\ell_1$-regularization is nonsmooth, the log det objective is not globally Lipschitz gradient function, and the problem is high-dimensional. Using the selfconcordant property of the objective, we propose a new adaptive step size selection and present the (F)PS ((F)ast Proximal algorithms for Self-concordant functions) algorithmic framework which has linear convergence and exhibits superior empirical results as compared to state-of-the-art first order methods.
```
		@inproceedings{kyrillidis2013fast,
			title={Fast proximal algorithms for self-concordant function minimization with application to sparse graph selection},
			author={Kyrillidis, Anastasios and Cevher, Volkan},
			booktitle={2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
			pages={6585--6589},
			year={2013},
			organization={IEEE}
          }
```
Anastasios Kyrillidis and Volkan Cevher, ``Matrix ALPS: Accelerated low rank and sparse matrix reconstruction”, IEEE SSP, 2012.

We propose Matrix ALPS for recovering a sparse plus low-rank decomposition of a matrix given its corrupted and incomplete linear measurements. Our approach is a first-order projected gradient method over non-convex sets, and it exploits a well-known memory-based acceleration technique. We theoretically characterize the convergence properties of Matrix ALPS using the stable embedding properties of the linear measurement operator. We then numerically illustrate that our algorithm outperforms the existing convex as well as non-convex state-of-the-art algorithms in computational efficiency without sacrificing stability.
```
		@inproceedings{kyrillidis2012matrix,
			title={Matrix {ALPS}: {A}ccelerated low rank and sparse matrix reconstruction},
			author={Kyrillidis, Anastasios and Cevher, Volkan},
			booktitle={2012 IEEE Statistical Signal Processing Workshop (SSP)},
			pages={185--188},
			year={2012},
 	 		organization={IEEE}
          }
```
Anastasios Kyrillidis and Volkan Cevher, ``Combinatorial selection and least absolute shrinkage via the CLASH algorithm”, IEEE ISIT, 2012.

The least absolute shrinkage and selection operator (LASSO) for linear regression exploits the geometric interplay of the $\ell_2$-data error objective and the $\ell_1$-norm constraint to arbitrarily select sparse models. Guiding this uninformed selection process with sparsity models has been precisely the center of attention over the last decade in order to improve learning performance. To this end, we alter the selection process of LASSO to explicitly leverage combinatorial sparsity models (CSMs) via the combinatorial selection and least absolute shrinkage (CLASH) operator. We provide concrete guidelines how to leverage combinatorial constraints within CLASH, and characterize CLASH’s guarantees as a function of the set restricted isometry constants of the sensing matrix. Finally, our experimental results show that CLASH can outperform both LASSO and model-based compressive sensing in sparse estimation.
```
		@inproceedings{kyrillidis2012combinatorial,
			title={Combinatorial selection and least absolute shrinkage via the {CLASH} algorithm},
			author={Kyrillidis, Anastasios and Cevher, Volkan},
			booktitle={Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on},
			pages={2216--2220},
			year={2012},
			organization={IEEE}
              }
```
Anastasios Kyrillidis, Gilles Puy and Volkan Cevher, ``Hard thresholding with norm constraints”, IEEE ICASSP, 2012.

We introduce a new sparse recovery paradigm, called Normed Pursuits, where efficient algorithms from combinatorial and convex optimization interface for interpretable and model-based solutions. Synthetic and real data experiments illustrate that Normed Pursuits can significantly enhance the performance of both hard thresholding methods and convex solvers in sparse recovery.
```
		@inproceedings{kyrillidis2012hard,
			title={Hard thresholding with norm constraints},
			author={Kyrillidis, Anastasios and Puy, Gilles and Cevher, Volkan},
			booktitle={2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
			pages={3645--3648},
			year={2012},
			organization={IEEE}
              }
```
Anastasios Kyrillidis and Volkan Cevher, ``Recipes on hard thresholding methods”, 4th IEEE CAMSAP, 2011.

Compressive sensing (CS) is a data acquisition and recovery technique for finding sparse solutions to linear inverse problems from sub-Nyquist measurements. CS features a wide range of computationally efficient and robust signal recovery methods, based on sparsity seeking optimization. In this paper, we present and analyze a class of sparse recovery algorithms, known as hard thresholding methods. We provide optimal strategies on how to set up these algorithms via basic “ingredients” for different configurations to achieve complexity vs. accuracy tradeoffs. Simulation results demonstrate notable performance improvements compared to state-of-the-art algorithms both in terms of data reconstruction and computational complexity.
```
		@inproceedings{kyrillidis2011recipes,
			title={Recipes on hard thresholding methods},
			author={Kyrillidis, Anastasios and Cevher, Volkan},
			booktitle={4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP)},
			year={2011}
              }
```
Anastasios Kyrillidis and George. N. Karystinos, ``Rank-deficient quadratic-form maximization over M-phase alphabet: Polynomial-complexity solvability and algorithmic developments”, IEEE ICASSP, 2011.

The maximization of a positive (semi)definite complex quadratic form over a finite alphabet is NP-hard and achieved through exhaustive search when the form has full rank. However, if the form is rank-deficient, the optimal solution can be computed with only polynomial complexity in the length $N$ of the maximizing vector. In this work, we consider the general case of a rank-$D$ positive (semi)definite complex quadratic form and develop a method that maximizes the form with respect to a $M$-phase vector with polynomial complexity. The proposed method efficiently reduces the size of the feasible set from exponential to polynomial. We also develop an algorithm that constructs the polynomial-size candidate set in polynomial time and observe that it is fully parallelizable and rank-scalable.
```
		@inproceedings{kyrillidis2011rank,
			title={Rank-deficient quadratic-form maximization over $M$-phase alphabet: Polynomial-complexity solvability and algorithmic developments},
			author={Kyrillidis, Anastasios and Karystinos, George},
			booktitle={2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
			pages={3856--3859},
			year={2011},
			organization={IEEE}
              }
```

Journals

Mahtab Alizadeh Vandchali, Fangshuo (Jasper) Liao, Anastasios Kyrillidis, ``One Rank at a Time: Cascading Error Dynamics in Sequential Learning”, Transactions on Machine Learning Research (TMLR), 2026 (accepted, in press).

Sequential learning — where complex tasks are broken down into simpler, hierarchical components — has emerged as a paradigm in AI. This paper views sequential learning through the lens of low-rank linear regression, focusing on how errors propagate when learning rank-1 subspaces sequentially. We present an analysis framework that decomposes the learning process into a series of rank-1 estimation problems, where each subsequent estimation depends on the accuracy of previous steps. Our aim is explanatory rather than comparative: we analyze error propagation and derive compute-allocation guidance without claiming superiority over joint or one-shot training. The main contribution is a characterization of error propagation in this sequential process, establishing bounds on how per-step numerical errors — caused by limited computational budgets and finite precision — affect overall model accuracy. We prove that these errors compound in predictable ways, with implications for both algorithmic design and stability guarantees. Synthetic linear-regression experiments validate the theory; exploratory LoRA-based vision (MNIST/CIFAR) and language (DistilBERT/SST-2) probes show the qualitative cascade pattern transfers across domains.
```
  @article{vandchali2026onerank,
title = {One Rank at a Time: Cascading Error Dynamics in Sequential Learning},
author = {Vandchali, Mahtab Alizadeh and Liao, Fangshuo and Kyrillidis, Anastasios},
journal = {Transactions on Machine Learning Research (TMLR)},
year = {2026},
note = {Accepted, in press. arXiv:2505.22602}
}
```
github.com/akyrillidis/ORAT

AI-OWLS blog post
Yuqian Huo, David A. Quiroga, Anastasios Kyrillidis, Tirthak Patel, ``Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with NEST”, Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS/SIGMETRICS), 9(3), 2025.

Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, NEST, does not just improve performance (i.e., help achieve close to optimal results), but also leads to faster convergence. We also use NEST to co-locate multiple VQAs concurrently on the same computer, thus increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.
```
  @article{huo2025nest,
title = {Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with {NEST},
author = {Huo, Yuqian and Quiroga, David A. and Kyrillidis, Anastasios and Patel, Tirthak},
journal = {Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume = {9},
number = {3},
pages = {65:1--65:26},
year = {2025},
doi = {10.1145/3771580}
}
```
Cameron R. Wolfe, Anastasios Kyrillidis, ``Better schedules for low precision training of deep neural networks”, Machine Learning, 113(6):3569-3587, Springer, 2024.

Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature. From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.
```
  @article{wolfe2024better,
title = {Better schedules for low precision training of deep neural networks},
author = {Wolfe, Cameron R. and Kyrillidis, Anastasios},
journal = {Machine Learning},
volume = {113},
number = {6},
pages = {3569--3587},
year = {2024},
publisher = {Springer},
doi = {10.1007/s10994-023-06480-0}
}
```
Junhyung Lyle Kim, Gauthier Gidel, Anastasios Kyrillidis, Fabian Pedregosa, ``When is Momentum Extragradient Optimal? A Polynomial-Based Analysis”, Transactions on Machine Learning Research (TMLR), 2024.

The extragradient method has gained popularity due to its robust convergence properties for differentiable games. Unlike single-objective optimization, game dynamics involve complex interactions reflected by the eigenvalues of the game vector field's Jacobian scattered across the complex plane. This complexity can cause the simple gradient method to diverge, even for bilinear games, while the extragradient method achieves convergence. Building on the recently proven accelerated convergence of the momentum extragradient method for bilinear games, we use a polynomial-based analysis to identify three distinct scenarios where this method exhibits further accelerated convergence. These scenarios encompass situations where the eigenvalues reside on the (positive) real line, lie on the real line alongside complex conjugates, or exist solely as complex conjugates. Furthermore, we derive the hyperparameters for each scenario that achieve the fastest convergence rate.
```
  @article{kim2024momentum,
title = {When is Momentum Extragradient Optimal? {A} Polynomial-Based Analysis},
author = {Kim, Junhyung Lyle and Gidel, Gauthier and Kyrillidis, Anastasios and Pedregosa, Fabian},
journal = {Transactions on Machine Learning Research},
year = {2024}
}
```
Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim, Anastasios Kyrillidis, ``How Much Pre-training Is Enough to Discover a Good Subnetwork?”, Transactions on Machine Learning Research (TMLR), 2024.

Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process -- pre-training, pruning, and re-training -- that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of such dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound in the number of gradient descent pre-training iterations on a two-layer, fully-connected network, beyond which pruning via greedy forward selection yields a subnetwork that achieves good training error. Interestingly, this threshold is shown to be logarithmically dependent upon the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. Lastly, we empirically validate our theoretical results on a multi-layer perceptron trained on MNIST.
```
  @article{wolfe2024pretraining,
title = {How Much Pre-training Is Enough to Discover a Good Subnetwork?},
author = {Wolfe, Cameron R. and Liao, Fangshuo and Wang, Qihan and Kim, Junhyung Lyle and Kyrillidis, Anastasios},
journal = {Transactions on Machine Learning Research},
year = {2024}
}
```
Junhyung Lyle Kim, George Kollias, Amir Kalev, Ken X Wei, Anastasios Kyrillidis, ``Fast quantum state reconstruction via accelerated non-convex programming”, Photonics, 10(2), MDPI, 2023

We propose a new quantum state reconstruction method that combines ideas from compressed sensing, non-convex optimization, and acceleration methods. The algorithm, called Momentum-Inspired Factored Gradient Descent (MiFGD), extends the applicability of quantum tomography for larger systems. Despite being a non-convex method, MiFGD converges provably close to the true density matrix at an accelerated linear rate asymptotically in the absence of experimental and statistical noise, under common assumptions. With this manuscript, we present the method, prove its convergence property and provide the Frobenius norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm performance with respect to other existing methods, in both synthetic and real (noisy) experiments, performed on the IBM’s quantum processing unit. We find that the proposed algorithm performs orders of magnitude faster than the state-of-the-art approaches, with similar or better accuracy. In both synthetic and real experiments, we observed accurate and robust reconstruction, despite the presence of experimental and statistical noise in the tomographic data. Finally, we provide a ready-to-use code for state tomography of multi-qubit systems.
```
  		
  @inproceedings{kim2023fast,
title={Fast quantum state reconstruction via accelerated non-convex programming},
author={Kim, Junhyung Lyle and Kollias, George and Kalev, Amir and Wei, Ken X and Kyrillidis, Anastasios},
booktitle={Photonics},
volume={10},
number={2},
pages={116},
year={2023},
organization={MDPI}
}
```
Cameron R Wolfe, Jingkang Yang, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, Anastasios Kyrillidis, ``GIST: Distributed Training for Large-Scale Graph Convolutional Network”, Journal of Applied and Computational Topology, Special issue on Data Science on Graphs, Springer, 2024.

The graph convolutional network (GCN) is a go-to solution for machine learning on graphs, but its training is notoriously difficult to scale both in terms of graph size and the number of model parameters. Although some work has explored training on large-scale graphs (e.g., GraphSAGE, ClusterGCN, etc.), we pioneer efficient training of large-scale GCN models (i.e., ultra-wide, overparameterized models) with the proposal of a novel, distributed training framework. Our proposed training methodology, called GIST, disjointly partitions the parameters of a GCN model into several, smaller sub-GCNs that are trained independently and in parallel. In addition to being compatible with all GCN architectures and existing sampling techniques for efficient GCN training, GIST i) improves model performance, ii) scales to training on arbitrarily large graphs, iii) decreases wall-clock training time, and iv) enables the training of markedly overparameterized GCN models. Remarkably, with GIST, we train an astonishgly-wide 32,768-dimensional GraphSAGE model, which exceeds the capacity of a single GPU by a factor of 8x, to SOTA performance on the Amazon2M dataset.
```
  		
  @article{wolfe2021gist,
title={GIST: Distributed training for large-scale graph convolutional networks},
author={Wolfe, Cameron R and Yang, Jingkang and Chowdhury, Arindam and Dun, Chen and Bayer, Artun and Segarra, Santiago and Kyrillidis, Anastasios},
journal={arXiv preprint arXiv:2102.10424},
year={2021}
}
```
Fangshuo Liao, Anastasios Kyrillidis, ``On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons”, Transactions on Machine Learning Research (TMLR), 2022.

With the motive of training all the parameters of a neural network, we study why and when one can achieve this by iteratively creating, training, and combining randomly selected subnetworks. Such scenarios have either implicitly or explicitly emerged in the recent literature: see e.g., the Dropout family of regularization techniques, or some distributed ML training protocols that reduce communication/computation complexities, such as the Independent Subnet Training protocol. While these methods are studied empirically and utilized in practice, they often enjoy partial or no theoretical support, especially when applied on neural network-based objectives. In this manuscript, our focus is on overparameterized single hidden layer neural networks with ReLU activations in the lazy training regime. By carefully analyzing i) the subnetworks’ neural tangent kernel, ii) the surrogate functions’ gradient, and iii) how we sample and combine the surrogate functions, we prove linear convergence rate of the training error –up to a neighborhood around the optimal point– for an overparameterized single-hidden layer perceptron with a regression loss. Our analysis reveals a dependency of the size of the neighborhood around the optimal point on the number of surrogate models and the number of local training steps for each selected subnetwork. Moreover, the considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide convergence results as corollaries of our main theorem.
```
 		@article{liao2021convergence,
title={On the convergence of shallow neural network training with randomly masked neurons},
author={Liao, Fangshuo and Kyrillidis, Anastasios},
journal={arXiv preprint arXiv:2112.02668},
year={2021}
}
 
```
- Junhyung Lyle Kim, Mohammad Taha Toghani, Cesar A. Uribe, Anastasios Kyrillidis, ``Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography”, IEEE Control Systems Letters (L-CSS), 2022.
We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterizing the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sub-linear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.
```
 		@article{kim2022local,
title={Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography},
author={Kim, Junhyung Lyle and Toghani, Mohammad Taha and Uribe, C{\'e}sar A and Kyrillidis, Anastasios},
journal={arXiv preprint arXiv:2203.11579},
year={2022}
}
 
```
- Anastasios Kyrillidis, Anshumali Shrivastava, Moshe Y Vardi, Zhiwei Zhang, ``Solving hybrid Boolean constraints in continuous space via multilinear Fourier expansions”, Artificial Intelligence, Elsevier, 2021.
The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side—especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers—has been remarkable. Yet, while SAT solvers, aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF), have become quite mature, SAT solvers that are effective on other types of constraints (e.g., cardinality constraints and XORs) are less well-studied; a general approach to handling non-CNF constraints is still lacking. To address the issue above, we design FourierSAT,1 an incomplete SAT solver based on Fourier Analysis (also known as Walsh-Fourier Transform) of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. We show this reduction enjoys interesting theoretical properties. Empirical results demonstrate that FourierSAT can be a useful complement to other solvers on certain classes of benchmarks.
```
 		@article{kyrillidis2021solving,
title={Solving hybrid Boolean constraints in continuous space via multilinear Fourier expansions},
author={Kyrillidis, Anastasios and Shrivastava, Anshumali and Vardi, Moshe Y and Zhang, Zhiwei},
journal={Artificial Intelligence},
volume={299},
pages={103559},
year={2021},
publisher={Elsevier}
}
 
```
Amir Kalev, Anastasios Kyrillidis, and Norbert M Linke, ``Validating and certifying stabilizer states”, Physical Review A, 2019.

We propose a measurement scheme that validates the preparation of an $n$-qubit stabilizer state. The scheme involves a measurement of $n$ Pauli observables, a priori determined from the stabilizer state and which can be realized using single-qubit gates. Based on the proposed validation scheme, we derive an explicit expression for the worst-case fidelity, i.e., the minimum fidelity between the stabilizer state and any other state consistent with the measured data. We also show that the worst-case fidelity can be certified, with high probability, using $O(n^2)$ copies of the state.
```
 		@article{PhysRevA.99.042337,
			title = {Validating and certifying stabilizer states},
			author = {Kalev, Amir and Kyrillidis, Anastasios and Linke, Norbert M.},
			journal = {Phys. Rev. A},
			volume = {99},
			issue = {4},
			pages = {042337},
			numpages = {6},
			year = {2019},
			month = {Apr},
			publisher = {American Physical Society}
      } 
 
```
Ya-Ping Hsieh, Yu-Chun Kao, Rabeeh Karimi Mahabadi, Yurtsever Alp, Anastasios Kyrillidis, Volkan Cevher, ``A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization”, IEEE Transactions on Signal Processing, 2018.

We study convex optimization problems that feature low-rank matrix solutions. In such scenarios, non-convex methods offer significant advantages over convex methods due to their lower space complexity, as well as practical faster convergence. Under mild assumptions, these methods feature global convergence guarantees. In this paper, we extend the results on this matter by following a different path. We derive a non-Euclidean optimization framework in the non-convex setting that takes nonlinear gradient steps on the factors. Our framework enables the possibility to further exploit the underlying problem structures, such as sparsity or low-rankness on the factorized domain, or better dimensional dependence of the smoothness parameters of the objectives. We prove that the non-Euclidean methods enjoy the same rigorous guarantees as their Euclidean counterparts under appropriate assumptions. Numerical evidence with Fourier ptychography and FastText applications, using real data, shows that our approach can enhance solution quality, as well as convergence speed over the standard non-convex approaches.
```
 		@article{hsieh2018non,
title={A non-Euclidean gradient descent framework for non-convex matrix factorization},
author={Hsieh, Ya-Ping and Kao, Yu-Chun and Mahabadi, Rabeeh Karimi and Yurtsever, Alp and Kyrillidis, Anastasios and Cevher, Volkan},
journal={IEEE Transactions on Signal Processing},
volume={66},
number={22},
pages={5917--5926},
year={2018},
publisher={IEEE}
}   
```
Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi, ``Finding low-rank solutions via non-convex matrix factorization, efficiently and provably”, SIAM Journal on Imaging Sciences, 2018.

A rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ can be written as a product $U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. One could exploit this observation in optimization: e.g., consider the minimization of a convex function $f(X)$ over rank-$r$ matrices, where the set of low-rank matrices is modeled via $UV^\top$. Though such parameterization reduces the number of variables and is more computationally efficient (of particular interest is the case $r \ll \min\{m, n\}$), it comes at a cost: $f(U V^\top)$ becomes a nonconvex function w.r.t. $U$ and $V$. We study such parameterization on generic convex objectives $f$ and focus on first-order, gradient descent algorithms. We propose the bifactored gradient descent (BFGD) algorithm, an efficient first-order method that operates directly on the $U, V$ factors. We show that when $f$ is (restricted) smooth, BFGD has local sublinear convergence; when $f$ is both (restricted) smooth and (restricted) strongly convex, it has local linear convergence. For several applications, we provide simple and efficient initialization schemes that provide initial conditions, good enough for the above convergence results to hold, globally. Extensive experimental results support our arguments that BFGD is an efficient and accurate nonconvex method, compared to state-of-the-art approaches.
```
 		@article{park2018finding,
title={Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably},
author={Park, Dohyung and Kyrillidis, Anastasios and Caramanis, Constantine and Sanghavi, Sujay},
journal={SIAM Journal on Imaging Sciences},
volume={11},
number={4},
pages={2165--2204},
year={2018},
publisher={SIAM}
}
 
```
Anastasios Kyrillidis, Amir Kalev, Dohyung Park, Srinadh Bhojanapalli, Constantine Caramanis and Sujay Sanghavi, ``Provable compressed sensing quantum state tomography via non-convex methods”, Nature Journal on Quantum Information, 2018.

With nowadays steadily growing quantum processors, it is required to develop new quantum tomography tools that are tailored for high-dimensional systems. In this work, we describe such a computational tool, based on recent ideas from non-convex optimization. The algorithm excels in the compressed sensing setting, where only a few data points are measured from a low-rank or highly-pure quantum state of a high-dimensional system. We show that the algorithm can practically be used in quantum tomography problems that are beyond the reach of convex solvers, and, moreover, is faster and more accurate than other state-of-the-art non-convex approaches. Crucially, we prove that, despite being a non-convex program, under mild conditions, the algorithm is guaranteed to converge to the global minimum of the quantum state tomography problem; thus, it constitutes a provable quantum state tomography protocol.
```
 		@article{kyrillidis2018provable,
			title={Provable compressed sensing quantum state tomography via non-convex methods},
			author={Kyrillidis, Anastasios and Kalev, Amir and Park, Dohyung and Bhojanapalli, Srinadh and Caramanis, Constantine and Sanghavi, Sujay},
			journal={npj Quantum Information},
			volume={4},
			number={1},
			pages={36},
			year={2018},
			publisher={Nature Publishing Group}
      }
 
```
Quoc Tran Dinh, Anastasios Kyrillidis, and Volkan Cevher, ``A single-phase, proximal path-following framework”, Mathematics of Operations Research, 2017.

We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear—and possibly non-smooth—objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we re-parameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. We prove that our scheme has the same $O(\sqrt{\nu} \log(1/\varepsilon))$ worst-case iteration-complexity with standard approaches [34, 38] without requiring an initial phase, where $\nu$ is the barrier parameter and $\varepsilon$ is a desired accuracy. Our framework allows errors in the calculation of proximal- Newton directions, without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.
```
 		@article{tran2016single,
			title={A single-phase, proximal path-following framework},
			author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan},
			journal={arXiv preprint arXiv:1603.01681},
			year={2016}
      }
 
```
Hemant Tyagi, Anastasios Kyrillidis, Bernd Gartner, and Andreas Krause, ``Algorithms for Learning Sparse Additive Models with Interactions in High Dimensions”, Information and Inference: A journal of the IMA, 2017.

A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a Sparse Additive Model (SPAM), if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$ where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming $\phi$'s, $\mathcal{S}$ to be unknown, there exists extensive work for estimating $f$ from its samples. In this work, we consider a generalized version of SPAMs, that also allows for the presence of a sparse number of second order interaction terms. For some $\mathcal{S}_1 \subset [d], \mathcal{S}_2 \subset {[d] \choose 2}$, with $|\mathcal{S}_1| \ll d, |\mathcal{S}_2| \ll d^2$, the function $f$ is now assumed to be of the form: $\sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in \mathcal{S}_2}\phi_{(l,l^{\prime})} (x_l,x_{l^{\prime}})$. Assuming we have the freedom to query $f$ anywhere in its domain, we derive efficient algorithms that provably recover $\mathcal{S}_1,\mathcal{S}_2$ with finite sample bounds. Our analysis covers the noiseless setting where exact samples of $f$ are obtained, and also extends to the noisy setting where the queries are corrupted with noise. For the noisy setting in particular, we consider two noise models namely: i.i.d Gaussian noise and arbitrary but bounded noise. Our main methods for identification of $\mathcal{S}_2$ essentially rely on estimation of sparse Hessian matrices, for which we provide two novel compressed sensing based schemes. Once $\mathcal{S}_1, \mathcal{S}_2$ are known, we show how the individual components $\phi_p$, $\phi_{(l,l^{\prime})}$ can be estimated via additional queries of $f$, with uniform error bounds. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
```
 		@article{tyagi2017algorithms,
			title={Algorithms for Learning Sparse Additive Models with Interactions in High Dimensions},
			author={Tyagi, Hemant and Kyrillidis, Anastasios and G{\"a}rtner, Bernd and Krause, Andreas},
			journal={Information and Inference: A journal of the IMA},
			year={2017}
      }
 
```
Luca Baldassarre, Nirav Bhan, Volkan Cevher, Anastasios Kyrillidis and Siddhartha Satpathi, ``Group-sparse model selection: Hardness and relaxations”, IEEE Trans. on Information Theory, 2016. (Authors listed in alphabetical order.)

Group-based sparsity models are instrumental in linear and non-linear regression problems. The main premise of these models is the recovery of “interpretable” signals through the identification of their constituent groups, which can also provably translate in substantial savings in the number of measurements for linear models in compressive sensing. In this paper, we establish a combinatorial framework for group-model selection problems and highlight the underlying tractability issues. In particular, we show that the group-model selection problem is equivalent to the well-known NP-hard weighted maximum coverage problem (WMC). Leveraging a graph-based understanding of group models, we describe group structures which enable correct model selection in polynomial time via dynamic programming. Furthermore, we show that popular groups structures can be explained by linear inequalities involving totally unimodular matrices, which afford other polynomial time algorithms based on relaxations. We also present a generalization of the group-model that allows for within group sparsity, which can be used to model hierarchical sparsity. Finally, we study the Pareto frontier between approximation error and sparsity budget of group- sparse approximations for two tractable models, among which the tree sparsity model, and illustrate selection and computation trade-offs between our framework and the existing convex relaxations.
```
 		@article{baldassarre2013group,
			title={Group-sparse model selection: Hardness and relaxations},
			author={Baldassarre, Luca and Bhan, Nirav and Cevher, Volkan and Kyrillidis, Anastasios and Satpathi, Siddhartha},
			journal={arXiv preprint arXiv:1303.3207},
        	year={2013}
              }
 
```
Georgios Skoumas, Dieter Pfoser, Anastasios Kyrillidis and Timos Sellis, ``Location estimation using crowdsourced spatial relations”, ACM Transactions on Spatial Algorithms and Systems, vol. 2, issue 2, 2016.

The “crowd” has become a very important geospatial data provider. Specifically, nonexpert users have been providing a wealth of quantitative geospatial data (e.g., geotagged tweets or photos, online). With spatial reasoning being a basic form of human cognition, textual narratives expressing user travel experiences (e.g., travel blogs) would provide an even bigger source of geospatial data. Narratives typically contain qualitative geospatial data in the form of objects and spatial relations (e.g., “St. John’s church is to the North of the Acropolis museum.” The scope of this work is (i) to extract these spatial relations from textual narratives, (ii) to quantify (model) them, and (iii) to reason about object locations based only on the quantified spatial relations. We use information extraction methods to identify toponyms and spatial relations, and we formulate a quantitative approach based on distance and orientation features to represent the latter. Probability density functions (PDFs) for spatial relations are determined by means of a greedy expectation maximization (EM)-based algorithm. These PDFs are then used to estimate unknown object locations. Experiments using a text corpus harvested from travel blog sites establish the considerable location estimation accuracy of the proposed approach on synthetic and real-world scenarios.
```
 		@article{skoumas2016location,
 			author = {Skoumas, Georgios and Pfoser, Dieter and Kyrillidis, Anastasios and Sellis, Timos},
 			title = {Location Estimation Using Crowdsourced Spatial Relations},
 			journal = {ACM Trans. Spatial Algorithms Syst.},
 			volume = {2},
 			number = {2}, 		
          year = {2016},
 			pages = {5:1--5:23},
 			publisher = {ACM},
      } 
 
```
Quoc Tran Dinh, Anastasios Kyrillidis and Volkan Cevher, ``Composite self-concordant minimization”, Journal of Machine Learning Research, 16(Mar):371−416, 2015.

We propose a variable metric framework for minimizing the sum of a self-concordant function and a possibly non-smooth convex function, endowed with an easily computable proximal operator. We theoretically establish the convergence of our framework without relying on the usual Lipschitz gradient assumption on the smooth part. An important highlight of our work is a new set of analytic step-size selection and correction procedures based on the structure of the problem. We describe concrete algorithmic instances of our framework for several interesting applications and demonstrate them numerically on both synthetic and real data.
```
 		@article{tran2015composite,
		  	title={Composite Self-Concordant Minimization},
        	author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan},
      	journal={Journal of Machine Learning Research},
          volume={16},
          pages={371--416},
          year={2015}
      }
 
```
Michail Vlachos, Nikolaos Freris and Anastasios Kyrillidis, ``Compressive mining: fast and optimal data mining in the compressed domain”, Very Large Data Bases (VLDB) Journal, Volume 24 Issue 1, February 2015.

Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier and wavelets). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the Euclidean norm or the correlation. We show that typical data analysis operations, such as nearest-neighbor search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
```
 		@article{vlachos2015compressive,
			title={Compressive mining: fast and optimal data mining in the compressed domain},
			author={Vlachos, Michail and Freris, Nikolaos M and Kyrillidis, Anastasios},
          journal={The VLDB Journal—The International Journal on Very Large Data Bases},
			volume={24},
			number={1},
			pages={1--24},
			year={2015},
			publisher={Springer-Verlag New York, Inc.}
      }
 
```
Quoc Tran-Dinh, Anastasios Kyrillidis and Volkan Cevher, ``An inexact proximal path-following algorithm for constrained convex minimization”, SIAM Journal on Optimization (SIOPT), vol. 24, num. 4, p. 1718-1745, 2014.

Many scientific and engineering applications feature nonsmooth convex minimization problems over convex sets. In this paper, we address an important instance of this broad class where we assume that the nonsmooth objective is equipped with a tractable proximity operator and that the convex constraint set affords a self-concordant barrier. We provide a new joint treatment of proximal and self-concordant barrier concepts and illustrate that such problems can be efficiently solved, without the need of lifting the problem dimensions, as in disciplined convex optimization approach. We propose an inexact path-following algorithmic framework and theoretically characterize the worst-case analytical complexity of this framework when the proximal subproblems are solved inexactly. To show the merits of our framework, we apply its instances to both synthetic and real-world applications, where it shows advantages over standard interior point methods. As a byproduct, we describe how our framework can obtain points on the Pareto frontier of regularized problems with self-concordant objectives in a tuning free fashion.
```
 		@article{tran2014inexact,
			title={An inexact proximal path-following algorithm for constrained convex minimization},
          author={Tran-Dinh, Quoc and Kyrillidis, Anastasios and Cevher, Volkan},
 			journal={SIAM Journal on Optimization},
 			volume={24},
 			number={4},
 			pages={1718--1745},
 			year={2014},
 			publisher={SIAM}
      }
 
```
Anastasios Kyrillidis and George. N. Karystinos, ``Fixed-rank Rayleigh quotient maximization by an M-PSK sequence”, IEEE Trans. on Communications, Volume:62, Issue:3, pages 961-975, 2014.

Certain optimization problems in communication systems, such as limited-feedback constant-envelope beamforming or noncoherent M-ary phase-shift keying (MPSK) sequence detection, result in the maximization of a fixed-rank positive semidefinite quadratic form over the MPSK alphabet. This form is a special case of the Rayleigh quotient of a matrix and, in general, its maximization by an MPSK sequence is NP-hard. However, if the rank of the matrix is not a function of its size, then the optimal solution can be computed with polynomial complexity in the matrix size. In this work, we develop a new technique to efficiently solve this problem by utilizing auxiliary continuous-valued angles and partitioning the resulting continuous space of solutions into a polynomial-size set of regions, each of which corresponds to a distinct MPSK sequence. The sequence that maximizes the Rayleigh quotient is shown to belong to this polynomial-size set of sequences, thus efficiently reducing the size of the feasible set from exponential to polynomial. Based on this analysis, we also develop an algorithm that constructs this set in polynomial time and show that it is fully parallelizable, memory efficient, and rank scalable. The proposed algorithm compares favorably with other solvers for this problem that have appeared recently in the literature.
```
		@article{kyrillidis2014fixed,
			title={Fixed-rank Rayleigh quotient maximization by an MPSK sequence},
 			author={Kyrillidis, Anastasios and Karystinos, George N},
 			journal={Communications, IEEE Transactions on},
 			volume={62},
 			number={3},
 			pages={961--975},
 			year={2014},
 			publisher={IEEE}
      }
```
Anastasios Kyrillidis and Volkan Cevher, ``Matrix recipes for hard thresholding methods”, Journal of Mathematical Imaging and Vision (JMIV), April 2013, Springer.

In this paper, we present and analyze a new set of low-rank recovery algorithms for linear inverse problems within the class of hard thresholding methods. We provide strategies on how to set up these algorithms via basic ingredients for different configurations to achieve complexity vs. accuracy tradeoffs. Moreover, we study acceleration schemes via memory-based techniques and randomized, ϵ-approximate matrix projections to decrease the computational costs in the recovery process. For most of the configurations, we present theoretical analysis that guarantees convergence under mild problem conditions. Simulation results demonstrate notable performance improvements as compared to state-of-the-art algorithms both in terms of reconstruction accuracy and computational complexity.
```
		@article{kyrillidis2014matrix,
			title={Matrix recipes for hard thresholding methods},
 			author={Kyrillidis, Anastasios and Cevher, Volkan},
 			journal={Journal of mathematical imaging and vision},
 			volume={48},
 			number={2},
 			pages={235--265},
 			year={2014},
   		publisher={Springer}
      }
```
Nikolaos D. Sidiropoulos and Anastasios Kyrillidis, ``Multi-way compressed sensing for sparse low rank tensors”, IEEE Signal Processing Letters, 19(11):757-760, Oct. 2012.

For linear models, compressed sensing theory and methods enable recovery of sparse signals of interest from few measurements - in the order of the number of nonzero entries as opposed to the length of the signal of interest. Results of similar flavor have more recently emerged for bilinear models, but no results are available for multilinear models of tensor data. In this contribution, we consider compressed sensing for sparse and low-rank tensors. More specifically, we consider low-rank tensors synthesized as sums of outer products of sparse loading vectors, and a special class of linear dimensionality-reducing transformations that reduce each mode individually. We prove interesting `oracle' properties showing that it is possible to identify the uncompressed sparse loadings directly from the compressed tensor data. The proofs naturally suggest a two-step recovery process: fitting a low-rank model in compressed domain, followed by per-mode $\ell_0$ / $\ell_1$ de-compression. This two-step process is also appealing from a computational complexity and memory capacity point of view, especially for big tensor datasets.
```
		@article{sidiropoulos2012multi,
 			title={Multi-way compressed sensing for sparse low-rank tensors},
 			author={Sidiropoulos, Nicholas D and Kyrillidis, Anastasios},
 			journal={Signal Processing Letters, IEEE},
 			volume={19},
 			number={11},
 			pages={757--760},
 			year={2012},
 			publisher={IEEE}
      }
```

Book chapters

Volkan Cevher, Sina Jafarpour and Anastasios Kyrillidis, ``Linear inverse problems with norm and sparsity constraints”, in Practical Applications of Sparse Modeling, Sept. 2014, MIT Press, (Authors listed in alphabetical order).

We describe two nonconventional algorithms for linear regression, called GAME and CLASH. The salient characteristics of these approaches is that they exploit the convex $\ell_1$-ball and non-convex $\ell_0$-sparsity constraints jointly in sparse recovery. To establish the theoretical approximation guarantees of GAME and CLASH, we cover an interesting range of topics from game theory, convex and combinatorial optimization. We illustrate that these approaches lead to improved theoretical guarantees and empirical performance beyond convex and non-convex solvers alone.
```
		@article{cevher2014linear,
			title={Linear inverse problems with norm and sparsity constraints},
			author={Cevher, Volkan and Jafarpour, Sina and Kyrillidis, Anastasios},
			journal={Practical Applications of Sparse Modeling},
			pages={179},
			year={2014},
			publisher={MIT Press}
              }
```
Anastasios Kyrillidis, Luca Baldassarre, Marwa El-Halabi, Quoc Tran-Dinh and Volkan Cevher, ``Structured sparsity: discrete and convex approaches”, in Compressed sensing and its application, Springer, 2014.

Compressive sensing (CS) exploits sparsity to recover sparse or compressible signals from dimensionality reducing, non-adaptive sensing mechanisms. Sparsity is also used to enhance interpretability in machine learning and statistics applications: While the ambient dimension is vast in modern data analysis problems, the relevant information therein typically resides in a much lower dimensional space. However, many solutions proposed nowadays do not leverage the true underlying structure. Recent results in CS extend the simple sparsity idea to more sophisticated structured sparsity models, which describe the interdependency between the nonzero components of a signal, allowing to increase the interpretability of the results and lead to better recovery performance. In order to better understand the impact of structured sparsity, in this chapter we analyze the connections between the discrete models and their convex relaxations, highlighting their relative advantages. We start with the general group sparse model and then elaborate on two important special cases: the dispersive and the hierarchical models. For each, we present the models in their discrete nature, discuss how to solve the ensuing discrete problems and then describe convex relaxations. We also consider more general structures as defined by set functions and present their convex proxies. Further, we discuss efficient optimization solutions for structured sparsity problems and illustrate structured sparsity in action via three applications.
```
		@incollection{kyrillidis2015structured,
			title={Structured sparsity: Discrete and convex approaches},
			author={Kyrillidis, Anastasios and Baldassarre, Luca and El Halabi, Marwa and Tran-Dinh, Quoc and Cevher, Volkan},
			booktitle={Compressed Sensing and its Applications},
			pages={341--387},
			year={2015},
			publisher={Springer}
              }
```

Theses

Anastasios Kyrillidis, ``Rigorous optimization recipes for sparse and low rank inverse problems with applications in data sciences”, Ph.D. Thesis, School of Computer and Communications, EPFL, September 2014.
Anastasios Kyrillidis, ``Polynomial-complexity computation of the M-phase vector that maximizes a rank-deficient quadratic form”, M.Sc. Thesis, Dept. Electronic Engineering and Computer Science, Technical University of Crete, July 2010.