Overparameterized neural networks have led to both unexpected empirical success in deep learning, and new
techniques in analyzing neural network training. While
theoretical work in this field has led to a diverse set of new overparameterized neural network architectures and training algorithms, most efforts fall under the
following scenario: in each iteration, we perform a gradient-based update that involves all parameters of the
neural network in both the forward and backward propagation. Yet, advances in regularization techniques and in
computationally- and communication-efficient distributed training methods favor a different narrative: one would,
explicitly or implicitly, iteratively train smaller, randomly selected models within a large model. This raises the following question:
“Can one meaningfully train an overparameterized ML model by iteratively training
and combining together smaller versions of it?”
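To make the question concrete, the following toy sketch illustrates the scheme it refers to: sample smaller, randomly selected surrogates of one large parameter vector, train each surrogate for a few gradient steps, and merge the results back into the large model. The quadratic loss, the Bernoulli masks, and the averaging merge rule are illustrative assumptions, not the exact setting analyzed later.

```python
import numpy as np

# Toy sketch of "train smaller versions, then combine": repeatedly sample
# smaller, randomly selected surrogates of one large parameter vector, train
# each surrogate for a few gradient steps, and merge the results back.
# The quadratic loss and the averaging merge rule are illustrative placeholders.

rng = np.random.default_rng(0)
dim = 1000
theta = rng.standard_normal(dim)      # parameters of the large model
theta_star = np.zeros(dim)            # minimizer of the toy loss below

def masked_grad(theta_cur, mask):
    """Gradient of 0.5 * ||theta - theta_star||^2, restricted to active coordinates."""
    return (theta_cur - theta_star) * mask

eta, num_subnets, local_steps = 0.5, 4, 3
for _ in range(50):
    masks = rng.random((num_subnets, dim)) < 0.5   # random smaller models
    updates = []
    for mask in masks:
        local = theta.copy()
        for _ in range(local_steps):               # train the surrogate independently
            local -= eta * masked_grad(local, mask)
        updates.append(local - theta)
    theta += np.mean(updates, axis=0)              # combine the surrogates' progress
```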
Motivation and connection to existing methods.
The study of partial models/subnetworks that reside in a large dense network has drawn increasing attention.
- Dropout regularization. Dropout is a widely adopted technique against overfitting in deep learning. In each training
step, a random mask is generated from some pre-defined distribution and used to mask out part of the
neurons in the neural network. Later variants of dropout include drop-connect, multi-sample dropout, Gaussian dropout, and variational dropout. Here, we restrict our attention to vanilla dropout and multi-sample dropout.
Vanilla dropout corresponds to our framework if, in the latter, we sample only one mask per iteration
and let the subnetwork perform only one gradient descent update. Multi-sample dropout extends
vanilla dropout by sampling multiple masks per iteration (see the dropout sketch after this list). For regression tasks, our theoretical result
implies convergence guarantees for these two scenarios on a single-hidden-layer perceptron.
- Distributed ML training. Recent advances in distributed/model-parallel training have led to variants of
distributed gradient descent protocols. Yet, in these protocols all parameters are updated in every outer step, which can be inefficient
in both computation and communication, especially when the communication cost per round is high. The Independent
Subnetwork Training (IST) protocol goes one step further: IST splits the model vertically,
so that each machine contains all layers of the neural network, but with only a (non-overlapping) subset of
neurons active in each layer. Multiple local SGD steps can then be performed without the workers having
to communicate (see the IST sketch after this list). Methods in this line of work achieve higher communication efficiency with accuracy
comparable to centralized training. Yet, a theoretical
understanding of IST is currently missing. Our theoretical result implies convergence guarantees for IST on
a single-hidden-layer perceptron under the simplifying assumption that every worker has full data access, and
provides insight into how the number of compute nodes affects the performance of the overall protocol.
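The following sketch instantiates vanilla and multi-sample dropout within this framework on a single-hidden-layer ReLU perceptron: one or several Bernoulli masks are drawn per iteration, and a single gradient step is taken on the (averaged) masked gradient. The architecture, the 1/sqrt(m) scaling, the keep probability, and the step size are illustrative choices, not the exact setup of our analysis.

```python
import numpy as np

# Sketch of vanilla and multi-sample dropout as instances of the framework,
# on a single-hidden-layer ReLU perceptron f(x) = a^T (mask * relu(W x)) / sqrt(m).
# The architecture, scaling, keep probability p, and step size are illustrative.

rng = np.random.default_rng(0)
n, d, m, p = 64, 10, 512, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))       # trainable first layer
a = rng.choice([-1.0, 1.0], m)        # fixed second layer

def masked_grad(mask):
    """Squared-loss gradient w.r.t. W when only the masked neurons are active."""
    act = np.maximum(X @ W.T, 0.0) * mask
    err = act @ a / np.sqrt(m) - y
    return ((err[:, None] * (X @ W.T > 0) * mask) * a).T @ X / np.sqrt(m)

eta, num_masks = 0.01, 4              # num_masks = 1 recovers vanilla dropout
for _ in range(200):
    masks = rng.random((num_masks, m)) < p     # one Bernoulli(p) mask per sample
    # Multi-sample dropout: average the masked gradients, then take ONE update step.
    W -= eta * np.mean([masked_grad(mk) for mk in masks], axis=0)
```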
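Similarly, the IST sketch below shows one round of the protocol on the same model: the hidden neurons are partitioned (non-overlapping) across workers, each worker runs several local gradient steps on its subnetwork without communicating, and the disjoint pieces are then reassembled. As in our analysis, every worker is assumed to have full data access; the sizes, step counts, and reassembly rule are illustrative.

```python
import numpy as np

# Sketch of Independent Subnetwork Training (IST) on a single-hidden-layer model:
# hidden neurons are split (non-overlapping) across workers, each worker runs
# several local gradient steps on its subnetwork without communicating, and the
# disjoint pieces are reassembled once per round. Every worker is assumed to
# have full data access, mirroring the simplification above; sizes are illustrative.

rng = np.random.default_rng(1)
n, d, m = 64, 10, 512
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], m)

def masked_grad(W_cur, mask):
    """Squared-loss gradient w.r.t. W, restricted to this worker's neurons."""
    act = np.maximum(X @ W_cur.T, 0.0) * mask
    err = act @ a / np.sqrt(m) - y
    return ((err[:, None] * (X @ W_cur.T > 0) * mask) * a).T @ X / np.sqrt(m)

eta, num_workers, local_steps = 0.01, 4, 5
for _ in range(40):
    assignment = rng.integers(num_workers, size=m)    # categorical split of neurons
    W_new = np.empty_like(W)
    for s in range(num_workers):
        mask = (assignment == s)
        W_local = W.copy()
        for _ in range(local_steps):                  # local training, no communication
            W_local -= eta * masked_grad(W_local, mask)
        W_new[mask] = W_local[mask]                   # keep this worker's neurons
    W = W_new                                         # one communication round
```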
Contributions. This paper makes the following key contributions:
- We provide convergence rate guarantees for i) dropout regularization, ii) multi-sample dropout, and iii) multi-worker IST, for a regression task on a single-hidden-layer perceptron.
- We show that the NTK of the surrogate models stays close to the infinite-width NTK and thus remains positive definite. Consequently, training over surrogate models still enjoys linear convergence.
- For subnetworks defined by Bernoulli masks with a fixed distribution parameter, we show that the aggregated gradient in the first local step is a biased estimator of the desired gradient of the whole network, with the bias decreasing as the number of subnetworks grows. Moreover, all aggregated gradients during local training stay close to the aggregated gradient of the first local step. This finding leads to linear convergence of the above training framework, up to an error term, under Bernoulli masks.
- For masks sampled from a categorical distribution, we provide tight bounds i) on the average loss increase when sampling a subnetwork from the whole network, and ii) on the loss decrease when the independently trained subnetworks are combined into the whole model (see the sketch below contrasting the two mask models). This finding leads to linear convergence with a slightly different error term than in the Bernoulli mask scenario.
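The two mask models referred to above differ only in how neurons are assigned to subnetworks, as the following sketch illustrates (the sizes here are arbitrary):

```python
import numpy as np

# The two mask models used above, for m hidden neurons and S subnetworks.
rng = np.random.default_rng(0)
m, S, p = 8, 4, 0.5

# Bernoulli masks: each neuron is kept in each subnetwork independently with a
# fixed probability p, so subnetworks may overlap or leave some neurons unused.
bernoulli_masks = (rng.random((S, m)) < p).astype(int)

# Categorical masks: each neuron is assigned to exactly one of the S subnetworks,
# so the masks are non-overlapping and jointly cover the whole network.
assignment = rng.integers(S, size=m)
categorical_masks = np.stack([(assignment == s).astype(int) for s in range(S)])

print(bernoulli_masks.sum(axis=0))    # per-neuron counts can be anywhere in 0..S
print(categorical_masks.sum(axis=0))  # per-neuron counts are always exactly 1
```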
Main statement (informal). The following constitutes an informal statement of our main result; more details and results can be found in the extended version of the paper.
Consider the training scheme shown in the figure above. If the masks are generated from a Bernoulli or categorical distribution, then, under
a sufficiently large over-parameterization coefficient and a sufficiently small learning rate, training the large
model via surrogate subnetworks still converges linearly, up to a neighborhood around the optimal point.
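Schematically, this asserts a bound of the standard NTK form on the training loss of the full model; the contraction factor and the exact shape of the error term below are placeholders for the formal result rather than exact statements:
\[
  \mathbb{E}\big[\, L(W_t) \,\big] \;\le\; \Big(1 - \tfrac{\eta \lambda_0}{2}\Big)^{t} L(W_0) \;+\; \varepsilon,
\]
where \(W_t\) denotes the weights of the full model after \(t\) rounds, \(\eta\) is the learning rate, \(\lambda_0 > 0\) is the smallest eigenvalue of the infinite-width NTK, and \(\varepsilon\) is the error term induced by training over surrogate subnetworks, whose form differs between the Bernoulli and categorical mask scenarios.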