Landscape

One of our core research topics lies at the intersection of stochastic optimization and machine learning. This includes studying the behavior of stochastic optimization algorithms on non-convex landscapes such as those encountered in deep learning models, as well as designing and analyzing new stochastic optimization methods, with a special interest in applications in machine learning.

Our group has, for instance, worked on several enhancements of stochastic-gradient-based methods, such as adaptive step sizes, momentum, second-order information, and variance reduction techniques. A focal point of this research is developing a deeper theoretical understanding of these techniques and, building on it, theoretically grounded improvements.
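As a lightweight illustration of two of these ingredients, the sketch below implements a momentum step and an AdaGrad-style adaptive step in plain NumPy and runs one of them on a noisy quadratic. The function names, step sizes, and the toy objective are illustrative choices for this page, not taken from our papers.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD step with (heavy-ball) momentum: v <- beta*v + g, w <- w - lr*v."""
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style step: per-coordinate step sizes shrink with the accumulated squared gradients."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy usage: minimize f(w) = 0.5 * ||w||^2 from noisy gradient estimates.
rng = np.random.default_rng(0)
w = np.ones(5)
velocity, accum = np.zeros(5), np.zeros(5)
for _ in range(200):
    stochastic_grad = w + 0.1 * rng.standard_normal(5)  # exact gradient of 0.5*||w||^2 plus noise
    w, velocity = sgd_momentum_step(w, stochastic_grad, velocity)
    # adaptive variant instead: w, accum = adagrad_step(w, stochastic_grad, accum)
print(np.linalg.norm(w))  # small: the iterates hover around the optimum w* = 0
```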

Stochastic Gradient Descent

Studying how stochastic noise helps optimize complex non-convex functions has been an active area of research in machine learning. Prior work has shown that the noise of stochastic gradient descent improves optimization by overcoming undesirable obstacles in the landscape. Moreover, injecting artificial Gaussian noise has become a popular way to quickly escape saddle points. Indeed, in the absence of reliable gradient information, the noise is used to explore the landscape, but it is unclear what type of noise is optimal in terms of exploration ability. In recent work (Lucchi et al., 2022), we studied a general type of continuous-time non-Markovian process, based on fractional Brownian motion, that allows the increments of the process to be correlated. This generalizes processes based on Brownian motion, such as the Ornstein-Uhlenbeck process. We found that this type of noise has some especially interesting properties in terms of generalization (Orvieto et al., 2022).
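To make the notion of correlated noise concrete, the following sketch samples increments of fractional Brownian motion via the Cholesky method and compares their lag-one correlation for two Hurst indices. This is a self-contained illustration of correlated versus independent increments; the function name and parameter values are ours and are not the setup of (Lucchi et al., 2022).

```python
import numpy as np

def fbm_increments(n_steps, hurst, dt=1e-2, seed=0):
    """Sample increments of fractional Brownian motion with Hurst index H via the
    Cholesky method. For H = 0.5 this reduces to independent (Brownian) increments;
    for H != 0.5 the increments are correlated in time."""
    t = dt * np.arange(1, n_steps + 1)
    # Covariance of fBm: E[B_H(s) B_H(t)] = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H})
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n_steps))  # small jitter for numerical safety
    path = L @ np.random.default_rng(seed).standard_normal(n_steps)
    return np.diff(path, prepend=0.0)

# Increments are uncorrelated for H = 0.5 and positively correlated for H > 0.5.
for H in (0.5, 0.7):
    inc = fbm_increments(1000, H)
    corr = np.corrcoef(inc[:-1], inc[1:])[0, 1]
    print(f"H = {H}: lag-1 correlation of increments ~ {corr:.2f}")
```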

Neural Networks

Deep Learning has become a key technology for solving complex problems, such as beating humans at complex games (Silver et al., 2016), driving cars autonomously (Bojarski et al., 2016), or folding proteins (Senior et al., 2020). However, Deep Learning models are still not well understood from a theoretical perspective. Our goal is to deepen our understanding of such models using advanced mathematical tools from areas such as probability theory, random matrix theory, and optimization.

Batch Normalization

One of the most important recent innovations for optimizing deep neural networks is Batch Normalization (BN). This technique has been proven to successfully stabilize and accelerate the training of deep neural networks and is by now standard in many state-of-the-art architectures. Part of our research focuses on understanding the role of such normalization techniques (see for instance Kohler et al. and Daneshmand et al.).
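For reference, here is a minimal NumPy sketch of the BN forward pass in training mode: each feature is normalized over the mini-batch and then rescaled by learnable parameters. The function name and toy data are illustrative, and the running statistics used at inference time as well as the backward pass are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization (training mode): normalize each feature over the mini-batch,
    then apply the learnable affine map gamma * x_hat + beta."""
    mu = x.mean(axis=0)    # per-feature mean over the batch
    var = x.var(axis=0)    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: a batch of 32 pre-activations with 4 features.
rng = np.random.default_rng(0)
x = 5.0 + 2.0 * rng.standard_normal((32, 4))  # poorly scaled inputs
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```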

Transformers

Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has recently been shown that stacking self-attention layers, the distinctive architectural component of Transformers, can result in rank collapse of the tokens' representations at initialization. The question of whether and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In recent work (Noci et al., 2022), we shed new light on the causes and the effects of this phenomenon. First, we showed that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provided a thorough description of the origin of rank collapse and discussed how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries, keys and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive optimization methods (such as Adam (Kingma and Ba, 2014)) for optimizing Transformers. Our group is currently further investigating the reasons behind the practical superiority of these adaptive methods for training Transformers.
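The toy experiment below illustrates the rank-collapse phenomenon in a stripped-down setting: a stack of randomly initialized single-head self-attention layers (no residual connections, MLPs, or LayerNorm) drives the token representations towards a single shared vector, which we measure by the relative distance of the token matrix from the rank-one matrix with identical rows. This is a simplified illustration rather than the exact setup of Noci et al. (2022), and all names and hyperparameters are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, d_k, rng):
    """One randomly initialized single-head self-attention layer (no MLP, no LayerNorm)."""
    d = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax(X @ Wq @ (X @ Wk).T / np.sqrt(d_k))  # row-stochastic attention matrix
    return A @ X @ Wv

def rank_collapse_residual(X):
    """Relative distance of the token matrix from the rank-one matrix with identical rows,
    i.e. how close all token representations are to one shared vector."""
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) / np.linalg.norm(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))  # 16 tokens, width 32
depth = 12
for _ in range(depth):
    # Pure attention stack: the tokens quickly collapse towards their mean.
    # As discussed above, a depth-dependent scaling of the residual branches,
    # e.g. X + self_attention(X, ...) / sqrt(depth), counteracts this effect.
    X = self_attention(X, d_k=32, rng=rng)
print(f"relative residual after {depth} layers: {rank_collapse_residual(X):.3e}")
```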