Artificial Intelligence (AI) has recently revolutionised various fields of science and has also started to pervade commercial applications in an unprecedented manner. Despite great successes, most of AI’s enormous potential is still to be realised. The recent surge of AI can be attributed to advances in the machine learning field known as “Deep Learning”, that is, large deeply-layered artificial neural networks (ANNs) trained by modern learning algorithms on massive datasets. At its core, Deep Learning discovers multiple levels of distributed representations of the input, with higher levels representing more abstract concepts. These representations have led to impressive successes in different research areas. In particular, artificial neural networks have considerably improved performance in computer vision, speech recognition, and internet advertising.

Sepp Hochreiter, who heads this research group, is considered a pioneer of Deep Learning for his discovery of the vanishing gradient problem and his invention of long short-term memory (LSTM) networks.

#### Recent publications in Deep Learning:

Normalization is dead, long live normalization!

Hoedt, P.,
Hochreiter, S.,
and Klambauer, G.

*In ICLR Blog Track*
2022

Since the advent of Batch Normalization (BN), almost every state-of-the-art (SOTA) method uses some form of normalization. After all, normalization generally speeds up learning and leads to models that generalize better than their unnormalized counterparts. This turns out to be especially useful when using some form of skip connections, which are prominent in Residual Networks (ResNets), for example. However, Brock et al. (2021a) suggest that SOTA performance can also be achieved using ResNets without normalization!

Few-Shot Learning by Dimensionality Reduction in Gradient Space

Gauch, M.,
Beck, M.,
Adler, T.,
Kotsur, D.,
Fiel, S.,
Eghbal-zadeh, H.,
Brandstetter, J.,
Kofler, J.,
Holzleitner, M.,
Zellinger, W.,
Klotz, D.,
Hochreiter, S.,
and Lehner, S.

2022

We introduce SubGD, a novel few-shot learning method which is based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows the training error to be reduced by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the auto-correlation matrix of update directions across different tasks. Demonstrably, we can identify low-dimensional suitable subspaces for few-shot learning of dynamical systems, which have varying properties described by one or few parameters of the analytical system description. Such systems are ubiquitous among real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical systems problem settings, significantly outperforming popular few-shot learning methods both in terms of sample efficiency and performance.
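The subspace identification described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `fit_subspace` and `project_gradient`, the synthetic update directions, and the use of a plain orthogonal projection are all assumptions made here for illustration, and the update rule in the paper may differ.

```python
import numpy as np

def fit_subspace(update_directions, k):
    """Return the top-k eigenvectors (as columns) of the auto-correlation matrix."""
    updates = np.asarray(update_directions, dtype=float)  # (n_updates, n_params)
    autocorr = updates.T @ updates / updates.shape[0]     # (n_params, n_params)
    eigvals, eigvecs = np.linalg.eigh(autocorr)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]                 # indices of the k largest
    return eigvecs[:, order], eigvals[order]

def project_gradient(grad, basis):
    """Orthogonally project a gradient onto the span of the basis columns."""
    return basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
# Hypothetical data: update directions that lie almost entirely in a
# 2-dimensional subspace of a 10-dimensional parameter space.
true_basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]
updates = rng.normal(size=(200, 2)) @ true_basis.T + 1e-3 * rng.normal(size=(200, 10))

basis, eigvals = fit_subspace(updates, k=2)
grad = rng.normal(size=10)
projected = project_gradient(grad, basis)
print("leading eigenvalues:", eigvals)
```

In this toy setting the recovered basis spans nearly the same plane as `true_basis`, so a gradient step restricted to it only moves the parameters along directions that earlier updates actually used.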

Learning 3D Granular Flow Simulations

Mayr, A.,
Lehner, S.,
Mayrhofer, A.,
Kloss, C.,
Hochreiter, S.,
and Brandstetter, J.

2021

Recently, the application of machine learning models has gained momentum in natural sciences and engineering, which is a natural fit due to the abundance of data in these fields. However, the modeling of physical processes from simulation data without first principle solutions remains difficult. Here, we present a Graph Neural Networks approach towards accurate modeling of complex 3D granular flow simulation processes created by the discrete element method LIGGGHTS and concentrate on simulations of physical systems found in real-world applications like rotating drums and hoppers. We discuss how to implement Graph Neural Networks that deal with 3D objects, boundary conditions, particle-particle, and particle-boundary interactions such that an accurate modeling of relevant physical quantities is made possible. Finally, we compare the machine learning based trajectories to LIGGGHTS trajectories in terms of particle flows and mixing entropies.

Trusted Artificial Intelligence: Towards Certification of Machine Learning Applications

Winter, P.,
Eder, S.,
Weissenböck, J.,
Schwald, C.,
Doms, T.,
Vogt, T.,
Hochreiter, S.,
and Nessler, B.

2021

Artificial Intelligence is one of the fastest growing technologies of the 21st century and accompanies us in our daily lives when interacting with technical applications. However, reliance on such technical systems is crucial for their widespread applicability and acceptance. The societal tools to express reliance are usually formalized by lawful regulations, i.e., standards, norms, accreditations, and certificates. Therefore, the TÜV AUSTRIA Group, in cooperation with the Institute for Machine Learning at the Johannes Kepler University Linz, proposes a certification process and an audit catalog for Machine Learning applications. We are convinced that our approach can serve as the foundation for the certification of applications that use Machine Learning and Deep Learning, the techniques that drive the current revolution in Artificial Intelligence. While certain high-risk areas, such as fully autonomous robots in workspaces shared with humans, are still some time away from certification, we aim to cover low-risk applications with our certification procedure. Our holistic approach attempts to analyze Machine Learning applications from multiple perspectives to evaluate and verify the aspects of secure software development, functional requirements, data quality, data protection, and ethics. Inspired by existing work, we introduce four criticality levels to map the criticality of a Machine Learning application regarding the impact of its decisions on people, environment, and organizations. Currently, the audit catalog can be applied to low-risk applications within the scope of supervised learning as commonly encountered in industry. Guided by field experience, scientific developments, and market demands, the audit catalog will be extended and modified accordingly.

MC-LSTM: Mass-Conserving LSTM

Hoedt, P.,
Kratzert, F.,
Klotz, D.,
Halmich, C.,
Holzleitner, M.,
Nearing, G.,
Hochreiter, S.,
and Klambauer, G.

*In Proceedings of the 38th International Conference on Machine Learning*
2021

The success of Convolutional Neural Networks (CNNs) in computer vision is mainly driven by their strong inductive bias, which is strong enough to allow CNNs to solve vision-related tasks with random weights, meaning without learning. Similarly, Long Short-Term Memory (LSTM) has a strong inductive bias towards storing information over time. However, many real-world systems are governed by conservation laws, which lead to the redistribution of particular quantities, e.g. in physical and economic systems. Our novel Mass-Conserving LSTM (MC-LSTM) adheres to these conservation laws by extending the inductive bias of LSTM to model the redistribution of those stored quantities. MC-LSTMs set a new state-of-the-art for neural arithmetic units at learning arithmetic operations, such as addition tasks, which have a strong conservation law, as the sum is constant over time. Further, MC-LSTM is applied to traffic forecasting, modelling a pendulum, and a large benchmark dataset in hydrology, where it sets a new state-of-the-art for predicting peak flows. In the hydrology example, we show that MC-LSTM states correlate with real-world processes and are therefore interpretable.
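The conservation idea in this abstract can be illustrated with a simplified recurrent step, sketched below in NumPy. This is an assumption-laden illustration rather than the published architecture: the gate logits are random placeholders instead of learned functions of the input, and the function name `mass_conserving_step` is hypothetical. Mass is conserved because the redistribution matrix and the input gate are normalized to sum to one, so incoming mass is only moved between cells or released as output.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mass_conserving_step(c_prev, x_mass, r_logits, i_logits, o_logits):
    """One step that redistributes stored mass without creating or destroying it."""
    redistribution = softmax(r_logits, axis=0)  # every column sums to 1
    input_gate = softmax(i_logits, axis=0)      # entries sum to 1
    output_gate = sigmoid(o_logits)             # fraction of each cell that leaves
    total = redistribution @ c_prev + input_gate * x_mass
    outflow = output_gate * total
    c_next = (1.0 - output_gate) * total
    return c_next, outflow

rng = np.random.default_rng(0)
n_cells = 4
cells = np.zeros(n_cells)
mass_in, mass_out = 0.0, 0.0
for _ in range(25):
    x_mass = rng.uniform(0.0, 2.0)  # hypothetical mass input, e.g. rainfall
    cells, outflow = mass_conserving_step(
        cells,
        x_mass,
        rng.normal(size=(n_cells, n_cells)),  # placeholder gate logits
        rng.normal(size=n_cells),
        rng.normal(size=n_cells),
    )
    mass_in += x_mass
    mass_out += outflow.sum()

print("mass in:", mass_in, "stored + released:", cells.sum() + mass_out)
```

The two printed quantities agree up to floating-point error, which is the bookkeeping property the abstract refers to as a conservation law.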

DeepRC: Immune Repertoire Classification with Attention-Based Deep Massive Multiple Instance Learning

Widrich, M.,
Schäfl, B.,
Pavlović, M.,
Sandve, G.,
Hochreiter, S.,
Greiff, V.,
and Klambauer, G.

2020

High-throughput immunosequencing allows reconstructing the immune repertoire of an individual, which is a unique opportunity for new immunotherapies, immunodiagnostics, and vaccine design. Since immune repertoires are shaped by past and current immune events, such as infection and disease, and thus record an individual’s state of health, immune repertoire sequencing data may enable the prediction of health and disease using machine learning. However, finding the connections between an individual’s repertoire and the individual’s disease class, with potentially hundreds of thousands to millions of short sequences per individual, poses a difficult and unique challenge for machine learning methods. In this work, we present our method DeepRC that combines a Deep Learning architecture with attention-based multiple instance learning. To validate that DeepRC accurately predicts an individual’s disease class based on their immune repertoire and determines the associated class-specific sequence motifs, we applied DeepRC in four large-scale experiments encompassing ground-truth simulated as well as real-world virus infection data. We demonstrate that DeepRC outperforms all tested methods with respect to predictive performance and enables the extraction of those sequence motifs that are connected to a given disease class.
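To make the phrase "attention-based multiple instance learning" concrete, the following NumPy sketch pools a bag of instance feature vectors into one representation using softmax attention weights. It shows generic attention pooling only, not the DeepRC architecture: the function `attention_pool`, the `tanh` key projection, and the random features standing in for embedded sequences are all assumptions.

```python
import numpy as np

def attention_pool(instances, w_key, query):
    """Pool a bag of instances into one vector with softmax attention weights."""
    keys = np.tanh(instances @ w_key)   # (n_instances, n_key)
    scores = keys @ query               # one relevance score per instance
    scores = scores - scores.max()      # stabilize the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum()   # non-negative, sums to 1
    return weights @ instances, weights

rng = np.random.default_rng(0)
n_sequences, n_features, n_key = 1000, 16, 8
bag = rng.normal(size=(n_sequences, n_features))  # hypothetical sequence embeddings
w_key = rng.normal(size=(n_features, n_key))      # placeholder for learned weights
query = rng.normal(size=n_key)                    # placeholder for a learned query

pooled, weights = attention_pool(bag, w_key, query)
shuffled, _ = attention_pool(bag[rng.permutation(n_sequences)], w_key, query)
print("largest attention weight:", weights.max())
```

Because the pooled vector is a weighted sum, it does not depend on the order of the sequences in the bag, and the attention weights indicate which instances contributed most to the bag-level representation.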

Cross-Domain Few-Shot Learning by Representation Fusion

Adler, T.,
Brandstetter, J.,
Widrich, M.,
Mayr, A.,
Kreil, D.,
Kopp, M.,
Klambauer, G.,
and Hochreiter, S.

*arXiv preprint arXiv:2010.06498*
2020

In order to quickly adapt to new data, few-shot learning aims at learning from few examples, often by using already acquired knowledge. The new data often differs from the previously seen data due to a domain shift, that is, a change of the input-target distribution. While several methods perform well on small domain shifts like new target classes with similar inputs, larger domain shifts are still challenging. Large domain shifts may result in high-level concepts that are not shared between the original and the new domain. However, low-level concepts like edges in images might still be shared and useful. For cross-domain few-shot learning, we suggest representation fusion to unify different abstraction levels of a deep neural network into one representation. We propose Cross-domain Hebbian Ensemble Few-shot learning (CHEF), which achieves representation fusion by an ensemble of Hebbian learners acting on different layers of a deep neural network that was trained on the original domain. On the few-shot datasets miniImagenet and tieredImagenet, where the domain shift is small, CHEF is competitive with state-of-the-art methods. On cross-domain few-shot benchmark challenges with larger domain shifts, CHEF establishes novel state-of-the-art results in all categories. We further apply CHEF on a real-world cross-domain application in drug discovery. We consider a domain shift from bioactive molecules to environmental chemicals and drugs with twelve associated toxicity prediction tasks. On these tasks, which are highly relevant for computational drug discovery, CHEF significantly outperforms all its competitors.

First Order Generative Adversarial Networks

Seward, C.,
Unterthiner, T.,
Bergmann, U.,
Jetchev, N.,
and Hochreiter, S.

2018

GANs excel at learning high dimensional distributions, but they can update generator parameters in directions that do not correspond to the steepest descent direction of the objective. Prominent examples of problematic update directions include those used in both Goodfellow’s original GAN and the WGAN-GP. To formally describe an optimal update direction, we introduce a theoretical framework which allows the derivation of requirements on both the divergence and corresponding method for determining an update direction, with these requirements guaranteeing unbiased mini-batch updates in the direction of steepest descent. We propose a novel divergence which approximates the Wasserstein distance while regularizing the critic’s first order information. Together with an accompanying update direction, this divergence fulfills the requirements for unbiased steepest descent updates. We verify our method, the First Order GAN, with image generation on CelebA, LSUN and CIFAR-10 and set a new state of the art on the One Billion Word language generation task. Code to reproduce experiments is available.

Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields

Unterthiner, T.,
Nessler, B.,
Seward, C.,
Klambauer, G.,
Heusel, M.,
Ramsauer, H.,
and Hochreiter, S.

2018

Generative adversarial networks (GANs) evolved into one of the most successful unsupervised techniques for generating realistic images. Even though it has recently been shown that GAN training converges, GAN models often end up in local Nash equilibria that are associated with mode collapse or otherwise fail to model the target distribution. We introduce Coulomb GANs, which pose the GAN learning problem as a potential field of charged particles, where generated samples are attracted to training set samples but repel each other. The discriminator learns a potential field while the generator decreases the energy by moving its samples along the vector (force) field determined by the gradient of the potential field. Through decreasing the energy, the GAN model learns to generate samples according to the whole target distribution and does not only cover some of its modes. We prove that Coulomb GANs possess only one Nash equilibrium which is optimal in the sense that the model distribution equals the target distribution. We show the efficacy of Coulomb GANs on a variety of image datasets. On LSUN and CelebA, Coulomb GANs set a new state of the art and produce a previously unseen variety of different samples.

Self-Normalizing Neural Networks

Klambauer, G.,
Unterthiner, T.,
Mayr, A.,
and Hochreiter, S.

2017

Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and therefore cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation functions of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even in the presence of noise and perturbations. This convergence property of SNNs makes it possible to (1) train deep networks with many layers, (2) employ strong regularization, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance, so vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep. Implementations are available at: github.com/bioinf-jku/SNNs.
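The self-normalizing behaviour described in this abstract can be observed in a small experiment, sketched below in NumPy. The SELU constants are the commonly published values. The network width, the depth, the zero-mean Gaussian weight initialization with variance 1/width, and the deliberately unstandardized inputs are assumptions chosen for illustration. This is a toy check, not the reference implementation from the linked repository.

```python
import numpy as np

# Commonly published SELU constants.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    """SCALE * x for x > 0, and SCALE * ALPHA * (exp(x) - 1) otherwise."""
    negative_part = ALPHA * np.expm1(np.minimum(x, 0.0))
    return SCALE * np.where(x > 0, x, negative_part)

def propagate(x, n_layers, rng):
    """Push a batch through n_layers random dense layers with SELU activations."""
    width = x.shape[1]
    for _ in range(n_layers):
        # Zero-mean weights with variance 1/width (an assumed initialization).
        w = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        x = selu(x @ w)
    return x

rng = np.random.default_rng(0)
inputs = rng.normal(0.2, 1.3, size=(1000, 512))  # deliberately not standardized
outputs = propagate(inputs, n_layers=20, rng=rng)
print(f"mean={outputs.mean():.3f}, variance={outputs.var():.3f}")
```

With these settings the activation statistics after 20 layers should stay near zero mean and unit variance, without any explicit normalization layer. Finite width introduces some fluctuation, so the values are approximate.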

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Heusel, M.,
Ramsauer, H.,
Unterthiner, T.,
Nessler, B.,
and Hochreiter, S.

2017

Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP) outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.
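The distance underlying FID is the Fréchet distance between two Gaussians, `||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))`, computed on feature statistics of real and generated images. The NumPy sketch below evaluates this formula on random vectors that stand in for Inception features, which is an assumption: no Inception network is involved here. The helper names `gaussian_stats` and `frechet_distance` are hypothetical.

```python
import numpy as np

def gaussian_stats(features):
    """Mean vector and covariance matrix of a (n_samples, n_features) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))       # stand-in for real-image features
generated = rng.normal(0.5, 1.2, size=(500, 8))  # stand-in for generated-image features

mu_r, sigma_r = gaussian_stats(real)
mu_g, sigma_g = gaussian_stats(generated)
print("distance real vs generated:", frechet_distance(mu_r, sigma_r, mu_g, sigma_g))
print("distance real vs real:", frechet_distance(mu_r, sigma_r, mu_r, sigma_r))
```

A distribution compared with itself gives a distance of zero up to floating-point error, and the value grows as the means or covariances of the two feature sets move apart.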