One goal of deep learning is to provide models with the ability to store and access information in a learnable manner. Classical examples are Hopfield Networks, which can store information and retrieve it by association. However, their limited storage capacity and their restriction to binary data render them inadequate in the context of modern deep learning. Recent work has led to a novel formulation of Hopfield Networks that exhibits a considerable increase in storage capacity. To integrate these modern Hopfield Networks into deep learning architectures, however, they must be made differentiable, which in turn requires a transition from the binary to the continuous domain.
In the paper “Hopfield Networks Is All You Need” we introduce a continuous generalization of modern Hopfield Networks. It comprises a novel energy function for continuous patterns and a new update rule that globally converges to stationary points of the energy (local minima or saddle points). Both desirable properties of the binary networks, namely exponential storage capacity and fast convergence of the update rule, are inherited by the continuous generalization. Here, fast convergence typically means convergence after a single update step, which corresponds to one forward pass. This opens up the possibility of integrating continuous modern Hopfield Networks as layers into deep learning architectures.
This ability gives rise to a wide variety of novel architectures and use cases for deep learning. The task of this group is to explore these new possibilities and to advance the understanding of this new form of Hopfield Networks.
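To make the update rule concrete, the following is a minimal sketch in plain PyTorch (not the authors' hopfield-layers code) of the retrieval step xi_new = X softmax(beta * X^T xi); the dimensions, the value of beta, and the random Gaussian patterns are illustrative assumptions, not values from the paper.

import torch

torch.manual_seed(0)

d, N = 64, 32                 # pattern dimension and number of stored patterns (illustrative)
X = torch.randn(d, N)         # stored continuous patterns as columns of X
beta = 8.0                    # inverse temperature; larger beta gives sharper retrieval

xi = X[:, 0] + 0.1 * torch.randn(d)   # query: a noisy version of the first stored pattern

def update(xi):
    # One step of the update rule: xi_new = X softmax(beta * X^T xi)
    return X @ torch.softmax(beta * (X.t() @ xi), dim=0)

xi_1 = update(xi)             # typically already close to the stored pattern
xi_2 = update(xi_1)           # a second step barely changes anything

print("retrieval error after one step:", torch.norm(xi_1 - X[:, 0]).item())
print("change in the second step:", torch.norm(xi_2 - xi_1).item())

With well-separated patterns the first step already lands next to the stored pattern, which is the "one update corresponds to one forward pass" behaviour described above.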
Recent publications on modern Hopfield Networks:
Hopular: Modern Hopfield Networks for Tabular Data
Schäfl, B., Gruber, L., Bitto-Nemling, A., and Hochreiter, S., 2022
While Deep Learning excels in structured data as encountered in vision and natural language processing, it failed to meet its expectations on tabular data. For tabular data, Support Vector Machines (SVMs), Random Forests, and Gradient Boosting are the best performing techniques with Gradient Boosting in the lead. Recently, we saw a surge of Deep Learning methods that were tailored to tabular data but still underperform compared to Gradient Boosting on small-sized datasets. We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks. The modern Hopfield networks use stored data to identify feature-feature, feature-target, and sample-sample dependencies. Hopular’s novelty is that every layer can directly access the original input as well as the whole training set via stored data in the Hopfield networks. Therefore, Hopular can step-wise update its current model and the resulting prediction at every layer like standard iterative learning algorithms. In experiments on small-sized tabular datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods. In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM and a state-of-the-art Deep Learning method designed for tabular data. Thus, Hopular is a strong alternative to these methods on tabular data.
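To convey the flavour of such a layer (a loose, hypothetical sketch, not Hopular's actual architecture), one can think of each block as letting the current sample embeddings query a memory that stores the embedded training set and refining the representation with the retrieval; all names, shapes, beta, and the residual wiring below are illustrative assumptions.

import torch

def retrieve(query, memory, beta=2.0):
    # query: (batch, d), memory: (n_train, d); softmax over the stored training samples
    return torch.softmax(beta * query @ memory.t(), dim=-1) @ memory

n_train, d, batch = 100, 16, 4
train_memory = torch.randn(n_train, d)   # stands in for the embedded training set
h = torch.randn(batch, d)                # current sample embeddings

for _ in range(3):                       # a few step-wise refinement blocks
    h = h + retrieve(h, train_memory)    # residual update from retrieved training data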
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
Fürst, A., Rumetshofer, E., Tran, V., Ramsauer, H., Tang, F., Lehner, J., Kreil, D., Kopp, M., Klambauer, G., Bitto-Nemling, A., and Hochreiter, S., 2021
Contrastive learning with the InfoNCE objective is exceptionally successful in various self-supervised learning tasks. Recently, the CLIP model yielded impressive results on zero-shot transfer learning when using InfoNCE for learning visual representations from natural language supervision. However, InfoNCE as a lower bound on the mutual information has been shown to perform poorly for high mutual information. In contrast, the InfoLOOB upper bound (leave one out bound) works well for high mutual information but suffers from large variance and instabilities. We introduce "Contrastive Leave One Out Boost" (CLOOB), where modern Hopfield networks boost learning with the InfoLOOB objective. Modern Hopfield networks replace the original embeddings by retrieved embeddings in the InfoLOOB objective. The retrieved embeddings give InfoLOOB two assets. Firstly, the retrieved embeddings stabilize InfoLOOB, since they are less noisy and more similar to one another than the original embeddings. Secondly, they are enriched by correlations, since the covariance structure of embeddings is reinforced through retrievals. We compare CLOOB to CLIP after learning on the Conceptual Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
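The following is a hedged sketch of the two ingredients just described: Hopfield retrieval that replaces the original embeddings by retrieved ones, and an InfoLOOB-style objective whose denominator leaves the matched pair out. The inverse temperature, beta, the same-modality retrieval wiring, and all shapes are illustrative simplifications rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def hopfield_retrieval(state, memory, beta=8.0):
    # state: (n, d) queries, memory: (n, d) stored patterns
    return torch.softmax(beta * state @ memory.t(), dim=-1) @ memory

def info_loob(x, y, inv_tau=30.0):
    # x, y: (n, d) L2-normalized embeddings of the two modalities
    sim = inv_tau * x @ y.t()                      # pairwise similarities
    pos = sim.diag()                               # matched (positive) pairs
    eye = torch.eye(len(x), dtype=torch.bool)
    neg = torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=-1)  # leave the positive out
    return (neg - pos).mean()

n, d = 8, 32
img = F.normalize(torch.randn(n, d), dim=-1)
txt = F.normalize(torch.randn(n, d), dim=-1)

# Replace the original embeddings by retrieved ones, then apply InfoLOOB in both directions.
u = F.normalize(hopfield_retrieval(img, img), dim=-1)
v = F.normalize(hopfield_retrieval(txt, txt), dim=-1)
loss = info_loob(u, v) + info_loob(v, u)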
Hopfield Networks Is All You Need
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., and Hochreiter, S., 2020
We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. We provide a new PyTorch layer called "Hopfield", which allows deep learning architectures to be equipped with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: https://github.com/ml-jku/hopfield-layers
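The claimed correspondence can be checked numerically in a few lines: with keys and values both set to the stored patterns, the projection matrices omitted, and beta = 1/sqrt(d_k), the Hopfield update applied to a batch of state vectors coincides with transformer attention. The shapes below are illustrative, and this is not the hopfield-layers implementation itself.

import torch

d_k, n_queries, n_stored = 64, 5, 40
Q = torch.randn(n_queries, d_k)     # state vectors / queries
K = torch.randn(n_stored, d_k)      # stored patterns serve as keys and values here
beta = 1.0 / d_k ** 0.5

attention_out = torch.softmax(Q @ K.t() / d_k ** 0.5, dim=-1) @ K   # transformer attention
hopfield_out = torch.softmax(beta * Q @ K.t(), dim=-1) @ K          # Hopfield update per query

print(torch.allclose(attention_out, hopfield_out))                  # True

In the full correspondence of the paper, queries, keys, and values additionally pass through learned projection matrices, which are omitted above for brevity.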
Modern Hopfield Networks and Attention for Immune Repertoire Classification
Widrich, M., Schäfl, B., Ramsauer, H., Pavlović, M., Gruber, L., Holzleitner, M., Brandstetter, J., Sandve, G., Greiff, V., Hochreiter, S., and Klambauer, G. In Advances in Neural Information Processing Systems, 2020
A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. In immune repertoire classification, a vast number of immune receptors are used to predict the immune status of an individual. This constitutes a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments including simulated and real-world virus infection data and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC
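As a rough illustration of attention-based pooling over a massive bag of instances, the sketch below produces a single bag-level (repertoire-level) prediction; it is a simplified stand-in rather than the actual DeepRC model, and the feature dimensions, the small scoring network, and the bag size are hypothetical.

import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, d_features=64, d_attn=32):
        super().__init__()
        # small network producing one attention score per instance
        self.score = nn.Sequential(nn.Linear(d_features, d_attn), nn.Tanh(), nn.Linear(d_attn, 1))
        self.classifier = nn.Linear(d_features, 1)   # bag-level (repertoire-level) prediction

    def forward(self, instances):                    # instances: (n_instances, d_features)
        attn = torch.softmax(self.score(instances), dim=0)  # weights over the whole bag
        pooled = (attn * instances).sum(dim=0)              # attention-weighted pooling
        return self.classifier(pooled)

bag = torch.randn(100_000, 64)    # one repertoire: a very large bag of instance features
model = AttentionMILPooling()
logit = model(bag)                # a single immune-status logit for the whole repertoire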