This post explains the paper “Hopular: Modern Hopfield Networks for Tabular Data”.
Hopular (“Modern Hopfield Networks for Tabular Data”) is a Deep Learning architecture for tabular data, where each layer is equipped with continuous modern Hopfield networks. Hopular is novel as it provides the original training set and the original input at each of its layers. Therefore, Hopular refines the current prediction at every layer by reaccessing the original training set like standard iterative learning algorithms.
A Hopular block stores two types of data:
 the whole training set
 the embedded input sample
The stored training set enables Hopular to find similarities across feature vectors and target vectors, while the stored embedded input sample enables Hopular to determine dependencies between features and targets.
In the real world, small-sized and medium-sized tabular datasets with less than 10,000 samples are ubiquitous. Hitherto, Deep Learning has underperformed on such datasets. In contrast, Support Vector Machines (SVMs), Random Forests and, in particular, Gradient Boosting typically lead to higher performances than Deep Learning. Gradient Boosting methods like XGBoost have the edge over other methods on most small-sized and medium-sized tabular datasets.
Hopular surpasses not only Gradient Boosting, Random Forests, and SVMs but also state-of-the-art Deep Learning approaches to tabular data.
Table of Contents
 Motivation: Deep Learning Underperforms on Tabular Data
 Hopular: the new Deep Learning Architecture for Tabular Data
 Hopular Intuition: Mimicking Iterative Learning
 Experiments
 Code and Paper
 Additional Material
 Correspondence
Motivation: Deep Learning Underperforms on Tabular Data
In the real world, small-sized and medium-sized tabular datasets with less than 10,000 samples are ubiquitous. Their omnipresence can be witnessed at Kaggle challenges. They are found in life sciences for:
 modeling certain diseases
 predicting bioassay outcomes in drug design
 modeling environmental soil contamination
They are also found in most industrial applications for:
 predicting customer behavior
 controlling processes
 optimizing logistics
 recommending other products
 employing predictive maintenance
So far, Deep Learning has not been convincing on small-sized and medium-sized tabular datasets. Therefore, we propose the Hopular Deep Learning architecture.
Hopular: the new Deep Learning Architecture for Tabular Data
The Hopular architecture consists of:
 input layer: embedding layer
 hidden layers: Hopular blocks
 output layer: summarization layer
Algorithm 1 shows the forward pass of Hopular for an original input sample \(\Bx\).
Input Layer: Embedding of the Input Sample
A categorical feature is encoded as a one-hot vector while a continuous feature is standardized. The feature value, feature type, and feature position are all mapped to an \(e\)-dimensional embedding space. All three embedding vectors are summed to form a feature representation. The input sample is represented by \(\By\), which is the concatenation of all the input sample’s feature representations. The current prediction \(\Bxi\) is initialized by \(\By\).
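The following is a minimal PyTorch sketch of such an embedding layer. All class and argument names are ours, not from the official implementation, and for simplicity every feature value is treated as a standardized scalar, whereas Hopular one-hot encodes categorical features.

```python
import torch
import torch.nn as nn


class FeatureEmbedding(nn.Module):
    """Illustrative sketch of the Hopular input embedding.

    Each of the d features gets a value embedding, a type embedding, and a
    positional embedding in an e-dimensional space; the three are summed and
    the d feature representations are concatenated into y (the initial xi).
    """

    def __init__(self, num_features: int, num_feature_types: int, e_dim: int):
        super().__init__()
        # value embedding: here a simple linear map per scalar feature value
        self.value_embedding = nn.Linear(1, e_dim)
        # one learned vector per feature type (e.g. categorical vs. continuous)
        self.type_embedding = nn.Embedding(num_feature_types, e_dim)
        # one learned vector per feature position
        self.position_embedding = nn.Embedding(num_features, e_dim)
        self.num_features = num_features

    def forward(self, x: torch.Tensor, feature_types: torch.Tensor) -> torch.Tensor:
        # x: (batch, d) standardized feature values
        # feature_types: (d,) long tensor with the type index of each feature
        batch_size = x.shape[0]
        positions = torch.arange(self.num_features, device=x.device)
        value_emb = self.value_embedding(x.unsqueeze(-1))   # (batch, d, e)
        type_emb = self.type_embedding(feature_types)       # (d, e)
        pos_emb = self.position_embedding(positions)        # (d, e)
        feature_repr = value_emb + type_emb + pos_emb        # (batch, d, e)
        # concatenate all feature representations into one vector y per sample
        return feature_repr.reshape(batch_size, -1)
```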
The central component of the Hopular architecture is the Hopular block.
Hidden Layer: Hopular Block
A Hopular block consists of:
 Hopfield Module \(H_s\) (sample-sample interactions)
 Hopfield Module \(H_f\) (feature-feature interactions)
 Aggregation Block (result collection and information pass-through)
(I) — Hopfield Module \(H_{s}\). A continuous modern Hopfield network for Deep Learning architectures is implemented
via the layer HopfieldLayer
Ramsauer et al., 2021; Ramsauer et al., 2020
with the training set as fixed stored patterns.
The current prediction \(\Bxi\) serves as the input (the state vector) to Hopfield module \(H_{s}\).
Thus, \(\Bxi\) interacts with the whole training data
as described in Eq.\(~\)\eqref{eq:Hs}.
Therefore, the Hopfield module \(H_{s}\) identifies sample-sample interactions
and can perform similarity searches like a nearest-neighbor search
in the whole training data.
The forward pass for module \(H_{s}\) with one Hopfield network and state \(\Bxi\), learned weight matrices \(\BW_{\Bxi},\BW_{\BX}\), \(\BW_{\BS}\), the stored training set \(\BX\), and a fixed scaling parameter \(\beta\) is given as
\[\begin{align}\label{eq:Hs}\tag{1} H_s\left( \Bxi \right) \ &= \ \BW_{\BS} \ \BW_{\BX} \ \BX \ \soft \left( \beta \ \BX^{T} \ \BW_{\BX}^{T} \ \BW_{\Bxi} \ \Bxi \right). \end{align}\]The hyperparameter \(\beta\) allows steering the nearest-neighbor lookup of the sample-sample Hopfield module \(H_{s}\). The module \(H_s\) can comprise \(N\) separate Hopfield networks \(H_{s}^{i}\), where the module output is defined as
\[\begin{align}\label{eq:Hs_combined}\tag{2} H_{s} \left(\Bxi \right) \ &= \ \BW_{G} \ \left( H_{s}^{1} \left( \Bxi \right)^{T}, \ldots,\ H_{s}^{N} \left( \Bxi \right)^{T} \right)^{T} \ , \end{align}\]with vector \(\left( H_{s}^{1} \left( \Bxi \right)^{T}, \ldots,\ H_{s}^{N} \left( \Bxi \right)^{T} \right)^{T}\) and a learnable weight matrix \(\BW_{G}\).
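A minimal PyTorch sketch of Eq.\(~\)\eqref{eq:Hs} may make the lookup explicit. Function and variable names are illustrative, not the reference code; Eq.\(~\)\eqref{eq:Hs_combined} would simply concatenate the outputs of several such modules and apply \(\BW_{G}\).

```python
import torch


def hopfield_module_s(xi, X, W_xi, W_X, W_S, beta):
    """Sketch of Eq. (1): the sample-sample Hopfield module H_s.

    xi   : (d,)    current prediction (state vector)
    X    : (d, N)  stored training set, one embedded sample per column
    W_xi : (k, d), W_X : (k, d), W_S : (d, k)  learned weight matrices
    beta : fixed scaling of the softmax (inverse temperature)
    """
    # similarity of the projected state to every stored training sample
    scores = beta * (X.T @ W_X.T @ (W_xi @ xi))        # (N,)
    # soft nearest-neighbor lookup over the whole training set
    p = torch.softmax(scores, dim=0)                   # (N,)
    # weighted combination of the projected stored patterns
    return W_S @ (W_X @ X @ p)                         # (d,)


# tiny usage example with random data
d, N, k = 8, 100, 16
out = hopfield_module_s(torch.randn(d), torch.randn(d, N),
                        torch.randn(k, d), torch.randn(k, d),
                        torch.randn(d, k), beta=1.0)
```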
(II) — Hopfield Module \(H_{f}\).
A continuous modern Hopfield network for Deep Learning architectures is implemented
via the layer Hopfield
Ramsauer et al., 2021; Ramsauer et al., 2020
with the embedded input features as stored patterns.
The current prediction \(\Bxi\) serves as the input to Hopfield module \(H_{f}\).
Prior to entering \(H_{f}\), the current prediction \(\Bxi\) is reshaped
to the matrix \(\BXi\) with the embedded input features as rows.
\(\BXi\) interacts with the embedded features
of the original input sample
as described in Eq.\(~\)\eqref{eq:Hf}.
Therefore, the Hopfield module \(H_{f}\) extracts and models
feature-feature and feature-target relations.
Thus, the current prediction \(\Bxi\) interacts with the original input sample \(\By\).
The forward pass for module \(H_{f}\) with one Hopfield network and state \(\BXi\), learned weight matrices \(\BW_{\BXi},\BW_{\BY}\), \(\BW_{\BF}\), the embedded input sample \(\BY\), and a fixed scaling parameter \(\beta\) is given as
\[\begin{align}\label{eq:Hf}\tag{3} H_f \left(\BXi \right) \ &= \ \BW_{\BF} \ \BW_{\BY} \ \BY \ \soft \left( \beta \ \BY^{T} \ \BW_{\BY}^{T} \ \BW_{\BXi} \ \BXi \right). \end{align}\]\(H_{f}\) may contain more than one continuous modern Hopfield network, which leads to an equation analogous to Eq.\(~\)\eqref{eq:Hs_combined} for \(H_{s}\).
(III) — Aggregation Block. The results of each Hopfield module are combined via a residual connection with the current prediction \(\Bxi\) (or its reshaped version \(\BXi\) for \(H_{f}\)), thereby refining it. The aggregation block then passes the current prediction \(\Bxi\) to the next layer.
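A compact sketch of how the three components could fit together is given below. The callables `H_s` and `H_f` are stand-ins for the Hopfield modules above, and all names are ours rather than those of the official implementation.

```python
import torch


def hopular_block(xi, X_train, Y_emb, H_s, H_f):
    """Illustrative sketch of one Hopular block (not the reference code).

    xi      : (d*e,)  current prediction vector
    X_train : stored (embedded) training set used by H_s
    Y_emb   : (d, e)  embedded original input sample, one feature per row
    H_s, H_f: callables implementing Eq. (1) and Eq. (3)
    """
    d, e = Y_emb.shape
    # (I) sample-sample module, combined with a residual connection
    xi = xi + H_s(xi, X_train)
    # (II) reshape to a matrix with the embedded features as rows, then the
    #      feature-feature module, again with a residual connection
    Xi = xi.reshape(d, e)
    Xi = Xi + H_f(Xi, Y_emb)
    # (III) aggregation: hand the refined prediction to the next layer
    return Xi.reshape(-1)
```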
The last layer of a Hopular architecture is the output layer, which maps the current prediction to the final prediction.
Output Layer: Summarization of the Current Prediction
Hopular is trained in a multi-task setting. Its objective is a weighted sum of two losses:
 for predicting masked features of the input sample (BERT masking)
 for predicting the target of the input sample (standard supervised loss)
Thus, in addition to the target, the masked features of the input sample must also be predicted during training. Therefore, the current prediction is a vector constructed by concatenating the current feature predictions and the current target prediction. The current prediction is mapped to the final prediction by separately mapping each current feature prediction to its corresponding final feature prediction and by mapping the current target prediction to the final target prediction.
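A minimal sketch of such a multi-task objective is given below. The weighting `lam` and the use of cross-entropy for all features are our assumptions for illustration; continuous features would need, for instance, a mean-squared-error term instead.

```python
import torch
import torch.nn.functional as F


def hopular_loss(feature_logits, feature_targets, mask,
                 target_logits, target, lam=0.5):
    """Sketch of a weighted multi-task loss (names and weighting are ours).

    feature_logits  : (d, c)  predictions for the input features
    feature_targets : (d,)    ground-truth feature classes
    mask            : (d,)    boolean mask of the features hidden from the model
    target_logits   : (1, t)  prediction for the sample's target
    target          : (1,)    ground-truth target class
    """
    # BERT-style masking loss: only the masked features contribute
    masking_loss = F.cross_entropy(feature_logits[mask], feature_targets[mask])
    # standard supervised loss on the target
    target_loss = F.cross_entropy(target_logits, target)
    # weighted sum of the two losses
    return lam * masking_loss + (1.0 - lam) * target_loss
```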
Hopular Intuition: Mimicking Iterative Learning
A huge advantage of Hopular is that it can mimic iterative learning algorithms, in contrast to other Deep Learning methods for tabular data like NPTs and SAINT. Both NPTs and SAINT consider feature-feature and sample-sample interactions via their respective attention mechanisms, which solely use the result of the previous layer. In contrast, Hopular not only uses the result of the previous layer but also the original input sample and the whole training set.
In every Hopular Block:
 the original input sample
 the whole training set
can be evaluated with respect to the current prediction. This resembles computing the error for the input sample and updating the result using the whole training set. Side note: if the original features are not overwritten, the current prediction can still contain the original input.
Metric Learning for Kernel Regression by a Hopular Block
We consider the Nadaraya-Watson kernel regression. The training set is \(\{(\Bz_1,\By_1),\ldots,(\Bz_N,\By_N)\}\) with inputs \(\Bz_i\) summarized by the input matrix \(\BZ = (\Bz_1,\ldots,\Bz_N)\) and labels \(\By_i\) summarized in the label matrix \(\BY=(\By_1,\ldots,\By_N)\). The kernel function is \(k(\Bz_i,\Bz)\). The estimator \(\Bg\) for \(\By\) given \(\Bz\) is:
\[\begin{align}\tag{4} \Bg(\Bz) \ &= \ \sum_{i=1}^N \By_i \ \frac{k(\Bz_i,\Bz)}{\sum_{j=1}^N k(\Bz_j,\Bz)} \ . \end{align}\]For vectors normalized to length \(1\) and the exponential kernel \(k(\Bz_i,\Bz_j) = \exp( - \beta/2 \ \left\lVert \Bz_i - \Bz_j \right\rVert^2 )\), we have
\[\begin{align}\tag{5} k(\Bz_i,\Bz_j) \ &= \ c \ \exp( \beta \ \Bz_i^T \Bz_j ) \end{align}\]Therefore, the estimator is:
\[\begin{align}\label{eq:estimator}\tag{6} \Bg(\Bz) \ &= \ \BY \ \soft(\beta \ \BZ^T \Bz) \end{align}\]Metric learning for kernel regression learns the kernel \(k\), which is the distance function Weinberger & Tesauro, 2007. A Hopular Block does the same in Eq.\(~\)\eqref{eq:Hs} via learning the weight matrices \(\BW_{\BX}\) and \(\BW_{\Bxi}\). If we set in Eq.\(~\)\eqref{eq:estimator}:
\[\begin{align}\tag{7} \BZ^{T} = \BX^{T}\BW_{\BX}^{T}\ , \quad \Bz = \BW_{\Bxi}\ \Bxi\ , \quad \BY = \BW_{\BS} \BW_{\BX} \BX \end{align}\]then we obtain Eq.\(~\)\eqref{eq:Hs}, with the fixed label matrix \(\BY\).
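To make this correspondence concrete, here is a small sketch (names are illustrative): the first function is the estimator of Eq.\(~\)\eqref{eq:estimator}, and the second shows that substituting the learned matrices as in Eq. (7) reproduces the Hopfield module of Eq.\(~\)\eqref{eq:Hs}.

```python
import torch


def nadaraya_watson(Z, Y, z, beta):
    """Eq. (6): kernel regression estimate with the exponential kernel.
    Z : (d, N) training inputs as columns; Y : (m, N) labels as columns."""
    return Y @ torch.softmax(beta * (Z.T @ z), dim=0)


def metric_learned_estimate(X, xi, W_X, W_xi, W_S, beta):
    """Substituting Z^T = X^T W_X^T, z = W_xi xi, Y = W_S W_X X (Eq. (7))
    turns the estimator into the Hopfield module H_s of Eq. (1)."""
    return nadaraya_watson(W_X @ X, W_S @ W_X @ X, W_xi @ xi, beta)
```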
Linear Model with the AdaBoost Objective by a Hopular Block
The AdaBoost objective for classification with a binary target \(y \in{} \{-1, +1\}\) can be written as follows (Eq.\(~\)(3) and Eq.\(~\)(4) in Shen & Li, 2010):
\[\begin{align}\tag{8} \rL \ &= \ \ln \sum_{i=1}^{N} \exp( - \ y_i \ g(\Bz_i) ) \ . \end{align}\]We use this objective for learning the linear model:
\[\begin{align}\tag{9} g(\Bz_i) \ &= \ \beta \ \Bxi^T \Bz_i \ . \end{align}\]The objective multiplied by \(\beta^{-1}\) with \(\BY\) as the diagonal matrix of targets \(y_{i}\) becomes
\[\begin{align}\tag{10} \rL \ &= \ \beta^{-1} \ \ln \sum_{i=1}^{N} \exp( - \ \beta \ y_i \ \Bxi^T \Bz_i ) \ = \ \mathrm{lse}(\beta \ , \ - \BY \ \BZ^T \Bxi) \ , \end{align}\]where \(\mathrm{lse}\) is the log-sum-exponential function. The gradient of this objective is
\[\begin{align}\tag{11} \frac{\partial \rL}{\partial \Bxi} \ &= \ - \ \BZ \ \BY \ \soft( - \ \beta \ \BY \ \BZ^T \Bxi ) \ . \end{align}\]This is Eq.\(~\)\eqref{eq:Hs} with:
\[\begin{align}\tag{12} - \ \BY \BZ^{T} = \BX^{T}\BW_{\BX}^{T}\ , \quad \BW_{\Bxi} = \BI\ , \quad \BW_{\BS} = \BI \end{align}\]Thus, a Hopular Block can implement a gradient descent update rule for a linear classification model using the AdaBoost objective function. The current prediction \(\Bxi\) comes from the previous layer.
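As a small sketch of this correspondence (the learning rate and all names are our assumptions), one gradient-descent step on the AdaBoost objective amounts to a single softmax lookup over the training set, exactly the form of the Hopfield module:

```python
import torch


def adaboost_gradient_step(xi, Z, y, beta, lr=0.1):
    """One gradient-descent step on the objective of Eq. (10)
    for the linear model g(z) = beta * xi^T z.

    Z : (d, N) training inputs as columns; y : (N,) labels in {-1, +1}.
    """
    # Eq. (11): gradient is -Z Y softmax(-beta Y Z^T xi), with Y = diag(y)
    grad = -Z @ (y * torch.softmax(-beta * y * (Z.T @ xi), dim=0))
    # in Hopular, the residual connection plays the role of this update
    return xi - lr * grad
```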
Experiments
Small-Sized Tabular Datasets
In this experiment we compare methods on small-sized tabular datasets, most of which have fewer than 500 samples.
Methods Compared. We compare Hopular, XGBoost, CatBoost, LightGBM, NPTs, and 24 other machine learning methods as described in Wainberg et al., 2016 and Klambauer et al., 2017. The compared methods include 10 Deep Learning (DL) approaches.
Datasets. Following Klambauer et al., 2017, we consider UCI machine learning repository datasets with less than or equal to 1,000 samples as being small. We select a subset of 21 datasets, comprising 200 to 1,000 samples, from Klambauer et al., 2017. Of these, 13 datasets have 500 samples or less.
Results. Across the considered UCI repository datasets, Hopular achieves the lowest median rank and is therefore the best-performing method.
Medium-Sized Tabular Datasets
In this experiment we compare methods on medium-sized tabular datasets of about 10,000 samples each.
Methods Compared. We compare Hopular, NPTs, XGBoost, CatBoost, and LightGBM.
Datasets. We select the datasets of Shwartz-Ziv and Armon, 2021, on which XGBoost performed better than Deep Learning methods that were designed for tabular data. We extend this selection by two regression datasets: (a) colleges has already been used for other Deep Learning methods for tabular data Somepalli et al., 2021, and (b) sulfur is publicly available and, with its 10,082 instances, fits well into the existing collection of medium-sized datasets.
Results. The following table gives the accuracies for the different datasets and methods. Hopular is the best-performing method on 3 out of the 6 datasets. The runner-up method, CatBoost, is the best method twice, whereas XGBoost is the best method once. Over the 6 datasets, NPTs and XGBoost have a median rank of 4.5, CatBoost and LightGBM have median ranks of 2.5 and 2, respectively, and Hopular has a median rank of 1.5. On average over all 6 datasets, Hopular performs better than NPTs, XGBoost, CatBoost, and LightGBM.
Recap: Modern Hopfield networks
The associative memory of our choice is the modern Hopfield network for Deep Learning architectures because of its fast retrieval and high storage capacity, as shown in Hopfield Networks is All You Need. The update mechanism of these modern Hopfield networks is equivalent to the self-attention mechanism of Transformer networks. However, modern Hopfield networks for Deep Learning architectures are more general and have a broader functionality, of which the Transformer self-attention is just one example. The corresponding Hopfield layers can be built into Deep Learning architectures for associating two sets, encoder-decoder attention, multiple instance learning, or averaging and pooling operations. For details, see our blog post Hopfield Networks is All You Need.
Modern Hopfield networks for Deep Learning architectures Ramsauer et al., 2021; Widrich et al., 2020 are associative memories that have much higher storage capacity than classical Hopfield networks and can retrieve patterns with one update only.
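For reference, the retrieval update of a continuous modern Hopfield network with stored patterns as columns has the following form, which is exactly the shape of attention with a single query (a minimal sketch, not the hopfield-layers implementation):

```python
import torch


def hopfield_retrieval(state, stored, beta):
    """One retrieval step of a continuous modern Hopfield network:
    new_state = stored @ softmax(beta * stored^T @ state).
    stored : (d, N) patterns as columns; state : (d,) the query."""
    return stored @ torch.softmax(beta * (stored.T @ state), dim=0)
```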
Code and Paper
Additional Material
 Paper: Modern Hopfield Networks and Attention for Immune Repertoire Classification
 Blog post on Energy-Based Perspective on Attention Mechanisms in Transformers
For more information visit our homepage https://ml-jku.github.io/.
Correspondence
This blog post was written by Bernhard Schäfl and Lukas Gruber.
Contributions by Angela Bitto-Nemling and Sepp Hochreiter.
Please contact us via schaefl[at]ml.jku.at