\[\newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BXi}{\boldsymbol{\Xi}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\soft}{\mathrm{softmax}} \newcommand{\rL}{\mathrm{L}}\]

This post explains the paper “Hopular: Modern Hopfield Networks for Tabular Data”.

Hopular (“Modern Hopfield Networks for Tabular Data”) is a Deep Learning architecture for tabular data in which every layer is equipped with continuous modern Hopfield networks. Hopular is novel in that it provides the original training set and the original input to each of its layers. Therefore, Hopular refines the current prediction at every layer by re-accessing the original training set, like standard iterative learning algorithms.


A Hopular block stores two types of data:

  * the whole training set, and
  * the embedded input sample.

The stored training set enables Hopular to find similarities across feature vectors and target vectors, while the stored embedded input sample enables Hopular to determine dependencies between features and targets.

In the real world, small-sized and medium-sized tabular datasets with fewer than 10,000 samples are ubiquitous. Hitherto, Deep Learning has underperformed on such datasets. In contrast, Support Vector Machines (SVMs), Random Forests, and, in particular, Gradient Boosting typically achieve higher performance than Deep Learning. Gradient Boosting methods like XGBoost have the edge over other methods on most small-sized and medium-sized tabular datasets.

Hopular surpasses not only Gradient Boosting, Random Forests, and SVMs, but also state-of-the-art Deep Learning approaches to tabular data.

Table of Contents

  1. Motivation: Deep Learning Underperforms on Tabular Data
  2. Hopular: the new Deep Learning Architecture for Tabular Data
    1. Input Layer: Embedding of the Input Sample
    2. Hidden Layer: Hopular Block
    3. Output Layer: Summarization of the Current Prediction
  3. Hopular Intuition: Mimicking Iterative Learning
    1. Metric Learning for Kernel Regression by a Hopular Block
    2. Linear Model with the AdaBoost Objective by a Hopular Block
  4. Experiments
  5. Code and Paper
  6. Additional Material
  7. Correspondence

Motivation: Deep Learning Underperforms on Tabular Data

In the real world, small-sized and medium-sized tabular datasets with fewer than 10,000 samples are ubiquitous. Their omnipresence can be witnessed at Kaggle challenges. They are found in the life sciences as well as in most industrial applications.

So far, Deep Learning has not been convincing on small-sized and medium-sized tabular datasets. Therefore, we propose the Hopular Deep Learning architecture.

Hopular: the new Deep Learning Architecture for Tabular Data

The Hopular architecture consists of:

  * an input layer that embeds the input sample,
  * hidden layers, each realized as a Hopular block, and
  * an output layer that summarizes the current prediction into the final prediction.

Algorithm 1 shows the forward pass of Hopular for an original input sample \(\Bx\).
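
The following is a minimal NumPy sketch of this forward pass; all dimensions, random weights, and variable names are illustrative assumptions and not taken from the official implementation.

```python
import numpy as np

# Minimal sketch of the Hopular forward pass (illustrative only; dimensions,
# initialization, and names are assumptions, not the authors' implementation).

rng = np.random.default_rng(0)

def softmax(a, axis=0):
    a = a - a.max(axis=axis, keepdims=True)
    return np.exp(a) / np.exp(a).sum(axis=axis, keepdims=True)

d, e, N, L = 4, 8, 32, 3                 # features, embedding dim, training samples, blocks
beta = 1.0 / np.sqrt(e)                  # fixed scaling parameter

X  = rng.normal(size=(d * e, N))         # stored embedded training set (one column per sample)
y  = rng.normal(size=d * e)              # embedded input sample (concatenated feature embeddings)
xi = y.copy()                            # current prediction, initialized with the embedded input

for _ in range(L):                       # one pass per Hopular block
    # (I) Hopfield module H_s: sample-sample retrieval over the training set, cf. Eq. (1)
    W_xi, W_X, W_S = (rng.normal(size=(d * e, d * e)) / (d * e) ** 0.5 for _ in range(3))
    xi = xi + W_S @ W_X @ X @ softmax(beta * X.T @ W_X.T @ W_xi @ xi)

    # (II) Hopfield module H_f: feature-feature retrieval against the embedded input, cf. Eq. (3)
    Xi, Y = xi.reshape(d, e).T, y.reshape(d, e).T    # one column per embedded feature
    W_Xi, W_Y, W_F = (rng.normal(size=(e, e)) / e ** 0.5 for _ in range(3))
    Xi = Xi + W_F @ W_Y @ Y @ softmax(beta * Y.T @ W_Y.T @ W_Xi @ Xi, axis=0)

    # (III) Aggregation: pass the refined current prediction to the next block
    xi = Xi.T.reshape(-1)

print(xi.shape)                          # the output layer would map xi to the final prediction
```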


Input Layer: Embedding of the Input Sample

A categorical feature is encoded as a one-hot vector, while a continuous feature is standardized. The feature value, feature type, and feature position are each mapped to an \(e\)-dimensional embedding space. All three embedding vectors are summed to form a feature representation. The input sample is represented by \(\By\), which is the concatenation of all the input sample’s feature representations. The current prediction \(\Bxi\) is initialized with \(\By\).
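
A toy sketch of this embedding step follows; the shapes, the embedding tables, and the helper name `embed_feature` are assumptions for illustration, not the authors' code.

```python
import numpy as np

# Illustrative sketch of the input embedding: each feature's value, type, and
# position are mapped to an e-dimensional vector, the three vectors are summed,
# and the per-feature representations are concatenated into y. All names and
# shapes are assumptions.

rng = np.random.default_rng(0)
e = 8

def embed_feature(value_vec, type_id, position, W_value, E_type, E_pos):
    """value_vec: one-hot vector (categorical) or standardized scalar in a length-1 array."""
    value_emb = W_value @ value_vec            # linear map of the (one-hot or scalar) value
    return value_emb + E_type[type_id] + E_pos[position]

# toy sample with one categorical feature (3 classes) and one continuous feature
categorical = np.eye(3)[1]                     # one-hot encoding of class 1
continuous  = np.array([(2.7 - 3.0) / 1.5])    # standardized: (x - mean) / std

E_type = rng.normal(size=(2, e))               # one embedding per feature type
E_pos  = rng.normal(size=(2, e))               # one embedding per feature position

y = np.concatenate([
    embed_feature(categorical, 0, 0, rng.normal(size=(e, 3)), E_type, E_pos),
    embed_feature(continuous,  1, 1, rng.normal(size=(e, 1)), E_type, E_pos),
])
xi = y.copy()                                  # the current prediction is initialized with y
print(y.shape)                                 # (2 * e,): concatenated feature representations
```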


The central component of the Hopular architecture is the Hopular block.

Hidden Layer: Hopular Block

A Hopular block consists of:

  * (I) a Hopfield module \(H_{s}\) that stores the whole training set,
  * (II) a Hopfield module \(H_{f}\) that stores the embedded features of the original input sample, and
  * (III) an aggregation block that combines the module outputs with the current prediction.

(I) — Hopfield Module \(H_{s}\). A continuous modern Hopfield network for Deep Learning architectures is implemented via the layer HopfieldLayer Ramsauer et al., 2021; Ramsauer et al., 2020 with the training set as fixed stored patterns. The current prediction \(\Bxi\) serves as the input (the state vector) to Hopfield module \(H_{s}\). Thus, \(\Bxi\) interacts with the whole training data as described in Eq.\(~\)\eqref{eq:Hs}. Therefore, the Hopfield module \(H_{s}\) identifies sample-sample interactions and can perform similarity searches, like a nearest-neighbor search, in the whole training data.

The forward-pass for module \(H_{s}\) with one Hopfield network and state \(\Bxi\), learned weight matrices \(\BW_{\Bxi},\BW_{\BX}\), \(\BW_{\BS}\), the stored training set \(\BX\), and a fixed scaling parameter \(\beta\) is given as

\[\begin{align}\label{eq:Hs}\tag{1} H_s\left( \Bxi \right) \ &= \ \BW_{\BS} \ \BW_{\BX} \ \BX \ \soft \left( \beta \ \BX^{T} \ \BW_{\BX}^{T} \ \BW_{\Bxi} \ \Bxi \right). \end{align}\]

The hyperparameter \(\beta\) allows steering the nearest-neighbor lookup of the sample-sample Hopfield module \(H_{s}\). The module \(H_s\) can comprise \(N\) separate Hopfield networks \(H_{s}^{i}\), where the module output is defined as

\[\begin{align}\label{eq:Hs_combined}\tag{2} H_{s} \left(\Bxi \right) \ &= \ \BW_{G} \ \left( H_{s}^{1} \left( \Bxi \right)^{T}, \ldots,\ H_{s}^{N} \left( \Bxi \right)^{T} \right)^{T} \ , \end{align}\]

with vector \(\left( H_{s}^{1} \left( \Bxi \right)^{T}, \ldots,\ H_{s}^{N} \left( \Bxi \right)^{T} \right)^{T}\) and a learnable weight matrix \(\BW_{G}\).
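
Below is an illustrative NumPy sketch of \(H_{s}\) with \(N\) Hopfield networks following Eq.\(~\)\eqref{eq:Hs} and Eq.\(~\)\eqref{eq:Hs_combined}; the sizes, the random initialization, and the helper name `H_s_single` are assumptions.

```python
import numpy as np

# Sketch of the sample-sample Hopfield module H_s with N separate Hopfield
# networks, following Eq. (1) and Eq. (2). Sizes and initialization are
# illustrative assumptions.

rng = np.random.default_rng(1)
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()

D, M, N_heads = 16, 50, 4                       # embedded sample dim, training samples, Hopfield nets
beta = 1.0 / np.sqrt(D)

X  = rng.normal(size=(D, M))                    # stored training set (fixed stored patterns)
xi = rng.normal(size=D)                         # current prediction (state vector)

def H_s_single(xi, X, W_xi, W_X, W_S, beta):
    """One Hopfield network, Eq. (1): retrieval from the stored training set."""
    return W_S @ W_X @ X @ softmax(beta * X.T @ W_X.T @ W_xi @ xi)

heads = []
for _ in range(N_heads):                        # N separate Hopfield networks H_s^i
    W_xi, W_X, W_S = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
    heads.append(H_s_single(xi, X, W_xi, W_X, W_S, beta))

W_G = rng.normal(size=(D, N_heads * D)) / np.sqrt(N_heads * D)
H_s = W_G @ np.concatenate(heads)               # Eq. (2): combine the N network outputs
print(H_s.shape)                                # (D,): same dimension as the current prediction
```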

(II) — Hopfield Module \(H_{f}\). A continuous modern Hopfield network for Deep Learning architectures is implemented via the layer Hopfield Ramsauer et al., 2021; Ramsauer et al., 2020 with the embedded input features as stored patterns. The current prediction \(\Bxi\) serves as the input to Hopfield module \(H_{f}\). Prior to entering \(H_{f}\), the current prediction \(\Bxi\) is reshaped to the matrix \(\BXi\) with the embedded input features as rows. \(\BXi\) interacts with the embedded features of the original input sample as described in Eq.\(~\)\eqref{eq:Hf}. Therefore, the Hopfield module \(H_{f}\) extracts and models feature-feature and feature-target relations. Thus, the current prediction \(\Bxi\) interacts with the original input sample \(\By\).

The forward-pass for module \(H_{f}\) with one Hopfield network and state \(\BXi\), learned weight matrices \(\BW_{\BXi},\BW_{\BY}\), \(\BW_{\BF}\), the embedded input sample \(\BY\), and a fixed scaling parameter \(\beta\) is given as

\[\begin{align}\label{eq:Hf}\tag{3} H_f \left(\BXi \right) \ &= \ \BW_{\BF} \ \BW_{\BY} \ \BY \ \soft \left( \beta \ \BY^{T} \ \BW_{\BY}^{T} \ \BW_{\BXi} \ \BXi \right). \end{align}\]

\(H_{f}\) may contain more than one continuous modern Hopfield network, which leads to an equation analogous to Eq.\(~\)\eqref{eq:Hs_combined} for \(H_{s}\).
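
For concreteness, here is a small sketch of Eq.\(~\)\eqref{eq:Hf} under assumed shapes (one column per embedded feature); it mirrors the retrieval of Eq.\(~\)\eqref{eq:Hs}, but over features instead of samples.

```python
import numpy as np

# Sketch of the feature-feature Hopfield module H_f, Eq. (3): the reshaped
# current prediction Xi attends over the embedded features Y of the original
# input sample. Shapes are illustrative assumptions.

rng = np.random.default_rng(2)

def softmax_cols(A):                             # softmax over the stored-pattern axis (rows)
    A = A - A.max(axis=0, keepdims=True)
    return np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)

e, d = 8, 5                                      # embedding dimension, number of features
beta = 1.0 / np.sqrt(e)

Y  = rng.normal(size=(e, d))                     # embedded features of the original input sample
Xi = rng.normal(size=(e, d))                     # current prediction, reshaped feature-wise

W_Xi, W_Y, W_F = (rng.normal(size=(e, e)) / np.sqrt(e) for _ in range(3))

H_f = W_F @ W_Y @ Y @ softmax_cols(beta * Y.T @ W_Y.T @ W_Xi @ Xi)   # Eq. (3)
print(H_f.shape)                                 # (e, d): one refined column per feature
```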

(III) — Aggregation Block. The result of each Hopfield module is combined via a residual connection with the current prediction \(\Bxi\) (or its reshaped version \(\BXi\) for \(H_{f}\)), thereby refining it. The aggregation block then passes the refined current prediction \(\Bxi\) to the next layer.


The last layer of a Hopular architecture is the output layer, which maps the current prediction to the final prediction.

Output Layer: Summarization of the Current Prediction

Hopular is trained in a multi-task setting. Its objective is a weighted sum of two losses:

  * a loss for predicting the target, and
  * a loss for predicting the masked features of the input sample.

Thus, during training not only the target but also the masked features of the input sample must be predicted. Therefore, the current prediction is a vector constructed by concatenating the current feature predictions and the current target prediction. The current prediction is mapped to the final prediction by separately mapping each current feature prediction to its corresponding final feature prediction and the current target prediction to the final target prediction.
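
A hedged sketch of such a multi-task objective is given below; the concrete losses (cross-entropy on the target, squared error on the masked features) and the weight `lam` are simplifying assumptions rather than the exact losses used by Hopular.

```python
import numpy as np

# Sketch of a multi-task training objective: a weighted sum of the target loss
# and the loss on the masked input features. The concrete losses and the weight
# lam are illustrative assumptions.

rng = np.random.default_rng(3)

def softmax(a):
    a = a - a.max()
    return np.exp(a) / np.exp(a).sum()

# final predictions produced by the output layer (toy values)
target_logits  = rng.normal(size=3)            # classification target with 3 classes
feature_preds  = rng.normal(size=5)            # predictions for 5 (continuous) features

# ground truth and the mask applied to the input features during training
target_class   = 1
feature_values = rng.normal(size=5)
mask           = np.array([True, False, True, False, False])   # which features were masked

target_loss  = -np.log(softmax(target_logits)[target_class])   # cross-entropy on the target
feature_loss = np.mean((feature_preds[mask] - feature_values[mask]) ** 2)  # masked-feature loss

lam = 0.5                                      # weighting between the two losses (assumed)
loss = lam * target_loss + (1.0 - lam) * feature_loss
print(float(loss))
```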


Hopular Intuition: Mimicking Iterative Learning

A huge advantage of Hopular is that it can mimic iterative learning algorithms, in contrast to other Deep Learning methods for tabular data like NPTs and SAINT. Both NPTs and SAINT consider feature-feature and sample-sample interactions via their respective attention mechanisms which solely use the result of the previous layer. In contrast, Hopular not only uses the result of the previous layer but also the original input sample and the whole training set.

In every Hopular Block, both the original input sample and the whole training set can be evaluated on the current prediction. This resembles computing the error for the input sample and updating the result on the whole training set. Side note: if the original features are not overwritten, the current prediction can contain the original input.

Metric Learning for Kernel Regression by a Hopular Block

We consider Nadaraya-Watson kernel regression. The training set is \(\{(\Bz_1,\By_1),\ldots,(\Bz_N,\By_N)\}\) with inputs \(\Bz_i\) summarized in the input matrix \(\BZ = (\Bz_1,\ldots,\Bz_N)\) and labels \(\By_i\) summarized in the label matrix \(\BY=(\By_1,\ldots,\By_N)\). The kernel function is \(k(\Bz_i,\Bz)\). The estimator \(\Bg\) for \(\By\) given \(\Bz\) is:

\[\begin{align}\tag{4} \Bg(\Bz) \ &= \ \sum_{i=1}^N \By_i \ \frac{k(\Bz_i,\Bz)}{\sum_{j=1}^N k(\Bz_j,\Bz)} \ . \end{align}\]

For vectors normalized to length \(1\), we have \(\left\|\Bz_i - \Bz_j\right\|^{2} = 2 - 2 \ \Bz_i^T \Bz_j\), so for the exponential kernel \(k(\Bz_i,\Bz_j) = \exp(- \beta/2 \left\|\Bz_i - \Bz_j\right\|^{2} )\) we have

\[\begin{align}\tag{5} k(\Bz_i,\Bz_j) \ &= \ c \ \exp( \beta \ \Bz_i^T \Bz_j ) \end{align}\]

with the constant \(c = \exp(-\beta)\), which cancels in the normalization. Therefore, the estimator is:

\[\begin{align}\label{eq:estimator}\tag{6} \Bg(\Bz) \ &= \ \BY \ \soft(\beta \ \BZ^T \Bz) \end{align}\]
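
A small numerical check of this reformulation (with randomly generated, unit-length toy inputs) confirms that the softmax form of Eq.\(~\)\eqref{eq:estimator} equals the explicit kernel-weighted average of Eq.\(~\)(4).

```python
import numpy as np

# Numerical check that the softmax form of Eq. (6) equals the explicit
# kernel-weighted average of Eq. (4) for unit-length inputs (illustrative data).

rng = np.random.default_rng(4)
beta, N, d, m = 2.0, 20, 6, 3

Z = rng.normal(size=(d, N)); Z /= np.linalg.norm(Z, axis=0)   # unit-length training inputs
Y = rng.normal(size=(m, N))                                   # training labels
z = rng.normal(size=d);      z /= np.linalg.norm(z)           # unit-length query

# Eq. (4): Nadaraya-Watson estimate with k(z_i, z) = exp(-beta/2 * ||z_i - z||^2)
k = np.exp(-beta / 2.0 * np.sum((Z - z[:, None]) ** 2, axis=0))
g_kernel = Y @ (k / k.sum())

# Eq. (6): the same estimate written with a softmax
s = np.exp(beta * Z.T @ z); s /= s.sum()
g_softmax = Y @ s

print(np.allclose(g_kernel, g_softmax))   # True: the constant c cancels in the normalization
```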

Metric learning for kernel regression learns the kernel \(k\) via its underlying distance function Weinberger & Tesauro, 2007. A Hopular Block does the same in Eq.\(~\)\eqref{eq:Hs} via learning the weight matrices \(\BW_{\BX}\) and \(\BW_{\Bxi}\). If we set in Eq.\(~\)\eqref{eq:estimator}:

\[\begin{align}\tag{7} \BZ^{T} = \BX^{T}\BW_{\BX}^{T}\ , \quad \Bz = \BW_{\Bxi}\ \Bxi\ , \quad \BY = \BW_{\BS} \BW_{\BX} \BX \end{align}\]

then we obtain Eq.\(~\)\eqref{eq:Hs}, with the fixed label matrix \(\BY\).

Linear Model with the AdaBoost Objective by a Hopular Block

The AdaBoost objective for classification with a binary target \(y \in{} \{-1, +1\}\) can be written as follows (Eq.\(~\)(3) and Eq.\(~\)(4) in Shen & Li, 2010):

\[\begin{align}\tag{8} \rL \ &= \ \ln \sum_{i=1}^{N} \exp(- \ y_i \ g(\Bz_i) ) \ . \end{align}\]

We use this objective for learning the linear model:

\[\begin{align}\tag{9} g(\Bz_i) \ &= \ \beta \ \Bxi^T \Bz_i \ . \end{align}\]

The objective, multiplied by \(\beta^{-1}\) and with \(\BY\) as the diagonal matrix of the targets \(y_{i}\), becomes

\[\begin{align}\tag{10} \rL \ &= \ \beta^{-1} \ \ln \sum_{i=1}^{N} \exp(-\beta \ y_i \ \Bxi^T \Bz_i ) \ = \ \mathrm{lse}(\beta \ , -\BY \ \BZ^T \Bxi) \ , \end{align}\]

where \(\mathrm{lse}\) is the log-sum-exponential function, \(\mathrm{lse}(\beta, \Ba) = \beta^{-1} \ln \sum_{i=1}^{N} \exp(\beta \ a_i)\). The gradient of this objective is

\[\begin{align}\tag{11} \frac{\partial \rL}{\partial \Bxi} \ &= \ - \ \BZ \ \BY \ \soft( - \ \beta \ \BY \ \BZ^T \Bxi ) \ . \end{align}\]

This is Eq.\(~\)\eqref{eq:Hs} with:

\[\begin{align}\tag{12} - \BY \BZ^{T} = \BX^{T}\BW_{\BX}^{T}\ , \quad \BW_{\Bxi} = \BI\ , \quad \BW_{\BS} = \BI \end{align}\]

Thus, a Hopular Block can implement a gradient descent update rule for a linear classification model with the AdaBoost objective function. The current prediction \(\Bxi\) comes from the previous layer.
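
To make this correspondence concrete, the following sketch (with randomly generated toy data) verifies that the softmax expression of the gradient in Eq.\(~\)(11) matches a finite-difference gradient of the scaled AdaBoost objective in Eq.\(~\)(10).

```python
import numpy as np

# Numerical check that the softmax gradient of Eq. (11) matches a
# finite-difference gradient of the scaled AdaBoost objective of Eq. (10).
# Data and sizes are illustrative assumptions.

rng = np.random.default_rng(5)
beta, N, d = 1.5, 12, 4

Z  = rng.normal(size=(d, N))                     # training inputs, one column per sample
y  = rng.choice([-1.0, 1.0], size=N)             # binary targets
Y  = np.diag(y)                                  # diagonal target matrix
xi = rng.normal(size=d)                          # parameters of the linear model

def objective(xi):                               # Eq. (10): beta^{-1} * log-sum-exp of the margins
    return np.log(np.sum(np.exp(-beta * y * (Z.T @ xi)))) / beta

def softmax(a):
    a = a - a.max()
    return np.exp(a) / np.exp(a).sum()

grad_analytic = -Z @ Y @ softmax(-beta * Y @ Z.T @ xi)          # Eq. (11)

eps = 1e-6                                       # central finite differences
grad_numeric = np.array([
    (objective(xi + eps * np.eye(d)[i]) - objective(xi - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))      # True
```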

Experiments

Small-Sized Tabular Datasets

In this experiment we compare methods on small-sized tabular datasets, most of which have 500 samples or fewer.

Methods Compared. We compare Hopular, XGBoost, CatBoost, LightGBM, NPTs, and 24 other machine learning methods as described in Wainberg et al., 2016, and Klambauer et al., 2017. The compared methods include 10 Deep Learning (DL) approaches.

Datasets. Following Klambauer et al., 2017, we consider UCI machine learning repository datasets with at most 1,000 samples as small. We select a subset of 21 datasets, comprising 200 to 1,000 samples, from Klambauer et al., 2017. Of these, 13 datasets have 500 samples or fewer.

Results. Across the considered UCI repository datasets Hopular has the lowest median rank. Therefore, Hopular is the best performing method.


Medium-Sized Tabular Datasets

In this experiment we compare methods on medium-sized tabular datasets with about 10,000 samples each.

Methods Compared. We compare Hopular, NPTs, XGBoost, CatBoost, and LightGBM.

Datasets. We select the datasets of Shwartz-Ziv and Armon, 2021, on which XGBoost performed better than Deep Learning methods that have been designed for tabular data. We extend this selection by two datasets for regression: (a) colleges has already been used for other Deep Learning methods for tabular data Somepalli et al., 2021, and (b) sulfur is publicly available and, with its 10,082 instances, fits well into the existing collection of medium-sized datasets.

Results. The next table gives the accuracy for the different datasets and methods. Hopular is the best performing method on 3 out of the 6 datasets. The runner-up method, CatBoost, is the best method twice, whereas XGBoost is the best method once. Over the 6 datasets, NPTs and XGBoost have a median rank of 4.5, CatBoost and LightGBM have median ranks of 2.5 and 2, respectively, and Hopular has a median rank of 1.5. On average over all 6 datasets, Hopular performs better than NPTs, XGBoost, CatBoost, and LightGBM.

[Table: accuracy of Hopular, NPTs, XGBoost, CatBoost, and LightGBM on the six medium-sized datasets]

Recap: Modern Hopfield networks

The associative memory of our choice is the modern Hopfield network for Deep Learning architectures because of its fast retrieval and high storage capacity, as shown in Hopfield Networks is All You Need. The update mechanism of these modern Hopfield networks is equivalent to the self-attention mechanism of Transformer networks. However, modern Hopfield networks for Deep Learning architectures are more general and have a broader functionality, of which the Transformer self-attention is just one example. The corresponding Hopfield layers can be built into Deep Learning architectures for associating two sets, encoder-decoder attention, multiple instance learning, or averaging and pooling operations. For details, see our blog Hopfield Networks is All You Need.

Modern Hopfield networks for Deep Learning architectures Ramsauer et al., 2021; Widrich et al., 2020 are associative memories that have much higher storage capacity than classical Hopfield networks and can retrieve patterns with one update only.
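
As a toy illustration of such a one-step retrieval (random patterns and an illustratively large \(\beta\), both assumptions), a single update \(\Bxi^{\mathrm{new}} = \BX \ \soft(\beta \ \BX^{T} \Bxi)\) recovers a stored pattern from a noisy query.

```python
import numpy as np

# Toy illustration of one-step retrieval with a continuous modern Hopfield
# network: xi_new = X softmax(beta * X^T xi). Patterns and beta are illustrative.

rng = np.random.default_rng(6)
d, M, beta = 32, 10, 8.0

X  = rng.normal(size=(d, M))                    # stored patterns, one per column
xi = X[:, 3] + 0.1 * rng.normal(size=d)         # noisy query close to stored pattern 3

a = beta * X.T @ xi                             # similarities to all stored patterns
p = np.exp(a - a.max()); p /= p.sum()           # softmax over the stored patterns
xi_new = X @ p                                  # a single update step

print(np.argmax(p), np.linalg.norm(xi_new - X[:, 3]))   # pattern 3 is retrieved almost exactly
```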

Code and Paper

Additional Material

For more information visit our homepage https://ml-jku.github.io/.

Correspondence

This blog post was written by Bernhard Schäfl and Lukas Gruber.

Contributions by Angela Bitto-Nemling and Sepp Hochreiter.

Please contact us via schaefl[at]ml.jku.at