A transformer block consisting of a multi-head self-attention mechanism followed by a feed-forward network.

This is used in LearnerTorchFTTransformer.

Usage

nn_ft_transformer_block(
  d_token,
  attention_n_heads,
  attention_dropout,
  attention_initialization,
  ffn_d_hidden = NULL,
  ffn_d_hidden_multiplier = NULL,
  ffn_dropout,
  ffn_activation,
  residual_dropout,
  prenormalization,
  is_first_layer,
  attention_normalization,
  ffn_normalization,
  query_idx = NULL,
  attention_bias,
  ffn_bias_first,
  ffn_bias_second
)

Arguments

d_token

(integer(1))
The dimension of the embedding.

attention_n_heads

(integer(1))
Number of attention heads.

attention_dropout

(numeric(1))
Dropout probability in the attention mechanism.

attention_initialization

(character(1))
Initialization method for attention weights. Either "kaiming" or "xavier".

ffn_d_hidden

(integer(1))
Hidden dimension of the feed-forward network. Multiplied by 2 if using ReGLU or GeGLU activation.

ffn_d_hidden_multiplier

(numeric(1))
Alternative way to specify the hidden dimension of the feed-forward network as d_token * ffn_d_hidden_multiplier. This value is also multiplied by 2 if using the ReGLU or GeGLU activation (see the sketch after this argument list).

ffn_dropout

(numeric(1))
Dropout probability in the feed-forward network.

ffn_activation

(nn_module)
Activation function for the feed-forward network. Default value is nn_reglu.

residual_dropout

(numeric(1))
Dropout probability for residual connections.

prenormalization

(logical(1))
Whether to apply normalization before the attention and FFN sub-layers (TRUE) or after them (FALSE).

is_first_layer

(logical(1))
Whether this is the first layer in the transformer stack. Default value is FALSE.

attention_normalization

(nn_module)
Normalization module to use for attention. Default value is nn_layer_norm.

ffn_normalization

(nn_module)
Normalization module to use for the feed-forward network. Default value is nn_layer_norm.

query_idx

(integer() or NULL)
Indices of the tensor to apply attention to. Should not be set manually. If NULL, attention is applied to the entire tensor. In the last block of a transformer stack, this is set to -1 so that attention is applied only to the embedding of the CLS token.

attention_bias

(logical(1))
Whether the attention mechanism has a bias. Default is TRUE.

ffn_bias_first

(logical(1))
Whether the first layer in the FFN has a bias. Default is TRUE.

ffn_bias_second

(logical(1))
Whether the second layer in the FFN has a bias. Default is TRUE.
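
As a rough illustration of how the hidden dimension of the feed-forward network is resolved from ffn_d_hidden_multiplier, here is a minimal sketch of the documented arithmetic (not the package's internal code; all values are illustrative):

# Either ffn_d_hidden is given directly, or it is derived from
# d_token * ffn_d_hidden_multiplier. GLU-style activations (ReGLU, GeGLU)
# double the width of the first linear layer, because its output is split
# into a value half and a gate half.
d_token <- 192
ffn_d_hidden_multiplier <- 2
ffn_d_hidden <- d_token * ffn_d_hidden_multiplier  # 384
first_linear_width <- 2 * ffn_d_hidden             # 768 with ReGLU/GeGLU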

References

Devlin J, Chang M, Lee K, Toutanova K (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.

Gorishniy Y, Rubachev I, Khrulkov V, Babenko A (2021). “Revisiting Deep Learning Models for Tabular Data.” arXiv, 2106.11959.
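
Examples

A minimal construction sketch based on the signature above, assuming the mlr3torch and torch packages are attached (inferred from the reference to LearnerTorchFTTransformer) and that the module's forward pass accepts a tensor of shape (batch, n_tokens, d_token). The tensor shapes and hyperparameter values are illustrative only.

library(mlr3torch)
library(torch)

block <- nn_ft_transformer_block(
  d_token = 32,
  attention_n_heads = 4,
  attention_dropout = 0.1,
  attention_initialization = "kaiming",
  ffn_d_hidden = 64,
  ffn_dropout = 0.1,
  ffn_activation = nn_reglu,
  residual_dropout = 0.0,
  prenormalization = TRUE,
  is_first_layer = TRUE,
  attention_normalization = nn_layer_norm,
  ffn_normalization = nn_layer_norm,
  attention_bias = TRUE,
  ffn_bias_first = TRUE,
  ffn_bias_second = TRUE
)

# Assumed input shape: (batch = 16, n_tokens = 10, d_token = 32).
x <- torch_randn(16, 10, 32)
out <- block(x)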