BART configuration parameters overview

This post is related to:

  1. Techniques for handling context in LLMs

Parameter list and descriptions

Each entry gives the parameter name, a short description, and its data type with a typical value.

max_position_embeddings: Maximum sequence length the model supports (the size of the learned position embeddings). Integer, e.g., 1024.
d_model: Dimensionality of the model's embeddings and hidden states. Integer, e.g., 768, 1024.
encoder_layers: Number of layers in the encoder. Integer, e.g., 6, 12.
decoder_layers: Number of layers in the decoder. Integer, e.g., 6, 12.
encoder_attention_heads: Number of attention heads in each encoder layer. Integer, e.g., 12, 16.
decoder_attention_heads: Number of attention heads in each decoder layer. Integer, e.g., 12, 16.
vocab_size: Vocabulary size of the tokenizer; defines the range of valid token IDs. Integer, e.g., 50265.
activation_function: Activation function used in the feedforward layers. String: gelu (default), relu, silu, gelu_new.
dropout: Dropout probability applied to embeddings and fully connected layers. Float between 0.0 and 1.0, typically 0.1.
attention_dropout: Dropout probability applied to the attention weights. Float between 0.0 and 1.0, typically 0.1.
init_std: Standard deviation for weight initialization. Float, e.g., 0.02.
encoder_ffn_dim: Dimensionality of the encoder feedforward layers. Integer, e.g., 3072, 4096.
decoder_ffn_dim: Dimensionality of the decoder feedforward layers. Integer, e.g., 3072, 4096.
scale_embedding: Whether to scale the embeddings by sqrt(d_model). Boolean, true or false.
use_cache: Whether to cache key/value states for faster autoregressive decoding. Boolean, true or false.
pad_token_id: Token ID used for padding. Integer, 1 for BART.
bos_token_id: Token ID for the beginning-of-sequence token. Integer, 0 for BART.
eos_token_id: Token ID for the end-of-sequence token. Integer, 2 for BART.
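
These parameters map directly onto keyword arguments of Hugging Face's BartConfig. Below is a minimal sketch, assuming the transformers library is installed; the example values roughly mirror the facebook/bart-base checkpoint.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library is installed.
# The keyword arguments correspond one-to-one to the parameters listed above;
# the values roughly mirror the facebook/bart-base checkpoint.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=50265,
    max_position_embeddings=1024,
    d_model=768,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    activation_function="gelu",
    dropout=0.1,
    attention_dropout=0.1,
    init_std=0.02,
    scale_embedding=False,
    use_cache=True,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
)

# Build a randomly initialized BART model from this configuration.
model = BartForConditionalGeneration(config)
print(config)
```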

Summary of parameter impact

How Changes Affect Model Behavior:

  1. Model Complexity:

    • Increasing encoder_layers, decoder_layers, d_model, or the number of attention heads increases model capacity, but also raises compute and memory requirements (see the parameter-count sketch after this list).
  2. Regularization:

    • Dropout parameters (dropout, attention_dropout) help control overfitting, but values that are too high can degrade performance.
  3. Feedforward capacity:

    • encoder_ffn_dim and decoder_ffn_dim set the width of the feedforward sub-layers and therefore the model's capacity to learn complex patterns; larger values increase both capacity and memory use.
  4. Efficiency:

    • Enabling use_cache speeds up autoregressive decoding by reusing previously computed key/value states (see the timing sketch below).
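
To make the model-complexity point concrete, here is a rough sketch, assuming transformers and torch are available, that compares parameter counts for two configurations differing in depth and width (num_params is a local helper, not a library function).

```python
# A rough sketch (assuming `transformers` and `torch` are installed) of how depth and
# width change model size: two randomly initialized BART models are built from configs
# that differ in encoder/decoder layers, d_model, attention heads, and feedforward width.
from transformers import BartConfig, BartModel

def num_params(config: BartConfig) -> int:
    """Count parameters of a randomly initialized model (local helper, not a library call)."""
    model = BartModel(config)
    return sum(p.numel() for p in model.parameters())

small = BartConfig(d_model=512, encoder_layers=4, decoder_layers=4,
                   encoder_attention_heads=8, decoder_attention_heads=8,
                   encoder_ffn_dim=2048, decoder_ffn_dim=2048)
base = BartConfig(d_model=768, encoder_layers=6, decoder_layers=6,
                  encoder_attention_heads=12, decoder_attention_heads=12,
                  encoder_ffn_dim=3072, decoder_ffn_dim=3072)

print(f"small config: {num_params(small) / 1e6:.1f}M parameters")
print(f"base config:  {num_params(base) / 1e6:.1f}M parameters")
```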

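For the efficiency point, the following sketch times generation with and without the key/value cache. It assumes transformers and access to the public facebook/bart-base checkpoint; absolute timings depend on hardware.

```python
# A sketch of the use_cache effect on decoding speed. Assumes `transformers`, `torch`,
# and network access to the public facebook/bart-base checkpoint.
import time
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
inputs = tokenizer("BART is a denoising sequence-to-sequence model.", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    # With use_cache=True, previously computed key/value states are reused at each
    # decoding step instead of being recomputed for the whole generated prefix.
    model.generate(**inputs, max_new_tokens=64, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```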