BERT configuration parameters overview

This post is related to:

  1. Techniques for handling context in LLM models
  2. GPT2 configuration parameters overview
  3. BART configuration parameters overview

Parameter list and descriptions

| Parameter | Description | Data type / options |
|---|---|---|
| hidden_size | Dimensionality of the hidden states and embeddings. | Integer, e.g., 768 or 1024. |
| num_hidden_layers | Number of hidden layers in the transformer encoder. | Integer, e.g., 12 or 24. |
| num_attention_heads | Number of attention heads per transformer layer. | Integer, must divide hidden_size evenly. |
| vocab_size | Vocabulary size of the tokenizer; defines the range of valid token IDs. | Integer, e.g., 30522. |
| intermediate_size | Dimensionality of the intermediate (feed-forward) layers. | Integer, e.g., 3072. |
| hidden_dropout_prob | Dropout probability for the fully connected layers in the encoder. | Float between 0.0 and 1.0, typically 0.1. |
| attention_probs_dropout_prob | Dropout probability applied to the attention weights. | Float between 0.0 and 1.0, typically 0.1. |
| max_position_embeddings | Maximum number of token positions, i.e., the maximum supported sequence length. | Integer, e.g., 512. |
| type_vocab_size | Size of the token type vocabulary used for segment embeddings. | Integer, typically 2. |
| initializer_range | Standard deviation for weight initialization. | Float, e.g., 0.02. |
| layer_norm_eps | Small value added for numerical stability in layer normalization. | Float, e.g., 1e-12 (the BERT default). |
| output_hidden_states | Whether to output the hidden states of every layer. | Boolean, true or false. |
| output_attentions | Whether to output the attention weights. | Boolean, true or false. |
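
To show how these parameters fit together, the sketch below builds a BERT encoder from an explicit configuration and requests hidden states and attention weights in the forward pass. It assumes the Hugging Face transformers library and PyTorch; the post itself does not prescribe a framework.

```python
# Sketch: constructing a BERT model from an explicit configuration
# (assumes the Hugging Face `transformers` and `torch` packages).
import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,                  # range of valid token IDs
    hidden_size=768,                   # embedding / hidden-state width
    num_hidden_layers=12,              # encoder depth
    num_attention_heads=12,            # must divide hidden_size evenly
    intermediate_size=3072,            # feed-forward layer width
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,       # longest supported input
    type_vocab_size=2,                 # segment A / segment B embeddings
    initializer_range=0.02,
    layer_norm_eps=1e-12,
)

model = BertModel(config)  # randomly initialized, not pretrained

# Dummy batch of token IDs; the output_* flags can also be set in the config.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
outputs = model(input_ids, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))    # num_hidden_layers + 1 (embedding output included) -> 13
print(outputs.attentions[0].shape)   # (batch, heads, seq_len, seq_len) -> (1, 12, 16, 16)
```

Setting output_hidden_states and output_attentions at call time keeps the default forward pass lightweight; setting them in the config instead makes every forward pass return them.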

Summary of parameter impact

How changes to these parameters affect model behavior:

  1. Capacity:

    • Increasing hidden_size, num_hidden_layers, or num_attention_heads allows the model to capture more complex patterns but increases resource usage (see the parameter-count sketch after this list).
  2. Regularization:

    • Dropout probabilities (hidden_dropout_prob, attention_probs_dropout_prob) control overfitting risks but can hinder learning if set too high.
  3. Pretraining vs. Fine-tuning:

    • type_vocab_size is essential for tasks requiring segment embeddings (e.g., sentence pairs); the sentence-pair tokenization sketch after this list shows how these embeddings are indexed.
  4. Stability and Efficiency:

    • layer_norm_eps ensures stable training, while initializer_range affects convergence.
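
To make the capacity point concrete, the sketch below (again assuming Hugging Face transformers) compares the parameter count of a BERT-base-sized configuration with a slimmed-down variant. num_hidden_layers scales the encoder roughly linearly, while hidden_size enters quadratically through the weight matrices; the head count only changes how the hidden dimension is split across heads.

```python
# Sketch: how hidden_size and num_hidden_layers drive model capacity
# (assumes the Hugging Face `transformers` package; numbers are illustrative).
from transformers import BertConfig, BertModel

def count_parameters(hidden_size, num_hidden_layers, num_attention_heads):
    config = BertConfig(
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
        num_attention_heads=num_attention_heads,
        intermediate_size=4 * hidden_size,  # conventional 4x ratio
    )
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

base = count_parameters(768, 12, 12)   # BERT-base-like, roughly 110M parameters
small = count_parameters(512, 4, 8)    # a much smaller encoder

print(f"base-like: {base / 1e6:.1f}M parameters")
print(f"small:     {small / 1e6:.1f}M parameters")
```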

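For the segment-embedding point, a quick way to see type_vocab_size in action is to tokenize a sentence pair and inspect the token_type_ids, which index into the segment-embedding table. This is a sketch assuming the Hugging Face bert-base-uncased tokenizer; the example sentences are placeholders.

```python
# Sketch: token_type_ids for a sentence pair index into the segment-embedding
# table, whose size is type_vocab_size (2 for standard BERT).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # downloads the vocab

encoded = tokenizer("How is the weather?", "It is sunny today.")
print(encoded["token_type_ids"])
# 0s for segment A (including [CLS] and its [SEP]), 1s for segment B
```
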
References

Challenges and reports on configuration

  • Report: “Revealing the Dark Secrets of BERT” (Kovaleva et al., 2019).
  • Challenges: Over-parameterization and inefficiency in fine-tuning for domain-specific tasks.

Playgrounds to experiment with