GPT-2 configuration parameters overview

This post is related to:

  1. Techniques for handling context in LLM models
  2. BART configuration parameters overview

Parameter List and Descriptions

| Parameter | Description | Data Type / Options |
| --- | --- | --- |
| n_ctx | The context size, i.e., the maximum number of input tokens the model can process. | Integer, e.g., 1024, 2048. |
| n_embd | Dimensionality of the embeddings; determines the size of the token and positional embeddings. | Integer, typically 768, 1024, or 1280. |
| n_layer | Number of transformer layers in the model; dictates the depth of the network. | Integer, e.g., 12, 24, 48. |
| n_head | Number of attention heads in each transformer layer; reflects the number of parallel attention mechanisms. | Integer, typically a divisor of n_embd such as 12 or 16. |
| vocab_size | Size of the tokenizer vocabulary; the range of token IDs the model can handle. | Integer, e.g., 50257. |
| activation_function | The activation function used in the feedforward layers; affects the model's non-linearity. | String: gelu_new (GPT-2's default), gelu, relu, silu, or tanh. |
| resid_pdrop | Dropout probability for residual connections; adds regularization. | Float between 0.0 and 1.0, typically 0.1. |
| embd_pdrop | Dropout probability for the embeddings; helps prevent overfitting. | Float between 0.0 and 1.0, typically 0.1. |
| attn_pdrop | Dropout probability in the attention mechanism; regularizes the attention weights. | Float between 0.0 and 1.0, typically 0.1. |
| initializer_range | Standard deviation of the normal distribution used for weight initialization. | Float, e.g., 0.02. |
| layer_norm_epsilon | A small constant added for numerical stability in layer normalization. | Float, typically 1e-5. |
| use_cache | Whether the model caches key/value tensors for faster autoregressive generation during inference. | Boolean, true or false. |
| bos_token_id | Token ID of the beginning-of-sequence token. | Integer, e.g., 50256. |
| eos_token_id | Token ID of the end-of-sequence token. | Integer, e.g., 50256. |
| scale_attn_weights | Whether to scale the attention scores by the inverse square root of the key dimension. | Boolean, true or false. |
| output_hidden_states | Whether to return the hidden states of every layer during the forward pass. | Boolean, true or false. |
| output_attentions | Whether to return the attention weights during the forward pass. | Boolean, true or false. |
| tie_word_embeddings | Whether to tie the input and output word embeddings. | Boolean, true or false. |

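As a concrete illustration, the sketch below builds a GPT-2-small-sized configuration from the parameters above. It assumes the Hugging Face transformers library is installed; note that GPT2Config exposes the context size as n_positions (the original GPT-2 release calls it n_ctx).

```python
# Minimal sketch, assuming the Hugging Face `transformers` library is installed.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,                # tokenizer vocabulary size
    n_positions=1024,                # maximum context length (n_ctx in the original GPT-2 config files)
    n_embd=768,                      # embedding dimensionality
    n_layer=12,                      # number of transformer blocks
    n_head=12,                       # attention heads per block (must divide n_embd)
    activation_function="gelu_new",  # GPT-2's default feedforward activation
    resid_pdrop=0.1,                 # dropout on residual connections
    embd_pdrop=0.1,                  # dropout on embeddings
    attn_pdrop=0.1,                  # dropout on attention probabilities
    initializer_range=0.02,          # std of the normal weight initializer
    layer_norm_epsilon=1e-5,         # numerical-stability constant in layer norm
    use_cache=True,                  # cache key/value tensors during generation
    bos_token_id=50256,
    eos_token_id=50256,
)

model = GPT2LMHeadModel(config)  # randomly initialized model with this shape
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 124M for these GPT-2-small values
```
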
Summary of Parameter Impact

How changes to these parameters affect model behavior:

  1. Model Size and Computation:

    • Increasing n_layer, n_embd, or n_head leads to larger and more computationally intensive models but potentially improves learning capacity (see the first sketch after this list).
    • Reducing n_ctx limits the model’s ability to process long inputs.
  2. Regularization:

    • Dropout parameters (resid_pdrop, embd_pdrop, attn_pdrop) mitigate overfitting but may hinder performance if too high.
  3. Non-linearity:

    • The choice of activation_function (e.g., gelu vs. relu) affects gradient behavior and optimization efficiency.
  4. Stability:

    • The small layer_norm_epsilon constant keeps layer normalization numerically stable by preventing division by a near-zero variance; it occasionally needs adjusting depending on the architecture and numeric precision.
  5. Flexibility:

    • Enabling output_hidden_states or output_attentions increases interpretability but may slow inference (see the second sketch after this list).
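
To make the scaling trade-off in item 1 concrete, the back-of-the-envelope sketch below estimates the total parameter count from n_layer, n_embd, n_ctx, and vocab_size, assuming tied input/output embeddings and the standard GPT-2 block layout (the 12 * n_embd^2 term per block covers the attention and feedforward weight matrices).

```python
def gpt2_param_count(n_layer, n_embd, n_ctx=1024, vocab_size=50257):
    """Approximate GPT-2 parameter count, assuming tied input/output embeddings."""
    embeddings = (vocab_size + n_ctx) * n_embd             # token + positional embedding tables
    per_block = 12 * n_embd**2 + 13 * n_embd               # attention + feedforward weights, biases, 2 layer norms
    return embeddings + n_layer * per_block + 2 * n_embd   # plus the final layer norm

# GPT-2 small / medium / large shaped configurations
for n_layer, n_embd in [(12, 768), (24, 1024), (36, 1280)]:
    millions = gpt2_param_count(n_layer, n_embd) / 1e6
    print(f"n_layer={n_layer}, n_embd={n_embd}: ~{millions:.0f}M parameters")
# -> roughly 124M, 355M, and 774M
```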

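To illustrate item 5, the short sketch below (assuming transformers and torch are installed and the public gpt2 checkpoint can be downloaded) runs one forward pass with output_hidden_states and output_attentions enabled and inspects the extra tensors that come back; expect it to be slower and more memory-hungry than a plain forward pass.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Configuration shapes behaviour.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))       # n_layer + 1 tensors: embeddings plus one per block
print(outputs.hidden_states[-1].shape)  # (batch, sequence_length, n_embd)
print(outputs.attentions[0].shape)      # (batch, n_head, sequence_length, sequence_length)
```
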
References


Challenges and reports on configuration

  • Report: “Language Models are Few-Shot Learners” (Brown et al., 2020) discusses scalability challenges in transformer-based architectures.
  • Challenges: Balancing model depth and breadth while maintaining computational efficiency.

Playgrounds to experiment