GPT-2 configuration parameters overview

This post is related to:

  1. Techniques for handling context in LLM models
  2. BART configuration parameters overview

Parameter List and Descriptions

| Parameter | Description | Data Type / Options |
| --- | --- | --- |
| n_ctx | The context size, i.e., the maximum number of input tokens the model can process. | Integer, e.g., 1024, 2048. |
| n_embd | Dimensionality of the embeddings; determines the size of the token and positional embeddings. | Integer, typically 768, 1024, or 1280. |
| n_layer | Number of transformer layers in the model; dictates the depth of the network. | Integer, e.g., 12, 24, 48. |
| n_head | Number of attention heads in each transformer layer; reflects the number of parallel attention mechanisms. | Integer, typically a divisor of n_embd such as 12 or 16. |
| vocab_size | Size of the tokenizer vocabulary; the range of token IDs the model can handle. | Integer, e.g., 50257. |
| activation_function | The activation function used in the feedforward layers; affects the model's non-linearity. | String: gelu_new (GPT-2's default), gelu, relu, silu, or tanh. |
| resid_pdrop | Dropout probability for residual connections; adds regularization. | Float between 0.0 and 1.0, typically 0.1. |
| embd_pdrop | Dropout probability for the embeddings; helps prevent overfitting. | Float between 0.0 and 1.0, typically 0.1. |
| attn_pdrop | Dropout probability in the attention mechanism; regularizes the attention weights. | Float between 0.0 and 1.0, typically 0.1. |
| initializer_range | Standard deviation of the normal distribution used for weight initialization. | Float, e.g., 0.02. |
| layer_norm_epsilon | A small constant added for numerical stability in layer normalization. | Float, typically 1e-5. |
| use_cache | Whether the model caches key/value tensors for faster autoregressive generation during inference. | Boolean, true or false. |
| bos_token_id | Token ID of the beginning-of-sequence token. | Integer, e.g., 50256. |
| eos_token_id | Token ID of the end-of-sequence token. | Integer, e.g., 50256. |
| scale_attn_weights | Whether to scale the attention scores by the inverse square root of the key dimension. | Boolean, true or false. |
| output_hidden_states | Whether to return the hidden states of every layer during the forward pass. | Boolean, true or false. |
| output_attentions | Whether to return the attention weights during the forward pass. | Boolean, true or false. |
| tie_word_embeddings | Whether to tie the input and output word embeddings. | Boolean, true or false. |

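As a concrete illustration, the sketch below builds a GPT-2-small-sized configuration from the parameters above. It assumes the Hugging Face transformers library is installed; note that GPT2Config exposes the context size as n_positions (the original GPT-2 release calls it n_ctx).

```python
# Minimal sketch, assuming the Hugging Face `transformers` library is installed.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,                # tokenizer vocabulary size
    n_positions=1024,                # maximum context length (n_ctx in the original GPT-2 config files)
    n_embd=768,                      # embedding dimensionality
    n_layer=12,                      # number of transformer blocks
    n_head=12,                       # attention heads per block (must divide n_embd)
    activation_function="gelu_new",  # GPT-2's default feedforward activation
    resid_pdrop=0.1,                 # dropout on residual connections
    embd_pdrop=0.1,                  # dropout on embeddings
    attn_pdrop=0.1,                  # dropout on attention probabilities
    initializer_range=0.02,          # std of the normal weight initializer
    layer_norm_epsilon=1e-5,         # numerical-stability constant in layer norm
    use_cache=True,                  # cache key/value tensors during generation
    bos_token_id=50256,
    eos_token_id=50256,
)

model = GPT2LMHeadModel(config)  # randomly initialized model with this shape
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 124M for these GPT-2-small values
```
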
Summary of Parameter Impact

How changes to these parameters affect model behavior:

  1. Model Size and Computation:

    • Increasing n_layer, n_embd, or n_head leads to larger and more computationally intensive models but potentially improves learning capacity (see the first sketch after this list).
    • Reducing n_ctx limits the model’s ability to process long inputs.
  2. Regularization:

    • Dropout parameters (resid_pdrop, embd_pdrop, attn_pdrop) mitigate overfitting but may hinder performance if too high.
  3. Non-linearity:

    • The choice of activation_function (e.g., gelu vs. relu) affects gradient behavior and optimization efficiency.
  4. Stability:

    • The small layer_norm_epsilon constant keeps layer normalization numerically stable by preventing division by a near-zero variance; it occasionally needs adjusting depending on the architecture and numeric precision.
  5. Flexibility:

    • Enabling output_hidden_states or output_attentions increases interpretability but may slow inference (see the second sketch after this list).
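
To make the scaling trade-off in item 1 concrete, the back-of-the-envelope sketch below estimates the total parameter count from n_layer, n_embd, n_ctx, and vocab_size, assuming tied input/output embeddings and the standard GPT-2 block layout (the 12 * n_embd^2 term per block covers the attention and feedforward weight matrices).

```python
def gpt2_param_count(n_layer, n_embd, n_ctx=1024, vocab_size=50257):
    """Approximate GPT-2 parameter count, assuming tied input/output embeddings."""
    embeddings = (vocab_size + n_ctx) * n_embd             # token + positional embedding tables
    per_block = 12 * n_embd**2 + 13 * n_embd               # attention + feedforward weights, biases, 2 layer norms
    return embeddings + n_layer * per_block + 2 * n_embd   # plus the final layer norm

# GPT-2 small / medium / large shaped configurations
for n_layer, n_embd in [(12, 768), (24, 1024), (36, 1280)]:
    millions = gpt2_param_count(n_layer, n_embd) / 1e6
    print(f"n_layer={n_layer}, n_embd={n_embd}: ~{millions:.0f}M parameters")
# -> roughly 124M, 355M, and 774M
```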

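To illustrate item 5, the short sketch below (assuming transformers and torch are installed and the public gpt2 checkpoint can be downloaded) runs one forward pass with output_hidden_states and output_attentions enabled and inspects the extra tensors that come back; expect it to be slower and more memory-hungry than a plain forward pass.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Configuration shapes behaviour.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))       # n_layer + 1 tensors: embeddings plus one per block
print(outputs.hidden_states[-1].shape)  # (batch, sequence_length, n_embd)
print(outputs.attentions[0].shape)      # (batch, n_head, sequence_length, sequence_length)
```
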
References


Challenges and reports on configuration

  • Report: “Language Models are Few-Shot Learners” (Brown et al., 2020) discusses scalability challenges in transformer-based architectures.
  • Challenges: Balancing model depth and breadth while maintaining computational efficiency.

Playgrounds to experiment