Parameter List and Descriptions
Parameter | Description | Data Type/Options |
---|---|---|
n_ctx | The context size, i.e., the maximum number of input tokens the model can process. | Integer, e.g., 1024, 2048. |
n_embd | Dimensionality of the embeddings. Determines the size of the word and positional embeddings. | Integer, typically 768, 1024, or 1280. |
n_layer | Number of transformer layers in the model. Dictates the depth of the network. | Integer, e.g., 12, 24, 48. |
n_head | Number of attention heads in each transformer layer. Reflects parallel attention mechanisms. | Integer, typically a divisor of n_embd, such as 12 or 16. |
vocab_size | Size of the tokenizer vocabulary. Defines the range of token IDs the model can handle. | Integer, e.g., 50257. |
activation_function | The activation function used in the feedforward layers (e.g., "gelu"). Determines the non-linearity. | String: relu, gelu, tanh, sigmoid, etc. |
resid_pdrop | Dropout probability for residual connections. Adds regularization. | Float between 0.0 and 1.0. Typically 0.1. |
embd_pdrop | Dropout probability for embeddings. Helps prevent overfitting. | Float between 0.0 and 1.0. Typically 0.1. |
attn_pdrop | Dropout probability in the attention mechanism. Improves robustness. | Float between 0.0 and 1.0. Typically 0.1. |
initializer_range | The standard deviation of the normal distribution used for weight initialization. | Float, e.g., 0.02. |
layer_norm_epsilon | A small value added for numerical stability in layer normalization. | Float, typically 1e-5. |
use_cache | Whether the model caches key/value pairs for faster generation during inference. | Boolean, true or false. |
bos_token_id | Token ID of the beginning-of-sequence token. | Integer, e.g., 50256. |
eos_token_id | Token ID of the end-of-sequence token. | Integer, e.g., 50256. |
scale_attn_weights | Whether to scale the attention scores by the inverse square root of the per-head dimension. | Boolean, true or false. |
output_hidden_states | Whether to return the hidden states of every layer during inference. | Boolean, true or false. |
output_attentions | Whether to return the attention weights during inference. | Boolean, true or false. |
tie_word_embeddings | Whether to tie the input and output word embeddings. | Boolean, true or false. |
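These parameters map directly onto a model configuration object. The sketch below assumes the Hugging Face transformers library and its GPT2Config / GPT2LMHeadModel classes, which expose the fields above (the context size surfaces as n_positions in current releases); the values are illustrative, not recommendations.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative GPT-2-small-style configuration built from the parameters above.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,            # context size (n_ctx in older transformers releases)
    n_embd=768,                  # embedding dimensionality
    n_layer=12,                  # number of transformer layers
    n_head=12,                   # attention heads per layer; must divide n_embd
    activation_function="gelu_new",
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
    initializer_range=0.02,
    layer_norm_epsilon=1e-5,
    use_cache=True,
    bos_token_id=50256,
    eos_token_id=50256,
    scale_attn_weights=True,
    output_hidden_states=False,
    output_attentions=False,
    tie_word_embeddings=True,
)

# Randomly initialized model built from this configuration.
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```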
Summary of Parameter Impact
How changes to these parameters affect model behavior:
Model Size and Computation:
- Increasing n_layer, n_embd, or n_head leads to larger and more computationally intensive models but potentially improves learning capacity.
- Reducing n_ctx limits the model's ability to process long inputs.
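To make the size/compute trade-off concrete, here is a minimal sketch (assuming the same Hugging Face classes as above) that compares parameter counts for two illustrative layer/width settings.

```python
from transformers import GPT2Config, GPT2LMHeadModel

def count_params(n_layer: int, n_embd: int, n_head: int) -> int:
    """Build a randomly initialized GPT-2-style model and return its parameter count."""
    config = GPT2Config(n_layer=n_layer, n_embd=n_embd, n_head=n_head)
    return GPT2LMHeadModel(config).num_parameters()

# GPT-2 "small"-like vs. "medium"-like shapes: deeper and wider means many more parameters.
print(f"{count_params(n_layer=12, n_embd=768,  n_head=12):,}")   # roughly 124M
print(f"{count_params(n_layer=24, n_embd=1024, n_head=16):,}")   # roughly 355M
```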
Regularization:
- Dropout parameters (resid_pdrop, embd_pdrop, attn_pdrop) mitigate overfitting but may hinder performance if set too high.
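As a small illustration of how these dropouts behave (a sketch assuming the Hugging Face classes above): dropout is active only in training mode, so two forward passes differ under model.train() and agree under model.eval().

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Heavier-than-default dropout, purely for illustration.
config = GPT2Config(resid_pdrop=0.2, embd_pdrop=0.2, attn_pdrop=0.2)
model = GPT2LMHeadModel(config)
input_ids = torch.randint(0, config.vocab_size, (1, 16))

model.train()                                  # dropout active
a = model(input_ids).logits
b = model(input_ids).logits
print(torch.allclose(a, b))                    # almost certainly False

model.eval()                                   # dropout disabled
with torch.no_grad():
    c = model(input_ids).logits
    d = model(input_ids).logits
print(torch.allclose(c, d))                    # True
```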
Non-linearity:
- The choice of activation_function (e.g., gelu vs. relu) affects gradient behavior and optimization efficiency.
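A short PyTorch sketch (my own illustration, independent of any particular model) of where gelu and relu differ: relu passes no gradient for negative inputs, while gelu still passes a small one.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7, requires_grad=True)

print(F.relu(x))   # hard zero for negative inputs
print(F.gelu(x))   # smooth, slightly negative for small negative inputs

# Gradients: relu is exactly 0 on the negative side; gelu is small but non-zero,
# which tends to make optimization smoother.
(g_relu,) = torch.autograd.grad(F.relu(x).sum(), x)
(g_gelu,) = torch.autograd.grad(F.gelu(x).sum(), x)
print(g_relu)
print(g_gelu)
```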
Stability:
- Small layer_norm_epsilon values ensure numerical stability during normalization but may require tuning based on the architecture.
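A plain-PyTorch sketch of why the epsilon matters: with a constant activation vector the variance is exactly zero, and only the epsilon keeps the normalization from dividing zero by zero.

```python
import torch

x = torch.ones(1, 768)                      # constant input: variance is exactly 0

ln = torch.nn.LayerNorm(768, eps=1e-5)      # eps plays the role of layer_norm_epsilon
print(torch.isfinite(ln(x)).all())          # tensor(True): eps keeps the division well-defined

# Normalizing "by hand" with no epsilon divides zero by zero and yields NaNs.
manual = (x - x.mean(-1, keepdim=True)) / x.std(-1, keepdim=True, unbiased=False)
print(torch.isnan(manual).any())            # tensor(True)
```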
Flexibility:
- Enabling output_hidden_states or output_attentions increases interpretability but may slow inference.
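A brief sketch (again assuming the Hugging Face transformers API) of requesting hidden states and attention weights at inference time and inspecting what comes back.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=4, n_embd=256, n_head=4)   # small illustrative shape
model = GPT2LMHeadModel(config).eval()
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True, output_attentions=True)

# n_layer + 1 hidden states (embedding output plus one per layer), each (batch, seq, n_embd).
print(len(out.hidden_states), out.hidden_states[-1].shape)
# n_layer attention maps, each (batch, n_head, seq, seq).
print(len(out.attentions), out.attentions[0].shape)
```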
References
Challenges and reports on configuration
- Report: “Language Models are Few-Shot Learners” (Brown et al., 2020) discusses scalability challenges in transformer-based architectures.
- Challenges: Balancing model depth and breadth while maintaining computational efficiency.
Playgrounds to experiment
- OpenAI Playground: https://platform.openai.com/playground
- Hugging Face Spaces: https://huggingface.co/spaces
- Google Colab with Transformers: https://colab.research.google.com/