Parameter List and Descriptions
| Parameter | Description | Data Type/Options |
|---|---|---|
| n_ctx | The context size, i.e., the maximum number of input tokens the model can process. | Integer, e.g., 1024 or 2048. |
| n_embd | Dimensionality of the embeddings. Determines the size of the word and positional embeddings. | Integer, typically 768, 1024, or 1280. |
| n_layer | Number of transformer layers in the model. Dictates the depth of the network. | Integer, e.g., 12, 24, 48. |
| n_head | Number of attention heads in each transformer layer. Reflects parallel attention mechanisms. | Integer, typically a divisor of n_embd, such as 12 or 16. |
| vocab_size | Size of the tokenizer vocabulary. Represents the range of token IDs the model can handle. | Integer, e.g., 50257. |
| activation_function | The activation function used in the feed-forward layers (e.g., "gelu"). Determines the model's non-linearity. | String: relu, gelu, tanh, sigmoid, etc. |
| resid_pdrop | Dropout probability for residual connections. Adds regularization. | Float between 0.0 and 1.0, typically 0.1. |
| embd_pdrop | Dropout probability for embeddings. Helps prevent overfitting. | Float between 0.0 and 1.0, typically 0.1. |
| attn_pdrop | Dropout probability in the attention mechanism. Improves robustness. | Float between 0.0 and 1.0, typically 0.1. |
| initializer_range | The standard deviation of the distribution used for weight initialization. | Float, e.g., 0.02. |
| layer_norm_epsilon | A small value added for numerical stability in layer normalization. | Float, typically 1e-5. |
| use_cache | Whether the model caches key/value states for faster generation during inference. | Boolean, true or false. |
| bos_token_id | Token ID for the beginning-of-sequence token. | Integer, e.g., 50256. |
| eos_token_id | Token ID for the end-of-sequence token. | Integer, e.g., 50256. |
| scale_attn_weights | Whether to scale the attention weights. | Boolean, true or false. |
| output_hidden_states | Whether to return the hidden states of every layer during inference. | Boolean, true or false. |
| output_attentions | Whether to return the attention weights during inference. | Boolean, true or false. |
| tie_word_embeddings | Whether to tie the input and output word embeddings. | Boolean, true or false. |
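As a concrete reference, here is a minimal sketch that sets these fields on a config object, assuming Hugging Face's transformers library, whose GPT2Config uses the same names as the table above (recent versions expose the context length as n_positions rather than n_ctx):

```python
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=50257,                # size of the tokenizer vocabulary
    n_positions=1024,                # context length (called n_ctx in older versions)
    n_embd=768,                      # embedding dimensionality
    n_layer=12,                      # number of transformer layers
    n_head=12,                       # attention heads per layer (must divide n_embd)
    activation_function="gelu",      # non-linearity in the feed-forward blocks
    resid_pdrop=0.1,                 # dropout on residual connections
    embd_pdrop=0.1,                  # dropout on embeddings
    attn_pdrop=0.1,                  # dropout inside attention
    initializer_range=0.02,          # spread of the weight-initialization distribution
    layer_norm_epsilon=1e-5,         # numerical-stability term in layer normalization
    use_cache=True,                  # cache key/values for faster generation
    bos_token_id=50256,              # beginning-of-sequence token
    eos_token_id=50256,              # end-of-sequence token
    scale_attn_weights=True,         # scale attention scores
    tie_word_embeddings=True,        # share input and output embedding matrices
)
print(config)
```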
Summary of Parameter Impact
How changes to these parameters affect model behavior:
Model Size and Computation:
- Increasing n_layer, n_embd, or n_head leads to larger and more computationally intensive models but potentially improves learning capacity.
- Reducing n_ctx limits the model's ability to process long inputs.
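To make the size trade-off concrete, the sketch below (assuming Hugging Face's transformers GPT-2 classes, randomly initialized from the config so nothing is downloaded) compares the parameter counts of a narrower and a deeper/wider configuration:

```python
from transformers import GPT2Config, GPT2LMHeadModel

small = GPT2Config(n_layer=12, n_embd=768, n_head=12)   # GPT-2 "small"-like shape
wide = GPT2Config(n_layer=24, n_embd=1024, n_head=16)   # deeper and wider

for name, cfg in [("small", small), ("wide", wide)]:
    model = GPT2LMHeadModel(cfg)                         # random init, no download
    print(name, f"{model.num_parameters():,} parameters")
```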
Regularization:
- Dropout parameters (resid_pdrop, embd_pdrop, attn_pdrop) mitigate overfitting but may hinder performance if set too high.
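The sketch below (again assuming the transformers GPT-2 classes) shows that these three probabilities simply end up in ordinary dropout modules inside the network, and that they are only active in training mode:

```python
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

# A tiny two-layer model with a deliberately higher attention dropout.
config = GPT2Config(n_layer=2, resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.3)
model = GPT2LMHeadModel(config)

# Collect the dropout probabilities actually present in the network.
probs = sorted({m.p for m in model.modules() if isinstance(m, nn.Dropout)})
print("dropout probabilities in the network:", probs)   # e.g. [0.1, 0.3]

# Dropout is applied only in training mode; eval() disables it for inference.
model.eval()
```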
Non-linearity:
- The choice of activation_function (e.g., gelu vs. relu) affects gradient behavior and optimization efficiency.
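A quick way to see the difference (plain PyTorch, no model required): GELU is smooth near zero and lets a little negative signal through, whereas ReLU clips everything below zero, which changes how gradients flow.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu:", F.relu(x))   # negatives are zeroed out
print("gelu:", F.gelu(x))   # small negative inputs yield small negative outputs
```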
Stability:
- Small layer_norm_epsilon values ensure numerical stability during normalization but may require tuning based on the architecture.
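The epsilon sits under the square root of the variance, which is what prevents division by zero on near-constant inputs; a small PyTorch sketch of the computation:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 768)                            # a hidden state of width n_embd
eps = 1e-5                                         # layer_norm_epsilon

ln = torch.nn.LayerNorm(768, eps=eps)              # weight=1, bias=0 at init
manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
    x.var(-1, unbiased=False, keepdim=True) + eps  # eps keeps the denominator non-zero
)

print(torch.allclose(ln(x), manual, atol=1e-6))    # True
```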
Flexibility:
- Enabling output_hidden_states or output_attentions increases interpretability but may slow inference.
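The sketch below (assuming the transformers GPT-2 classes with a tiny randomly initialized model) shows what those two flags add to the forward-pass output:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=4, n_head=4, n_embd=128)     # tiny model for illustration
model = GPT2LMHeadModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 16)) # dummy batch of 16 tokens
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True, output_attentions=True)

print(len(out.hidden_states))    # n_layer + 1 (embeddings plus every layer) -> 5
print(out.attentions[0].shape)   # (batch, n_head, seq_len, seq_len) -> (1, 4, 16, 16)
```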
References
Challenges and reports on configuration
- Report: “Language Models are Few-Shot Learners” (Brown et al., 2020) discusses scalability challenges in transformer-based architectures.
- Challenges: Balancing model depth and breadth while maintaining computational efficiency.
Playgrounds to experiment with
- OpenAI Playground: https://platform.openai.com/playground
- Hugging Face Spaces: https://huggingface.co/spaces
- Google Colab with Transformers: https://colab.research.google.com/