This post is related to:
- Techniques for handling context in LLMs
- GPT2 configuration parameters overview
- BART configuration parameters overview
Parameter list and descriptions
| Parameter | Description | Data Type/Options |
|---|---|---|
| `hidden_size` | Dimensionality of the hidden states and embeddings. | Integer, e.g., `768`, `1024`. |
| `num_hidden_layers` | Number of hidden layers in the Transformer encoder. | Integer, e.g., `12`, `24`. |
| `num_attention_heads` | Number of attention heads per Transformer layer. | Integer; must divide `hidden_size` evenly. |
| `vocab_size` | Vocabulary size of the tokenizer, i.e., the range of valid token IDs. | Integer, e.g., `30522`. |
| `intermediate_size` | Dimensionality of the feed-forward layers. | Integer, e.g., `3072`. |
| `hidden_dropout_prob` | Dropout probability for the fully connected layers in the encoder. | Float between `0.0` and `1.0`, typically `0.1`. |
| `attention_probs_dropout_prob` | Dropout probability in the attention mechanism. | Float between `0.0` and `1.0`, typically `0.1`. |
| `max_position_embeddings` | Maximum number of positions for input tokens. | Integer, e.g., `512`. |
| `type_vocab_size` | Size of the token type vocabulary used for segment embeddings. | Integer, typically `2`. |
| `initializer_range` | Standard deviation for weight initialization. | Float, e.g., `0.02`. |
| `layer_norm_eps` | Small value added for numerical stability in layer normalization. | Float; BERT's default is `1e-12`. |
| `output_hidden_states` | Whether to output the hidden states of all layers. | Boolean, `true` or `false`. |
| `output_attentions` | Whether to output the attention weights. | Boolean, `true` or `false`. |
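To make the table concrete, here is a minimal sketch that spells these parameters out with the Hugging Face `transformers` library, using values in the BERT-base range; the exact numbers are illustrative, not a recommendation.

```python
from transformers import BertConfig, BertModel

# A BERT-base style configuration with every parameter from the table made explicit.
# The values shown are the usual BERT-base defaults and serve only as an example.
config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    output_hidden_states=False,
    output_attentions=False,
)

# Build a randomly initialized model from the configuration (no pretrained weights).
model = BertModel(config)
print(model.config)
```

Using `BertConfig.from_pretrained(...)` instead would fill the same fields from a published checkpoint rather than from hand-written values.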
Summary of parameter impact
How changes to these parameters affect model behavior:
Capacity:
- Increasing `hidden_size`, `num_hidden_layers`, or `num_attention_heads` allows the model to capture more complex patterns but increases resource usage (a parameter-count comparison is sketched below).
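As a rough way to see the capacity/resource trade-off, this sketch compares the parameter counts of a base-sized and a large-sized configuration; both models are randomly initialized, and the sizes are only illustrative.

```python
from transformers import BertConfig, BertModel

# Two illustrative configurations: roughly BERT-base vs. BERT-large sized.
base_cfg = BertConfig(hidden_size=768, num_hidden_layers=12,
                      num_attention_heads=12, intermediate_size=3072)
large_cfg = BertConfig(hidden_size=1024, num_hidden_layers=24,
                       num_attention_heads=16, intermediate_size=4096)

for name, cfg in [("base-sized", base_cfg), ("large-sized", large_cfg)]:
    model = BertModel(cfg)  # randomly initialized; nothing is downloaded
    print(f"{name}: {model.num_parameters() / 1e6:.1f}M parameters")
```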
Regularization:
- Dropout probabilities (`hidden_dropout_prob`, `attention_probs_dropout_prob`) control overfitting risk but can hinder learning if set too high (an example of overriding them is shown below).
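One way to adjust this is to override the dropout probabilities when loading a configuration; in the sketch below, the checkpoint name and the value `0.2` are just examples, not recommendations.

```python
from transformers import BertConfig

# Override both dropout probabilities on top of a published configuration.
# "bert-base-uncased" and 0.2 are illustrative choices only.
config = BertConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.2,
    attention_probs_dropout_prob=0.2,
)
print(config.hidden_dropout_prob, config.attention_probs_dropout_prob)  # 0.2 0.2
```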
Pretraining vs. Fine-tuning:
- `type_vocab_size` is essential for tasks requiring segment embeddings (e.g., sentence pairs); the example below shows the resulting `token_type_ids`.
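To see what the segment vocabulary is indexing, the sketch below tokenizes a sentence pair; the `token_type_ids` it prints are 0 for the first segment and 1 for the second, which is why `type_vocab_size` is typically 2. The checkpoint name is only an example.

```python
from transformers import AutoTokenizer

# Tokenize a sentence pair; the checkpoint name is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("How is the weather?", "It is sunny today.")

# 0s mark tokens of the first sentence, 1s the second; these IDs index
# the segment-embedding table whose size is type_vocab_size.
print(encoded["token_type_ids"])
```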
Stability and Efficiency:
- `layer_norm_eps` ensures stable training, while `initializer_range` affects convergence (both are used as sketched below).
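As a rough paraphrase (not the library's exact code) of how these two values are consumed, `layer_norm_eps` is passed to the layer-normalization modules and `initializer_range` is used as the standard deviation of the normal weight initialization:

```python
import torch.nn as nn
from transformers import BertConfig

config = BertConfig()  # defaults: layer_norm_eps=1e-12, initializer_range=0.02

# layer_norm_eps guards against division by (near) zero in LayerNorm's variance term.
layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

# initializer_range is used as the std of the normal init applied to weight matrices;
# this mirrors, approximately, what the library does when building a fresh model.
linear = nn.Linear(config.hidden_size, config.hidden_size)
linear.weight.data.normal_(mean=0.0, std=config.initializer_range)
linear.bias.data.zero_()
```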
References
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018). DOI: 10.48550/arXiv.1810.04805
- Read on: Hugging Face BERT Documentation
Challenges and reports on configuration
- Report: “Revealing the Dark Secrets of BERT” (Kovaleva et al., 2019).
- Challenges: Over-parameterization and inefficiency in fine-tuning for domain-specific tasks.
Playgrounds to experiment
- Hugging Face Spaces: https://huggingface.co/spaces
- Google Colab with Transformers: https://colab.research.google.com/
- OpenAI Playground for Text Understanding: https://platform.openai.com/playground