This post is related to:
- Techniques for handling context in LLMs
- GPT2 configuration parameters overview
- BART configuration parameters overview
Parameter list and descriptions
| Parameter | Description | Data Type/Options |
|---|---|---|
| `hidden_size` | Dimensionality of the hidden states and embeddings. | Integer, e.g., 768, 1024. |
| `num_hidden_layers` | Number of hidden layers in the transformer encoder. | Integer, e.g., 12, 24. |
| `num_attention_heads` | Number of attention heads per transformer layer. | Integer; must evenly divide `hidden_size`. |
| `vocab_size` | Vocabulary size of the tokenizer, i.e., the range of valid token IDs. | Integer, e.g., 30522. |
| `intermediate_size` | Dimensionality of the feedforward (intermediate) layers. | Integer, e.g., 3072. |
| `hidden_dropout_prob` | Dropout probability for fully connected layers in the encoder. | Float between 0.0 and 1.0, typically 0.1. |
| `attention_probs_dropout_prob` | Dropout probability in the attention mechanism. | Float between 0.0 and 1.0, typically 0.1. |
| `max_position_embeddings` | Maximum number of positions for input tokens (i.e., maximum sequence length). | Integer, e.g., 512. |
| `type_vocab_size` | Size of the token type vocabulary used for segment embeddings. | Integer, typically 2. |
| `initializer_range` | Standard deviation for weight initialization. | Float, e.g., 0.02. |
| `layer_norm_eps` | Small value added for numerical stability in layer normalization. | Float, e.g., 1e-12. |
| `output_hidden_states` | Whether to return the hidden states of all layers. | Boolean, `true` or `false`. |
| `output_attentions` | Whether to return the attention weights. | Boolean, `true` or `false`. |
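For concreteness, here is a minimal sketch (assuming the Hugging Face `transformers` library and PyTorch are installed) of how these parameters map onto a `BertConfig`. The values mirror the `bert-base-uncased` defaults and are illustrative rather than prescriptive.

```python
from transformers import BertConfig, BertModel

# Illustrative configuration: values correspond to bert-base-uncased defaults.
config = BertConfig(
    vocab_size=30522,                  # tokenizer vocabulary size
    hidden_size=768,                   # dimensionality of hidden states and embeddings
    num_hidden_layers=12,              # number of transformer encoder layers
    num_attention_heads=12,            # heads per layer; 768 / 12 = 64-dim heads
    intermediate_size=3072,            # feedforward (intermediate) layer dimensionality
    hidden_dropout_prob=0.1,           # dropout on fully connected layers
    attention_probs_dropout_prob=0.1,  # dropout on attention probabilities
    max_position_embeddings=512,       # maximum supported sequence length
    type_vocab_size=2,                 # segment (token type) embeddings
    initializer_range=0.02,            # stddev for weight initialization
    layer_norm_eps=1e-12,              # epsilon for layer normalization
)

# Randomly initialized encoder with this architecture (no pretrained weights).
model = BertModel(config)
print(model.config)
```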
Summary of parameter impact
How changes to these parameters affect model behavior:
Capacity:
- Increasing `hidden_size`, `num_hidden_layers`, or `num_attention_heads` allows the model to capture more complex patterns but increases resource usage (see the sketch after this list).
Regularization:
- Dropout probabilities (`hidden_dropout_prob`, `attention_probs_dropout_prob`) control overfitting risk but can hinder learning if set too high.
Pretraining vs. Fine-tuning:
- `type_vocab_size` is essential for tasks requiring segment embeddings (e.g., sentence pairs).
Stability and Efficiency:
- `layer_norm_eps` ensures stable training, while `initializer_range` affects convergence.
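The sketch below (again assuming `transformers` and PyTorch) illustrates the capacity point by comparing the parameter counts of two randomly initialized BERT encoders. The specific sizes are assumptions chosen to resemble base- and large-style configurations, not recommendations.

```python
from transformers import BertConfig, BertModel

def count_parameters(config: BertConfig) -> int:
    """Instantiate a randomly initialized BertModel and count its parameters."""
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

# Base-like configuration (illustrative values).
base = BertConfig(hidden_size=768, num_hidden_layers=12,
                  num_attention_heads=12, intermediate_size=3072)
# Large-like configuration (illustrative values).
large = BertConfig(hidden_size=1024, num_hidden_layers=24,
                   num_attention_heads=16, intermediate_size=4096)

print(f"base-like:  {count_parameters(base) / 1e6:.1f}M parameters")
print(f"large-like: {count_parameters(large) / 1e6:.1f}M parameters")
```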
References
- Paper
  - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
  - DOI: 10.48550/arXiv.1810.04805
- Read on
  - Hugging Face BERT Documentation
Challenges and reports on configuration
- Report: “Revealing the Dark Secrets of BERT” (Kovaleva et al., 2019).
- Challenges: Over-parameterization and inefficiency in fine-tuning for domain-specific tasks.
Playgrounds to experiment
- Hugging Face Spaces: https://huggingface.co/spaces
- Google Colab with Transformers: https://colab.research.google.com/
- OpenAI Playground for Text Understanding: https://platform.openai.com/playground
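As a starting point in any of these environments, the snippet below is a sketch assuming `transformers`, `torch`, and access to the public `bert-base-uncased` checkpoint. It loads a pretrained BERT and requests all hidden states and attention weights, exercising the `output_hidden_states` and `output_attentions` flags from the table above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,   # return hidden states from every layer
    output_attentions=True,      # return attention weights from every layer
)

inputs = tokenizer("Configuration parameters shape model behavior.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(len(outputs.hidden_states))       # embedding output + one tensor per layer (13 for 12 layers)
print(outputs.hidden_states[-1].shape)  # (batch, sequence_length, hidden_size)
print(len(outputs.attentions))          # one attention tensor per layer
print(outputs.attentions[0].shape)      # (batch, num_heads, seq_len, seq_len)
```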