Parameter list and descriptions
Parameter | Description | Data Type/Options |
---|---|---|
max_position_embeddings | Maximum number of positions for input tokens. | Integer, e.g., 1024. |
d_model | Dimensionality of the model's embeddings and hidden states. | Integer, e.g., 768, 1024. |
encoder_layers | Number of layers in the encoder. | Integer, e.g., 6, 12. |
decoder_layers | Number of layers in the decoder. | Integer, e.g., 6, 12. |
encoder_attention_heads | Number of attention heads in the encoder. | Integer, e.g., 12. |
decoder_attention_heads | Number of attention heads in the decoder. | Integer, e.g., 12. |
vocab_size | Vocabulary size of the tokenizer; defines the range of valid token IDs. | Integer, e.g., 50265. |
activation_function | Activation function used in the feedforward layers. | String: relu, gelu, tanh, etc. |
dropout | Dropout probability applied to the embeddings and fully connected layers. | Float between 0.0 and 1.0, typically 0.1. |
attention_dropout | Dropout probability in the attention mechanism. | Float between 0.0 and 1.0, typically 0.1. |
init_std | Standard deviation for weight initialization. | Float, e.g., 0.02. |
encoder_ffn_dim | Dimensionality of the encoder feedforward layers. | Integer, e.g., 3072. |
decoder_ffn_dim | Dimensionality of the decoder feedforward layers. | Integer, e.g., 3072. |
scale_embedding | Whether to scale the embeddings by sqrt(d_model). | Boolean, true or false. |
use_cache | Whether to use cached key/values for faster decoding. | Boolean, true or false. |
pad_token_id | Token ID used for padding. | Integer, typically 1 for BART. |
bos_token_id | Token ID for the beginning-of-sequence token. | Integer, typically 0. |
eos_token_id | Token ID for the end-of-sequence token. | Integer, typically 2. |
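As a quick reference, the sketch below (assuming the Hugging Face Transformers library) shows how the parameters in the table map onto a BartConfig. The values are the example values listed above, not tuned recommendations.

```python
from transformers import BartConfig, BartForConditionalGeneration

# Build a BART configuration from the parameters described in the table.
# Values mirror the examples above; adjust them for your own experiments.
config = BartConfig(
    vocab_size=50265,
    max_position_embeddings=1024,
    d_model=768,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    activation_function="gelu",
    dropout=0.1,
    attention_dropout=0.1,
    init_std=0.02,
    scale_embedding=False,
    use_cache=True,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
)

# Instantiate a randomly initialized model from the configuration; the same
# fields can be inspected on a pretrained checkpoint via BartConfig.from_pretrained.
model = BartForConditionalGeneration(config)
print(model.config.d_model, model.num_parameters())
```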
Summary of parameter impact
How Changes Reflect on Model Behavior:
Model Complexity:
- Increasing encoder_layers, decoder_layers, d_model, or the number of attention heads enhances model capacity but increases computational requirements (the sketch after this list compares parameter counts).
Regularization:
- Dropout parameters (dropout, attention_dropout) control overfitting risk but may reduce performance if set too high.
Encoder-Decoder Interactions:
- encoder_ffn_dim and decoder_ffn_dim directly influence the model's ability to learn complex patterns.
Efficiency:
- Enabling use_cache improves inference time for autoregressive tasks (timed in the sketch below).
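To make the complexity and efficiency points concrete, here is a rough sketch (assuming the Hugging Face Transformers library and PyTorch; the layer sizes are illustrative, not tuned) that compares parameter counts for a smaller and a larger configuration and times greedy generation with and without the key/value cache.

```python
import time
import torch
from transformers import BartConfig, BartForConditionalGeneration

# Model complexity: widening and deepening the network grows the parameter
# count quickly (sizes below are illustrative, not recommendations).
small = BartConfig(d_model=256, encoder_layers=3, decoder_layers=3,
                   encoder_attention_heads=4, decoder_attention_heads=4,
                   encoder_ffn_dim=1024, decoder_ffn_dim=1024)
large = BartConfig(d_model=512, encoder_layers=6, decoder_layers=6,
                   encoder_attention_heads=8, decoder_attention_heads=8,
                   encoder_ffn_dim=2048, decoder_ffn_dim=2048)
for name, cfg in [("small", small), ("large", large)]:
    print(name, BartForConditionalGeneration(cfg).num_parameters())

# Efficiency: with use_cache=True, previously computed key/values are reused
# at each decoding step instead of being recomputed. The model here has random
# weights, so the output is meaningless; only the relative timing matters.
model = BartForConditionalGeneration(small)
model.eval()
input_ids = torch.randint(4, small.vocab_size, (1, 32))
for use_cache in (True, False):
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=64, min_new_tokens=64,
                   use_cache=use_cache, do_sample=False)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```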
References
- Paper: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Paper: Improving Sharpness-Aware Minimization with Fisher Mask for Better Generalization on Language Models
- Hugging Face BART Documentation
- Google Colab notebook: BART Learns to Rap - Medium.ipynb
Challenges and reports on configuration
Report:
- Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Challenges: Balancing fine-tuning for generative and discriminative tasks in sequence-to-sequence models.
- Paper: Unified Generative and Discriminative Training for Multi-modal Large Language Models
Playgrounds for experimentation
- Hugging Face Spaces: https://huggingface.co/spaces
- Google Colab with Transformers: https://colab.research.google.com/
- OpenAI Playground for Generative Tasks: https://platform.openai.com/playground