BERT configuration parameters overview

This post is related to:

  1. Techniques for handling context in LLM models
  2. GPT2 configuration parameters overview
  3. BART configuration parameters overview

Parameter list and descriptions

| Parameter | Description | Data type / options |
|---|---|---|
| hidden_size | Dimensionality of the hidden states and embeddings. | Integer, e.g., 768 or 1024. |
| num_hidden_layers | Number of hidden layers in the transformer encoder. | Integer, e.g., 12 or 24. |
| num_attention_heads | Number of attention heads per transformer layer. | Integer, must divide hidden_size evenly. |
| vocab_size | Vocabulary size of the tokenizer; defines the range of valid token IDs. | Integer, e.g., 30522. |
| intermediate_size | Dimensionality of the intermediate (feed-forward) layers. | Integer, e.g., 3072. |
| hidden_dropout_prob | Dropout probability for the fully connected layers in the encoder. | Float between 0.0 and 1.0, typically 0.1. |
| attention_probs_dropout_prob | Dropout probability applied to the attention weights. | Float between 0.0 and 1.0, typically 0.1. |
| max_position_embeddings | Maximum number of token positions, i.e., the maximum supported sequence length. | Integer, e.g., 512. |
| type_vocab_size | Size of the token type vocabulary used for segment embeddings. | Integer, typically 2. |
| initializer_range | Standard deviation for weight initialization. | Float, e.g., 0.02. |
| layer_norm_eps | Small value added for numerical stability in layer normalization. | Float, e.g., 1e-12 (the BERT default). |
| output_hidden_states | Whether to output the hidden states of every layer. | Boolean, true or false. |
| output_attentions | Whether to output the attention weights. | Boolean, true or false. |
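
To show how these parameters fit together, the sketch below builds a BERT encoder from an explicit configuration and requests hidden states and attention weights in the forward pass. It assumes the Hugging Face transformers library and PyTorch; the post itself does not prescribe a framework.

```python
# Sketch: constructing a BERT model from an explicit configuration
# (assumes the Hugging Face `transformers` and `torch` packages).
import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,                  # range of valid token IDs
    hidden_size=768,                   # embedding / hidden-state width
    num_hidden_layers=12,              # encoder depth
    num_attention_heads=12,            # must divide hidden_size evenly
    intermediate_size=3072,            # feed-forward layer width
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,       # longest supported input
    type_vocab_size=2,                 # segment A / segment B embeddings
    initializer_range=0.02,
    layer_norm_eps=1e-12,
)

model = BertModel(config)  # randomly initialized, not pretrained

# Dummy batch of token IDs; the output_* flags can also be set in the config.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
outputs = model(input_ids, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))    # num_hidden_layers + 1 (embedding output included) -> 13
print(outputs.attentions[0].shape)   # (batch, heads, seq_len, seq_len) -> (1, 12, 16, 16)
```

Setting output_hidden_states and output_attentions at call time keeps the default forward pass lightweight; setting them in the config instead makes every forward pass return them.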

Summary of parameter impact

How changes to these parameters affect model behavior:

  1. Capacity:

    • Increasing hidden_size, num_hidden_layers, or num_attention_heads allows the model to capture more complex patterns but increases resource usage (see the parameter-count sketch after this list).
  2. Regularization:

    • Dropout probabilities (hidden_dropout_prob, attention_probs_dropout_prob) control overfitting risks but can hinder learning if set too high.
  3. Pretraining vs. Fine-tuning:

    • type_vocab_size is essential for tasks requiring segment embeddings (e.g., sentence pairs); the sentence-pair tokenization sketch after this list shows how these embeddings are indexed.
  4. Stability and Efficiency:

    • layer_norm_eps ensures stable training, while initializer_range affects convergence.
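
To make the capacity point concrete, the sketch below (again assuming Hugging Face transformers) compares the parameter count of a BERT-base-sized configuration with a slimmed-down variant. num_hidden_layers scales the encoder roughly linearly, while hidden_size enters quadratically through the weight matrices; the head count only changes how the hidden dimension is split across heads.

```python
# Sketch: how hidden_size and num_hidden_layers drive model capacity
# (assumes the Hugging Face `transformers` package; numbers are illustrative).
from transformers import BertConfig, BertModel

def count_parameters(hidden_size, num_hidden_layers, num_attention_heads):
    config = BertConfig(
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
        num_attention_heads=num_attention_heads,
        intermediate_size=4 * hidden_size,  # conventional 4x ratio
    )
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

base = count_parameters(768, 12, 12)   # BERT-base-like, roughly 110M parameters
small = count_parameters(512, 4, 8)    # a much smaller encoder

print(f"base-like: {base / 1e6:.1f}M parameters")
print(f"small:     {small / 1e6:.1f}M parameters")
```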

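For the segment-embedding point, a quick way to see type_vocab_size in action is to tokenize a sentence pair and inspect the token_type_ids, which index into the segment-embedding table. This is a sketch assuming the Hugging Face bert-base-uncased tokenizer; the example sentences are placeholders.

```python
# Sketch: token_type_ids for a sentence pair index into the segment-embedding
# table, whose size is type_vocab_size (2 for standard BERT).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # downloads the vocab

encoded = tokenizer("How is the weather?", "It is sunny today.")
print(encoded["token_type_ids"])
# 0s for segment A (including [CLS] and its [SEP]), 1s for segment B
```
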
References

Challenges and reports on configuration

  • Report: “Revealing the Dark Secrets of BERT” (Kovaleva et al., 2019).
  • Challenges: Over-parameterization and inefficiency in fine-tuning for domain-specific tasks.

Playgrounds to experiment with