### 3-2 Looking at Transformers

- Sequence-to-sequence model proposed by Google in 2017
- BERT, GPT >> Transformer-based language models

### 3-3 Self-attention operation principle

- Key Components of Transformers

[Multi-head attention] Performs self-attention multiple times in parallel, so that each head can attend to different aspects of the input
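The idea of running self-attention several times in parallel can be sketched in plain numpy. This is a minimal illustration, not the full Transformer implementation: the function name `multi_head_self_attention` and the random weight matrices are assumptions for the example, and masking and biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project inputs to queries, keys, values, then split into heads
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed for all heads at once
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads back together and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
x = rng.normal(size=(seq_len, d_model))
params = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_self_attention(x, *params, num_heads=num_heads)
print(out.shape)  # (4, 8)
```

Each head sees only a `d_head`-sized slice of the projected vectors, which is what lets the heads specialize independently before their outputs are concatenated.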

### 3-4 Technologies applied to transformers

- transformer block
- Basic elements: multi-head attention, feedforward neural network, residual connections, and layer normalization.
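How the basic elements fit together can be sketched as a single block in numpy. This is a simplified, assumed arrangement (post-layer-norm, as in the original Transformer); for brevity the attention sublayer is passed in as a function, and an identity mapping stands in for it in the usage example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise FFN: expand, ReLU, project back down
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

def transformer_block(x, attention, ffn_params):
    # Residual connection around attention, then layer normalization
    x = layer_norm(x + attention(x))
    # Residual connection around the feedforward network, then layer norm
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.normal(size=(seq_len, d_model))
# Identity mapping stands in for the attention sublayer in this sketch
out = transformer_block(
    x,
    attention=lambda t: t,
    ffn_params=(rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff),
                rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)),
)
print(out.shape)  # (4, 8)
```

The residual connections (`x + sublayer(x)`) let gradients flow past each sublayer, which is what makes deep stacks of these blocks trainable.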

[Model training techniques]

- Dropout: to prevent overfitting, stochastically sets some neurons to 0, excluding them from the computation during training
  (Overfitting: a phenomenon in which a deep learning model fits the training data well but performs poorly on new data)
- Adam optimizer (optimization algorithm): updates the parameters throughout the model to minimize the error between the model output and the correct answer
  - Direction that minimizes the error >> gradient
  - Process of minimizing the error >> optimization
  - Computing the model in order from beginning to end to find the error >> forward propagation
  - Performed in the reverse order of forward propagation (obtains the gradient values that minimize the error from the computed loss) >> backpropagation
  - Optimizer used by the Transformer model >> Adam optimizer
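The dropout and Adam optimizer steps described above can be sketched in a few lines of numpy. This is an illustrative sketch, not library code: the function names are assumptions, dropout is shown in its common "inverted" form (rescaling by 1/(1-p) at training time), and Adam is demonstrated on a toy one-dimensional loss.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    # During training, zero each element with probability p and rescale
    # the survivors by 1/(1-p); at inference time, pass x through unchanged
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: running averages of the gradient and squared
    # gradient, bias-corrected, then a step that reduces the error
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy optimization: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# Each loop iteration is one forward pass (loss), backward pass (gradient),
# and Adam parameter update.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)          # backpropagation result for this toy loss
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(round(w, 2))  # converges close to 3.0
```

Note that dropout is active only during training; at inference the full network is used, which is why the training-time rescaling keeps the expected activation magnitude consistent between the two modes.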

### 3-5 Comparison of BERT and GPT