ATTENTION:
- https://www.youtube.com/watch?v=UPtG_38Oq8o&t=1246s
- https://www.youtube.com/watch?v=PSs6nxngL6k&t=208s&pp=ygUUc3RhdCBxdWVzdCBhdHRlbnRpb24%3D
- https://www.youtube.com/watch?v=kCc8FmEb1nY&t=175s&pp=ygUQYW5kcmVqIGF0dGVudGlvbg%3D%3D
- https://www.youtube.com/watch?v=XfpMkf4rD6E&pp=ygUQYW5kcmVqIGF0dGVudGlvbg%3D%3D
- https://youtu.be/9uw3F6rndnA?si=-1NJZUVYJnzrLwQc
- https://youtu.be/qaWMOYf4ri8?si=TXk8Vs3usT_wQev9
- https://youtu.be/4Bdc55j80l8?si=QaW-zni1glNFWkGn
- Andrej’s one:
  - https://www.youtube.com/watch?v=qaWMOYf4ri8&t=0s
  - https://www.youtube.com/watch?v=OxCpWwDCDFQ&t=0s
Transformers do not make the Markovian assumption of bigram models, where the next token depends only on the previous one. Instead, attention lets them model long-range dependencies between words in a sequence, regardless of how far apart they are. The original Transformer architecture is built from encoder blocks, which process the input sequence, and decoder blocks, which generate the output sequence.
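A minimal sketch of how a single self-attention head computes those pairwise dependencies, in the spirit of Andrej Karpathy's "Let's build GPT" video. The dimensions (`n_embd`, `head_size`) and the causal (decoder-style) mask are illustrative assumptions, not tied to any particular library's API.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, n_embd = 1, 5, 8       # batch size, sequence length, embedding dim (assumed toy values)
head_size = 4

x = torch.randn(B, T, n_embd)                 # token embeddings for one short sequence

# Learned projections produce a query, key, and value vector for every position.
query = torch.nn.Linear(n_embd, head_size, bias=False)
key   = torch.nn.Linear(n_embd, head_size, bias=False)
value = torch.nn.Linear(n_embd, head_size, bias=False)

q, k, v = query(x), key(x), value(x)          # each (B, T, head_size)

# Affinity between every pair of positions: this is what lets a token
# attend to any other token, no matter how far away it is.
scores = q @ k.transpose(-2, -1) / head_size**0.5   # (B, T, T)

# Causal mask, as used in a decoder block: each position may only attend
# to itself and earlier positions.
mask = torch.tril(torch.ones(T, T)).bool()
scores = scores.masked_fill(~mask, float('-inf'))

weights = F.softmax(scores, dim=-1)           # each row of attention weights sums to 1
out = weights @ v                             # (B, T, head_size): weighted sum of values
print(weights[0])                             # lower-triangular attention pattern
```

Stacking several such heads and interleaving them with feed-forward layers gives the encoder and decoder blocks described above; an encoder block would simply drop the causal mask so every position can attend to the whole input.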