ATTENTION:

https://www.youtube.com/watch?v=qaWMOYf4ri8&t=0s
https://www.youtube.com/watch?v=OxCpWwDCDFQ&t=0s


Transformers do not make the Markovian assumption that bigram models do: instead of predicting the next token from only the immediately preceding token, self-attention lets every position attend to every other position, so the model can capture long-range dependencies between words regardless of how far apart they are. The original Transformer architecture consists of encoder and decoder blocks that process an input sequence and generate an output sequence.
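To make the contrast with a bigram model concrete, here is a minimal sketch of a single causal self-attention head, assuming PyTorch; the class name, dimensions, and `block_size` are illustrative choices, not taken from any particular implementation. Each position computes attention weights over all earlier positions and takes a weighted sum of their values, rather than looking only one token back.

```python
# Minimal single-head causal self-attention (PyTorch assumed).
# Illustrates how every position attends over the whole preceding context,
# unlike a bigram model that conditions only on the previous token.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    def __init__(self, embed_dim: int, head_dim: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)
        # Causal mask: position t may only attend to positions <= t.
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape                        # batch, sequence length, embedding dim
        k = self.key(x)                          # (B, T, head_dim)
        q = self.query(x)                        # (B, T, head_dim)
        v = self.value(x)                        # (B, T, head_dim)
        # Attention scores between all pairs of positions, scaled by sqrt(head_dim).
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (B, T, T)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)      # each row sums to 1 over allowed positions
        return weights @ v                       # weighted sum over the whole context

# Usage with illustrative sizes: 4 tokens, 32-dim embeddings, 8-token context window.
x = torch.randn(1, 4, 32)
head = SelfAttentionHead(embed_dim=32, head_dim=16, block_size=8)
out = head(x)
print(out.shape)   # torch.Size([1, 4, 16])
```

The key point of the sketch is the (T, T) score matrix: every query position scores every allowed key position, so information can flow directly between distant tokens in a single layer.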