Large language models

How do we benchmark LLMs?

MMLU (Massive Multitask Language Understanding)

Essentially a multiple-choice test spanning 57 subjects, similar in format to the SAT.
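A minimal sketch of how multiple-choice accuracy is scored on a benchmark like this. The questions and the `dummy_model` function are hypothetical stand-ins; a real evaluation would query an LLM for its predicted answer letter instead.

```python
# Toy MMLU-style items: question, four choices, and the correct letter.
questions = [
    {"question": "What is 2 + 2?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "What is the capital of France?",
     "choices": ["Paris", "Rome", "Bonn", "Oslo"], "answer": "A"},
]

def dummy_model(question, choices):
    # Stand-in for an LLM call: always answers "A".
    return "A"

def accuracy(items, model):
    # Fraction of items where the model's letter matches the key.
    correct = sum(model(q["question"], q["choices"]) == q["answer"]
                  for q in items)
    return correct / len(items)

print(accuracy(questions, dummy_model))  # 0.5 on this toy set
```

The reported MMLU score is just this accuracy, usually averaged per subject and then across all 57 subjects.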

HellaSwag benchmark

Designed to evaluate commonsense natural language inference by having the model choose the most plausible completion of a sentence from several candidate endings.
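A common way to score this kind of benchmark is to have the model assign a log-likelihood to each candidate ending and pick the highest. A minimal sketch under that assumption, where `toy_logprob` is a hypothetical stand-in for an LLM's summed token log-probability (here it just favors frequent words):

```python
import math

# Hypothetical word counts standing in for a language model's statistics.
word_freq = {"the": 500, "dog": 50, "ran": 40, "after": 30,
             "ball": 30, "and": 200, "caught": 20, "it": 100,
             "flew": 10, "spaceship": 1, "away": 25}

def toy_logprob(text):
    # Stand-in for an LLM's summed log-likelihood over tokens;
    # unseen words get a small smoothing count.
    return sum(math.log(word_freq.get(w, 0.5)) for w in text.lower().split())

def pick_ending(context, endings):
    # Score each candidate continuation in context; choose the argmax.
    return max(endings, key=lambda e: toy_logprob(context + " " + e))

context = "The dog ran after the"
endings = ["ball and caught it", "spaceship flew away"]
print(pick_ending(context, endings))  # -> "ball and caught it"
```

HellaSwag's distractor endings are adversarially filtered so they look fluent but fail commonsense, which is why a plain likelihood ranking like this is harder than it sounds.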