Large language models
How do we benchmark LLMs?
Massive Multitask Language Understanding (MMLU)
Essentially a multiple-choice test over 57 subjects, kind of like the SAT.
HellaSwag benchmark
Designed to evaluate common-sense reasoning in natural language by having the model pick the most plausible ending for a partial sentence from several candidate completions.
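Both benchmarks reduce to the same scoring loop: for each item, the model assigns a score to every candidate option, the highest-scoring option is taken as the prediction, and accuracy is the fraction of items answered correctly. A minimal sketch, where `score_option` is a hypothetical stand-in for a real model's per-option likelihood (a real harness would query the LLM here):

```python
def score_option(question: str, option: str) -> int:
    # Placeholder scorer: counts word overlap between question and option.
    # A real benchmark harness would use the model's log-likelihood instead.
    return len(set(question.split()) & set(option.split()))

def evaluate(items: list[dict]) -> float:
    """Return accuracy: fraction of items where the top-scored option is correct."""
    correct = 0
    for item in items:
        scores = [score_option(item["question"], opt) for opt in item["options"]]
        prediction = scores.index(max(scores))  # pick highest-scoring option
        correct += prediction == item["answer"]
    return correct / len(items)

# Toy dataset in the spirit of MMLU's question/options/answer format.
items = [
    {
        "question": "Water boils at 100 degrees Celsius",
        "options": ["Fahrenheit", "Celsius", "Kelvin", "Rankine"],
        "answer": 1,
    },
]
print(evaluate(items))  # → 1.0
```

The same loop works for HellaSwag by treating the sentence prefix as the question and the candidate endings as the options.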