MMLU Benchmark (Massive Multi-task Language Understanding)

by Stephen M. Walker II, Co-Founder / CEO

What is the MMLU Benchmark (Massive Multi-task Language Understanding)?

The MMLU Benchmark (Massive Multi-task Language Understanding) is a challenging test designed to measure a text model's multitask accuracy by evaluating it in zero-shot and few-shot settings. The MMLU serves as a standardized way to assess AI performance on tasks that range from elementary mathematics to complex legal reasoning.

The MMLU Benchmark is a diverse set of tests designed to evaluate the understanding and problem-solving abilities of language models across multiple domains. The MMLU contains 57 tasks across topics including elementary mathematics, US history, computer science, and law. It requires models to demonstrate a broad knowledge base and problem-solving skills.

The MMLU provides a standardized way to test and compare language models such as OpenAI GPT-4, Mistral 7B, Google Gemini, and Anthropic Claude 2.

AI teams can use the MMLU for comprehensive evaluations when building or fine-tuning custom models that significantly modify a foundation model.
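
As a concrete starting point, the sketch below pulls MMLU questions and formats one as a multiple-choice prompt. It assumes the community-hosted cais/mmlu dataset on the Hugging Face Hub (with question, choices, and answer fields) and the datasets library; field names may differ in other mirrors, so treat this as an illustrative setup rather than the official harness.

```python
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_question(example: dict) -> str:
    """Render one MMLU item as the multiple-choice prompt sent to a model."""
    lines = [example["question"]]
    lines += [f"({LETTERS[i]}) {choice}" for i, choice in enumerate(example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

if __name__ == "__main__":
    # One of the 57 subject configs; passing "all" loads every subject at once.
    mmlu = load_dataset("cais/mmlu", "high_school_european_history")
    sample = mmlu["test"][0]
    print(format_question(sample))
    print("Gold answer:", LETTERS[sample["answer"]])
```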

MMLU 5-Shot Leaderboard

| Organization | Model | MMLU Score | Release Date |
|---|---|---|---|
| Anthropic | Claude 3 | 86.8 | March 2024 |
| OpenAI | GPT-4 | 86.4 | April 2023 |
| Google | Gemini Ultra | 83.7 | December 2023 |
| Mistral AI | Mistral Large | 81.2 | February 2024 |
| Anthropic | Claude 2 | 78.5 | November 2023 |
| Google | PaLM 2 | 78.3 | May 2023 |
| Google | PaLM | 75.2 | April 2022 |
| Google | Gemini Pro | 71.8 | December 2023 |
| Mistral AI | Mixtral 8x7B | 71.3 | December 2023 |
| OpenAI | GPT-3.5 | 70.0 | November 2022 |
| Hugging Face | Zephyr 7B | 66.08 | October 2023 |
| Meta | LLaMA 65B | 63.4 | February 2023 |
| Mistral AI | Mistral 7B | 60.1 | September 2023 |

Example MMLU Question and Response

We'll compare responses to a sample MMLU question from three language models: OpenAI's GPT-4 Turbo, Anthropic's Claude 2, and Zephyr (a Mistral 7B variant). This comparison highlights their respective capabilities.

To avoid contaminating future training data, we do not disclose the correct answer.

Example question on High School European History:

This question refers to the following information. Albeit the king's Majesty justly and rightfully is and ought to be the supreme head of the Church of England, and so is recognized by the clergy of this realm in their convocations, yet nevertheless, for corroboration and confirmation thereof, and for increase of virtue in Christ's religion within this realm of England, and to repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same, be it enacted, by authority of this present Parliament, that the king, our sovereign lord, his heirs and successors, kings of this realm, shall be taken, accepted, and reputed the only supreme head in earth of the Church of England, called Anglicans Ecclesia; and shall have and enjoy, annexed and united to the imperial crown of this realm, as well the title and style thereof, as all honors, dignities, preeminences, jurisdictions, privileges, authorities, immunities, profits, and commodities to the said dignity of the supreme head of the same Church belonging and appertaining; and that our said sovereign lord, his heirs and successors, kings of this realm, shall have full power and authority from time to time to visit, repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities, whatsoever they be, which by any manner of spiritual authority or jurisdiction ought or may lawfully be reformed, repressed, ordered, redressed, corrected, restrained, or amended, most to the pleasure of Almighty God, the increase of virtue in Christ's religion, and for the conservation of the peace, unity, and tranquility of this realm; any usage, foreign land, foreign authority, prescription, or any other thing or things to the contrary hereof notwithstanding. English Parliament, Act of Supremacy, 1534 

From the passage, one may infer that the English Parliament wished to argue that the Act of Supremacy would:

(A) give the English king a new position of authority
(B) give the position of head of the Church of England to Henry VIII
(C) establish Calvinism as the one true theology in England
(D) end various forms of corruption plaguing the Church in England

OpenAI GPT-4 Turbo's Response:

From the passage, one may infer that the English Parliament wished to argue that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage emphasizes the intent to "repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same," and to have the authority to "visit, repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities." This indicates a clear aim to reform the Church and address issues of corruption or malpractice.

Anthropic Claude 2's Response:

Based on the passage, the best inference about what the English Parliament wished to argue is that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage states that one purpose of the Act is "to repress and extirpate all errors, heresies, and other enormities and abuses" in the Church. It also gives the king power to "repress, redress, record, order, correct, restrain, and amend all such errors, heresies, abuses, offenses, contempts, and enormities." This suggests the Act was intended to address corruption in the Church.

Options A and B are not supported, as the passage suggests the king was already recognized as the supreme head of the Church. Option C about establishing Calvinism is not mentioned. So option D best reflects what Parliament wished to argue about the Act's purpose.

Zephyr's Response:

The English Parliament intended to argue that the Act of Supremacy would:

(D) end various forms of corruption plaguing the Church in England

The passage states that the Act is enacted "for corroboration and confirmation thereof, and for increase of virtue in Christ's religion within this realm of England, and to repress and extirpate all errors, heresies, and other enormities and abuses heretofore used in the same." This suggests that the Parliament believed that the Act would help address issues of corruption, error, heresy, and abuse within the Church in England. The Act specifically mentions "all such errors, heresies, abuses, offenses, contempts, and enormities" and empowers the king to take action against them. Therefore, the Parliament's intention appears to be that the Act would help cleanse the Church of England by addressing issues of corruption and bringing it into greater conformity with Christian virtue.

It is not clear from the passage whether the Parliament intended for the Act to create a new position of authority for the king (A), exclude his heirs from holding the position of head of the Church (B), or establish Calvinism as the one true theology in England (C). Instead, it seems that the Parliament's intent was focused on addressing issues within the English Church and granting further power to the king as the supreme head of the Church.

The performance of the models can be analyzed based on several criteria, such as coherence, relevance, detail, and clarity:

  • Coherence — All three models provide coherent responses that logically follow from the passage cited. They maintain a consistent argument throughout their responses.

  • Relevance — Each model's response is relevant to the question asked, focusing on the intent behind the Act of Supremacy as it relates to corruption in the Church of England.

  • Detail — The responses differ slightly in the amount of detail provided. OpenAI GPT-4 Turbo and Anthropic Claude 2 offer a detailed explanation of the passage, citing specific phrases and explaining their significance. Zephyr's response is also detailed but includes a broader interpretation of the Act's intent, mentioning "increase of virtue in Christ's religion" as an additional goal.

  • Clarity — All models present their arguments in a clear and understandable manner. However, OpenAI GPT-4 Turbo and Anthropic Claude 2 structure their responses with a slightly clearer focus on the direct implications of the Act, while Zephyr provides a more generalized conclusion.

Overall, the performance of each model is quite similar, with only minor differences in the level of detail and the breadth of the context provided. Each model effectively communicates the intended inference from the passage, demonstrating a high level of language understanding and response generation.

The goals and structure of the MMLU

The MMLU acts as a standard, however imperfect, for assessing the relative generalization capabilities of foundation models, helping researchers and developers make informed decisions when selecting models for specific applications.

From the Measuring Massive Multitask Language Understanding paper:

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Some key details of the MMLU benchmark include:

  • Training and Validation Sets — The dataset contains 15,908 questions, split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and consists of 1,540 questions, and the test set has 14,079 questions. A sketch of building a 5-shot prompt from the development set follows this list.

  • Model Performance — Initial results from MMLU revealed that smaller LLMs tended to perform around chance accuracy (25%), while the larger GPT-3 (175 billion parameters) fared better with 43.9% few-shot accuracy and 37.7% zero-shot accuracy. In 2023, GPT-4 reached 86.4% 5-shot accuracy, and Google Gemini Ultra reached 83.7% 5-shot accuracy. However, even the best models still need substantial improvements before they can reach human expert-level accuracy (89.8%).

  • Challenging Subjects — Models, especially large language models (LLMs), struggle with calculation-intensive tasks (e.g., physics and math) and human-value-laden subjects (e.g., morality and law).
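
The 5-shot setup used throughout the leaderboard above can be sketched as follows: the five development questions for a subject become in-context worked examples, followed by the unanswered test question. As before, the cais/mmlu dataset layout is an assumption for illustration, not the official evaluation harness.

```python
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def render(example: dict, include_answer: bool) -> str:
    """Render one item; the gold answer is appended only for the in-context shots."""
    text = example["question"] + "\n"
    text += "\n".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(example["choices"]))
    text += "\nAnswer:"
    if include_answer:
        text += f" {LETTERS[example['answer']]}\n\n"
    return text

def five_shot_prompt(subject: str, test_example: dict) -> str:
    """Five dev questions as worked examples, then the unanswered test question."""
    dev = load_dataset("cais/mmlu", subject)["dev"]  # 5 questions per subject
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    shots = "".join(render(ex, include_answer=True) for ex in dev)
    return header + shots + render(test_example, include_answer=False)
```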

What are the key features of the MMLU benchmark?

The MMLU benchmark measures a language model's performance across a wide range of tasks, covering subjects in STEM, humanities, social sciences, and more. Some key features of the MMLU benchmark include:

  • 57 Subjects — The benchmark covers 57 subjects across various fields, ranging from elementary mathematics to advanced professional levels in areas like law and ethics.

  • Granularity and Breadth — MMLU tests both world knowledge and problem-solving ability, making it ideal for identifying a model's understanding of various subjects.

  • Multitask Accuracy — The test measures a model's multitask accuracy by covering a diverse set of tasks, ensuring a comprehensive evaluation of the model's academic and professional knowledge.

  • No Large Training Sets Required — Unlike some other benchmarks, MMLU does not require large training sets. Instead, it assumes that models have acquired the necessary knowledge from reading vast quantities of diverse text, a process typically called pretraining.

These key features make the MMLU benchmark a valuable tool for evaluating the breadth of a language model's knowledge and its ability to apply that knowledge across many subjects and contexts.

How does the MMLU work?

The MMLU benchmark works by evaluating a language model on multiple-choice questions drawn from 57 subjects, spanning STEM, the humanities, the social sciences, and professional fields such as law and medicine. Each question has four answer options, and accuracy is recorded separately for each task.

The final MMLU score is the average of the scores obtained in each task, providing a comprehensive measure of the model's overall performance.
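
Under that convention, scoring reduces to a macro-average of per-task accuracy. The sketch below assumes a hypothetical data layout in which predicted and gold answer letters are grouped by task name; it is a minimal illustration of the averaging, not a full evaluation pipeline.

```python
from statistics import mean

def task_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of questions in one task answered with the correct letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mmlu_score(predictions: dict[str, list[str]], answers: dict[str, list[str]]) -> float:
    """Macro-average: each of the 57 tasks contributes equally to the final score."""
    return mean(task_accuracy(predictions[task], answers[task]) for task in answers)

# Tiny hypothetical example with two tasks instead of 57:
preds = {"astronomy": ["A", "C", "B"], "world_religions": ["D", "D"]}
golds = {"astronomy": ["A", "C", "D"], "world_religions": ["D", "B"]}
print(round(mmlu_score(preds, golds), 3))  # (2/3 + 1/2) / 2 ≈ 0.583
```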

What are its benefits?

There are many benefits to the MMLU benchmark, but three of the most important benefits are:

  1. It provides a quantitative way to compare the performance of different language models.

  2. It can be computed efficiently and is easy to understand.

  3. It evaluates knowledge and reasoning across a broad range of subjects, giving a fuller picture of general capability than single-domain benchmarks.

What are the limitations of MMLU?

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings across 57 diverse subjects. However, the benchmark has several known issues that limit its reliability:

  1. Crucial context missing from questions — Some questions in the MMLU benchmark have missing context, which makes them difficult or impossible to answer correctly. These issues are likely due to copy-paste errors.

  2. Ambiguous sets of answers — The benchmark contains questions with ambiguous answer sets, which can lead to confusion and incorrect evaluation of model performance.

  3. Wrong sets of answers — Some questions in the MMLU benchmark have incorrect answer sets, which can lead to misleading evaluation results.

  4. Sensitivity to prompting — MMLU is extremely sensitive to the exact prompt format used, which can lead to significant differences in measured performance. This makes it difficult to compare results across different implementations and papers; a small illustration follows this list.
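
To illustrate the prompt-sensitivity point, the sketch below shows two equally plausible templates for the same question together with a strict answer parser; a completion that is correct in substance can still be scored as wrong under one format. Both templates and the parser are illustrative assumptions, not the official evaluation-harness formats.

```python
QUESTION = "Which planet is known as the Red Planet?"
CHOICES = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}

def template_letter_only(question: str, choices: dict[str, str]) -> str:
    """Template 1: asks for a bare letter."""
    options = "\n".join(f"({k}) {v}" for k, v in choices.items())
    return f"{question}\n{options}\nAnswer with a single letter:"

def template_free_form(question: str, choices: dict[str, str]) -> str:
    """Template 2: an open-ended 'Answer:' cue, which often elicits full sentences."""
    options = "\n".join(f"{k}. {v}" for k, v in choices.items())
    return f"Question: {question}\nOptions:\n{options}\nAnswer:"

def strict_parse(completion: str) -> str | None:
    """A strict grader: 'B' is accepted, but 'Mars (B)' or 'The answer is B' is not."""
    text = completion.strip()
    return text if text in CHOICES else None

# The same underlying knowledge can score differently purely due to formatting:
print(strict_parse("B"))          # "B"  -> counted as correct
print(strict_parse("Mars (B)"))   # None -> counted as wrong by this grader
```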

An example of these issues can be found in the video "SmartGPT: Major Benchmark Broken - 89% on MMLU + Exam's Many Errors", in which researchers expose severe shortcomings in MMLU's test questions. These issues highlight the need for a dedicated benchmarking organization that can research and maintain accurate, robust, and sensible evaluation suites for state-of-the-art models.
