GPT-4 gets a perfect score on MIT undergraduate mathematics? Probably not: there is something wrong with the dataset itself

Original title: "The viral 'GPT-4 aces MIT undergraduate mathematics' paper is flawed: the dataset itself has problems"

Over the past two days, a paper claiming that GPT-4 passed MIT's EECS and mathematics undergraduate exams with full marks has gone viral on Twitter.

Paper address:

To briefly recap: a research team from MIT compiled a comprehensive dataset of 4,550 problems and solutions drawn from problem sets, midterms, and final exams for the Mathematics and Electrical Engineering and Computer Science (EECS) majors at their school.

The team then asked various large language models to answer the questions in this dataset, with startling results: GPT-3.5 got about a third right, while GPT-4 came close to a perfect score.

The paper's authors attribute the improvement in model performance mainly to a combination of four techniques: few-shot learning, chain-of-thought (CoT) prompting, self-critique, and expert prompting.

As the table above shows, the more of these techniques are stacked on top of GPT-4, the higher the model's reported accuracy. Plain GPT-4 already scored around 90%; with the full pipeline, it was reported to achieve a perfect score.

But what most of the people heatedly discussing the result may not have noticed is that the score itself was graded by GPT-4...

Three students, also from MIT, came across the paper shortly after it appeared. As members of the cohort supposedly being outdone by GPT-4, they wanted to understand the methodology behind the viral result right away.

After an hour of research, they had doubts about the paper's methods.

Two hours later, they realized: there was something wrong with the dataset itself.

Although the authors of the original paper claimed to have manually reviewed the released dataset for quality, the trio found clear signs that a significant portion of the test dataset was contaminated.

In other words, the model was like a student who had been told the answers before the exam: blatant "cheating".

Having raised these doubts, they immediately set out to run zero-shot GPT-4 on the dataset themselves and manually graded the first 30% of the data. The results were nowhere near those of the original paper; the two were worlds apart.
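For reference, a zero-shot re-run of this kind can be sketched roughly as follows. This is not the trio's actual script; the 'Question' column name matches the released test-set file quoted later in this article, while the model identifier and the manual-grading workflow are assumptions.

```python
# Minimal sketch of re-running zero-shot GPT-4 on the released test set and
# grading the first ~30% of answers by hand (not the trio's actual code).
import openai
import pandas as pd

df = pd.read_csv("MIT_test_set.csv")  # the released test-set file

for i, row in df.head(int(len(df) * 0.3)).iterrows():  # first ~30% of rows
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": row["Question"]}],  # no few-shot context, no solution
        temperature=0,
    )
    answer = completion["choices"][0]["message"]["content"]
    print(f"--- Question {i} ---\n{answer}\n")  # graded manually against the official solution
```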

“As MIT undergraduates, at least in our experience, this test set does not accurately represent the breadth and depth of understanding required to earn an EECS degree at MIT,” the trio wrote in a blog post.

*Latest update: zero-shot GPT-4 accuracy can reach 62.5%, still far from the 90% claimed in the paper.*

The trio also questioned the trend of over-hyped publicity: "These papers are often uploaded to arXiv and widely shared on Twitter before any legitimate peer review. This sets a bad precedent for future work."

"Deep learning" fighter Gary Marcus also unsurprisingly supported this wave of doubts:

In the same blog post, the three also pointed out that several of the authors listed on the paper, "Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models," are undergraduate researchers, and that holding these people responsible for any mistakes in the work would be inappropriate. Instead, the onus should be on the supervising authors: they are the ones expected to ensure that the work meets the standards of public scholarship in their field.

Next, let's take a look at the problems with this "explosive" paper.

**What is wrong with the dataset?**

First, according to the original paper, the dataset the researchers collected contains 4,550 problems and corresponding solutions from 30 mathematics and EECS courses required for the MIT degrees, covering both core and elective courses.

"A test set of 288 questions was randomly selected among the questions without images and with solutions," the paper reads.

This dataset (excluding the training set used to fine-tune the open-source LLM) was released on GitHub along with the paper, together with the code used to generate the reported results. However, one of the authors, Prof. Drori, removed it in a recent commit.

After checking and comparing, the three are convinced that this deleted file is the test set analyzed in the paper: every data file path in the evaluation code pointed to it, no code was provided that would modify its contents, and it was available in the initially released GitHub repository. In addition, the file meets all the schema requirements (number of rows, etc.) specified in the paper. The evidence very strongly supports all of the claims that follow.

"However, we acknowledge that it is possible that this file was replaced with a different file used for testing. If this is the case, we believe that the burden of proof lies with the authors to publicly release this data and all analyzes done with it."

So what exactly are the problems being glossed over? The three offered their own analysis.

Unsolvable problems (approximately 4% of the test set)

Given that the original paper claimed some configuration of GPT-4 produced a perfect score on the test set, the trio set out to examine individual data points. They soon discovered that a perfect score was simply not possible: at least 10 questions in the test set could not be solved with the information provided, and several others were not valid questions at all.

Such "problematic questions" accounted for at least 4% of the test set.

In a shared spreadsheet, the trio annotated the dataset examples they found problematic. Red marks questions that cannot be solved with the information provided; yellow marks parts of questions that are not reasonable.

Page address:

Duplicate questions (about 5% of the test set)

Using textual similarity detection, the trio found 14 duplicated questions (7 pairs) in the 288-question test set; in these cases the only difference between the question strings was minimal character-level noise, and some were completely identical.
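A simple way to reproduce this kind of check (not the trio's actual script) is to flag question pairs whose strings are nearly identical up to character-level noise:

```python
# Pairwise near-duplicate detection over the 288-question test set.
from difflib import SequenceMatcher
import pandas as pd

df = pd.read_csv("MIT_test_set.csv")
questions = df["Question"].tolist()

for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        ratio = SequenceMatcher(None, questions[i], questions[j]).ratio()
        if ratio > 0.95:  # near-duplicate threshold; exact duplicates give 1.0
            print(f"Questions {i} and {j} are {ratio:.1%} similar")
```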

Given these unsolvable problems, it is implausible that GPT-4 achieved 100% accuracy by any means. Either answers leaked into the pipeline at some stage, or the questions were not graded correctly.

These initial findings prompted them to investigate further, starting with the few-shot examples (used when the model fails at zero-shot). They eventually found both leakage of solution information and problems with the method used to grade the model's output. The details are as follows.

Information leakage in the few-shot examples

It is worth noting that the original paper also relies on "few-shot examples."

In short, the paper embeds the problems with OpenAI's embedding model, performs a cosine-similarity search for similar problems within the dataset, and includes those problems and their solutions in the prompt as additional context to help the model solve the target problem.

This approach is fine in itself, as long as the examples are sufficiently different from the problem in question, and avoid exposing unfair information.
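A minimal sketch of how such embedding-based retrieval typically works is shown below. The `embed()` helper is a placeholder standing in for a call to an embedding endpoint, and the function names are illustrative, not the paper's code.

```python
# Sketch of embedding-based few-shot retrieval as described above.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_few_shot(target_idx: int, embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return the indices of the k questions most similar to the target.
    Note: nothing here excludes other parts of the same multi-part question,
    which is exactly how near-duplicates can leak into the context."""
    sims = [(i, cosine_sim(embeddings[target_idx], e))
            for i, e in enumerate(embeddings) if i != target_idx]
    sims.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in sims[:k]]
```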

Just from casually scanning the published test dataset, the trio noticed something odd: many of the "few-shot examples" presented to the model were almost word-for-word identical to the question itself.

To dig into this, they wrote a simple script that measured the overlap between each problem statement and its provided few-shot examples, and plotted a histogram:
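A plausible reconstruction of this kind of overlap check might look like the following (not the trio's exact script). The few-shot column names are taken from the grading code quoted later in this article.

```python
# Compare each test question against its provided few-shot questions and
# plot the best-match similarity per question.
from difflib import SequenceMatcher
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("MIT_test_set.csv")
fs_cols = ["Few shot question 1", "Few shot question 2", "Few shot question 3"]

overlaps = [max(SequenceMatcher(None, row["Question"], str(row[c])).ratio() for c in fs_cols)
            for _, row in df.iterrows()]

plt.hist(overlaps, bins=20)
plt.xlabel("Max similarity between question and its few-shot examples")
plt.ylabel("Number of test questions")
plt.show()
```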

Many of the provided few-shot examples were nearly identical to the question itself, meaning the model was handed the answer to the question, or to a question almost indistinguishable from it. Typically this comes from the many parts of multi-part questions that share the same background.

They argue that, to properly evaluate GPT's problem-solving ability, the other parts of a multi-part problem should be completely excluded from that problem's few-shot examples. In fact, they found that the solutions to these multi-part problems often directly referred to, or outright gave away, the answer to the very part the model was being asked to solve.

Not only that, but while digging through the data they found instances where the entire question was repeated. For example:

In both cases the answer is exactly the same; it is hard to argue this is not information leakage.

Problems with GPT-4 automatic grading

In addition, the three found problems in the original paper's open-source grading mechanism:

```python
# The helper functions (get_experts, zero_shot_response, few_shot_response,
# self_critique_response, grade, correct) are assumed to be defined elsewhere
# in the released code; only the obvious formatting and syntax issues are fixed here.
import pandas as pd

def repeat_grading(input_path, output_path, num_experts=3, num_fs=3, most_recent_q=0):
    df = pd.read_csv(input_path)
    df = df.iloc[most_recent_q:]
    for index, row in df.iterrows():
        print('Completing question', index)
        question_output = row.values.tolist()
        course_name = row['Course Name']
        question = row['Question']
        solution = row['Solution']
        fs_qs = [[row['Few shot question 1'], row['Few shot solution 1']],
                 [row['Few shot question 2'], row['Few shot solution 2']],
                 [row['Few shot question 3'], row['Few shot solution 3']]]
        experts = get_experts(course_name, question, num_experts).split(', ')
        # Escalating prompting strategies: zero-shot, few-shot, few-shot + chain of thought
        s = [lambda expert: zero_shot_response(question, expert),
             lambda expert: few_shot_response(expert, question, fs_qs),
             lambda expert: few_shot_response(expert, question, fs_qs, True)]
        critiques = [["Review your previous answer and find problems with your answer.",
                      "Based on the problems you found, improve your answer."],
                     ["Please provide feedback on the following incorrect answer.",
                      "Given this feedback, answer again."]]
        for expert in experts:
            print("Using expert", expert)
            question_output.append(expert)
            crit = True
            for prompt_fn in s:
                _response = prompt_fn(expert)  # calls a fresh ChatCompletion.create
                _grade = grade(course_name, question, solution, _response)  # GPT-4 auto-grading, comparing answer to solution
                question_output += [_response, _grade]
                if correct(_grade):
                    crit = False
                    break
            if crit:
                # Self-critique rounds, reached only if no strategy above was graded correct
                for critique in critiques:
                    crit_response = self_critique_response(expert, course_name, question, question_output[-2], critique)  # calls a fresh ChatCompletion.create
                    crit_grade = grade(course_name, question, solution, crit_response)  # GPT-4 auto-grading, comparing answer to solution
                    question_output += [crit_response, crit_grade]
                    if correct(crit_grade):
                        break

repeat_grading('MIT_test_set.csv', 'MIT_test_set_graded.csv')
```

The code reveals a serious problem in the grading process: answers are checked by GPT-4 itself, which receives (a) the original question, (b) the ground-truth solution, and (c) GPT's own answer as parameters of the grading call.
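The `grade()` and `correct()` helpers are not shown in the snippet above. The following is a purely illustrative sketch of what a GPT-4 grader of this shape looks like; it is not the paper's exact prompt, but it captures the structure being criticized: the same family of model both answers and judges, with the official solution in hand.

```python
# Illustrative sketch of a GPT-4 grader that sees the ground-truth solution.
import openai

def grade(course_name, question, solution, response):
    prompt = (
        f"Course: {course_name}\n"
        f"Question: {question}\n"
        f"Official solution: {solution}\n"
        f"Student answer: {response}\n"
        "Does the student answer match the official solution? Reply 'correct' or 'incorrect'."
    )
    out = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out["choices"][0]["message"]["content"]

def correct(grade_text):
    # Binary pass/fail signal that drives the retry loop above.
    return "incorrect" not in grade_text.lower()
```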

In more technical fields, GPT is more likely to hold implicit misunderstandings, so this kind of automatic grading is more prone to "self-deceiving" results.

Also, while cascading prompts is a common technique in many recent GPT papers, there is great potential for data leakage here: each stage not only feeds back binary pass/fail information derived from the ground truth, but the loop keeps re-prompting until a correct answer is reached.

Although the generated prompts never see the actual answer, re-prompting until the grader signals "correct" is enough to leak it, especially for multiple-choice questions, which make up 16% of the test set, where an (almost) unlimited number of tries all but guarantees that the correct answer will eventually appear.

It is as if someone holding the answer sheet kept telling the students taking the exam whether their answer was right or wrong, and kept prompting them until they arrived at the correct answer.
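A toy simulation (not from the blog) illustrates the point: even pure guessing, combined with a binary "correct / try again" signal, inflates multiple-choice accuracy dramatically.

```python
# With 4 answer choices and up to 6 independent graded attempts,
# guessing alone "passes" roughly 1 - (3/4)^6 ≈ 82% of the time.
import random

def passes_with_retries(num_choices=4, max_attempts=6):
    answer = random.randrange(num_choices)
    for _ in range(max_attempts):
        if random.randrange(num_choices) == answer:  # a fresh guess each round
            return True  # the grader reports "correct" and the loop stops
    return False

trials = 10_000
wins = sum(passes_with_retries() for _ in range(trials))
print(f"Pass rate from guessing alone: {wins / trials:.1%}")
```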

Summary

At the end of the blog, the three wrote:

The paper speaks to a larger trend in recent research in the field of artificial intelligence. As the field progresses faster and faster, the time cadence of new discoveries seems to shorten, which is often accompanied by shortcuts. A particularly worrisome trend is the use of language-based models like GPT-4 to assess a model's accuracy.

While such models are useful tools, their conclusions should never be exaggerated, nor taken as ground truth. Recent work has shown that without accurate ground-truth information, GPT-4 evaluators cannot be used reliably for verification. At a minimum, a random subset of the dataset should be chosen to compare GPT-4's grading against human evaluation. Language models cannot yet be regarded as oracles for generating ground truth.

Furthermore, it is extremely important to re-evaluate every data point and perform basic checks before using the data, whether for training, inference, benchmarking, or otherwise. Given the small size of the dataset in question, simple manual verification is easily accomplished within the scope of the work.

Our critique is primarily directed at the methodology and rigor of this study, not its content. We have no opinion on the ability of large language models to actually solve the MIT curriculum, except that the paper fails to demonstrate this in a scientifically rigorous manner.

Reference link:
