GPT-4 scored full marks on MIT undergraduate math? It looks fake: the dataset itself has problems
Original title: "The Viral 'GPT-4 Gets Full Marks on MIT Undergraduate Math' Paper Cheated: The Dataset Itself Has Problems"
In the past two days, a paper claiming that GPT-4 passed MIT's undergraduate EECS and mathematics exams with full marks has gone viral on Twitter.
To briefly recap: a research team from MIT compiled a comprehensive dataset of 4,550 problems and solutions drawn from course questions, midterms, and final exams for the Mathematics and Electrical Engineering and Computer Science (EECS) majors at their school.
The research team then asked various large language models to answer the questions in this dataset, and the results were startling: GPT-3.5 solved about a third correctly, while GPT-4 scored nearly full marks.
As shown in the table above, the more prompting techniques were layered on top of GPT-4, the higher the model's accuracy: plain GPT-4 already reached about 90%, and after the full pipeline of additional tricks it jumped straight to a perfect score.
But many of the netizens hotly debating the result may not have noticed that the perfect score was itself graded by GPT-4...
Three MIT students decided to take a closer look. After an hour of digging, they had doubts about the paper's methodology.
Two hours later, they realized: there was something wrong with the dataset itself.
In other words, the model is like a student who was told the answer before the exam, which is blatant "cheating".
The trio also pushed back against the tide of over-hyped publicity: "These papers are often uploaded to arXiv and widely shared on Twitter before any legitimate peer review. This sets a bad precedent for future work."
Noted deep-learning skeptic Gary Marcus also, unsurprisingly, backed this wave of doubts:
Next, let's take a look at what is wrong with this viral paper.
**What is wrong with the dataset?**
First, according to the original paper, the dataset the researchers collected contains 4,550 problems and corresponding solutions from 30 Mathematics and EECS courses required for an MIT degree, covering both core and elective courses.
This dataset (excluding the training set used to fine-tune an open-source LLM) was released on GitHub alongside the paper, together with the code used to generate the reported results. However, the author, Prof. Drori, removed it in a recent commit.
"However, we acknowledge that it is possible that this file was replaced with a different file used for testing. If this is the case, we believe that the burden of proof lies with the authors to publicly release this data and all analyses done with it."
So what exactly is being glossed over? The three gave their own analysis.
Unsolvable problems (approximately 4% of the test set)
Given that the original paper claimed that some configuration of GPT-4 produced a perfect score on the test set, the trio set out to examine individual data points. They soon discovered that a perfect score was simply not possible, as at least 10 questions in the dataset could not be solved with the information provided, and several others were simply not valid questions at all.
Such "problematic questions" accounted for at least 4% of the test set.
In a shared Excel spreadsheet, the trio annotated the dataset examples they found problematic: red marks questions that cannot be solved with the information provided, and yellow marks parts of questions that are not reasonable.
Page address:
Duplicate questions (about 5% of the test set)
Using textual similarity detection, the trio found that 14 questions (7 pairs) in the 288-question test set were duplicates; in these cases the only difference between the question strings was minimal character-level noise, or there was no difference at all.
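The trio's detection script is not included in the post, but a minimal sketch of this kind of near-duplicate check, here using Python's standard difflib with an arbitrary 0.95 similarity threshold (the function and the sample questions below are illustrative, not theirs), could look like this:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(questions, threshold=0.95):
    """Return index pairs of questions whose character-level similarity exceeds the threshold."""
    pairs = []
    for (i, q1), (j, q2) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, q1, q2).ratio()  # 1.0 means the strings are identical
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs

# Illustrative question strings (not from the actual MIT test set)
questions = [
    "Compute the eigenvalues of the matrix A = [[2, 0], [0, 3]].",
    "Compute the eigenvalues of the matrix A = [[2, 0], [0, 3]]. ",
    "State and prove the divergence theorem.",
]
print(find_near_duplicates(questions))  # flags the first two as a near-duplicate pair
```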
Given the unsolvable problems alone, it is hard to believe GPT-4 could reach 100% accuracy by any means: either answers were leaked into the pipeline at some stage, or the questions were not graded correctly.
These initial findings prompted them to dig further, starting with the few-shot examples (which the pipeline falls back on when the model fails at zero-shot). They eventually found both leakage of solution information and problems with the method used to grade the model's output. The details are as follows.
Information leakage in the few-shot examples
It is worth noting that the original paper itself describes the use of "few-shot examples".
In short, the paper uses OpenAI embeddings to run a cosine-similarity search for similar problems within the dataset, and feeds those problems and their solutions to the model as additional context to help it solve the problem at hand.
This approach is fine in itself, as long as the examples are sufficiently different from the problem in question and do not expose unfair information.
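The paper's retrieval code is not reproduced in the post, but the general mechanism can be sketched as follows. Here `pick_few_shot`, the precomputed embedding vectors, and the `(question, solution)` corpus are illustrative assumptions; the point is simply that the nearest neighbours by cosine similarity become the few-shot context:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_few_shot(target_emb, corpus_embs, corpus_items, k=3):
    """Return the k (question, solution) pairs whose embeddings are closest to the target.

    target_emb and corpus_embs are assumed to be precomputed embedding vectors
    (the paper reports using OpenAI embeddings); corpus_items is aligned with corpus_embs.
    """
    sims = [cosine_sim(target_emb, e) for e in corpus_embs]
    top = np.argsort(sims)[::-1][:k]  # indices of the k most similar problems
    return [corpus_items[i] for i in top]
```

Note that nothing in this retrieval step, as described, excludes the target question itself or other parts of the same multi-part question from the retrieved examples, which is exactly the kind of leakage the trio describe next.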
Just by randomly scanning the published test dataset, the trio noticed something odd: many of the "few-shot examples" presented to the model were almost word-for-word identical to the question itself.
To quantify this, they wrote a simple script that measured the overlap between each problem statement and its listed few-shot examples and plotted the result as a histogram:
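Their script is not shown in the post, but a minimal version of the same measurement, using difflib's similarity ratio as the overlap score and matplotlib for the histogram (the metric they actually used may differ), might look like this:

```python
from difflib import SequenceMatcher
import matplotlib.pyplot as plt

def max_overlap(question, few_shot_questions):
    """Highest similarity between a test question and any of its provided few-shot questions."""
    return max(SequenceMatcher(None, question, fs).ratio() for fs in few_shot_questions)

def plot_overlap_histogram(rows):
    """rows: iterable of (question, [few-shot question strings]) pairs from the test set."""
    scores = [max_overlap(q, fs) for q, fs in rows]
    plt.hist(scores, bins=20, range=(0.0, 1.0))
    plt.xlabel("Similarity between question and closest few-shot example")
    plt.ylabel("Number of questions")
    plt.show()
```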
They argue that, to properly evaluate GPT's problem-solving ability, the other parts of a multi-part problem should be completely excluded from that problem's few-shot examples. In fact, they found that the solutions to these multi-part problems often directly referred to, or outright gave away, the answer to the very part the model was being asked to solve.
Not only that: while digging through the data, they found instances where the entire question was repeated. For example:
GPT-4's automatic grading is also problematic
In addition, the three found problems in the paper's open-source grading pipeline:
import pandas as pd

def repeat_grading(input_path, output_path, num_experts = 3, num_fs = 3, most_recent_q = 0):
    df = pd.read_csv(input_path)
    df = df.iloc[most_recent_q:]
    for index, row in df.iterrows():
        print('Completing question', index)
        question_output = row.values.tolist()
        course_name = row['Course Name']
        question = row['Question']
        solution = row['Solution']
        fs_qs = [[row['Few shot question 1'], row['Few shot solution 1']], [row['Few shot question 2'], row['Few shot solution 2']], [row['Few shot question 3'], row['Few shot solution 3']]]
        experts = get_experts(course_name, question, num_experts).split(', ')
        prompts = [lambda expert: zero_shot_response(question, expert),
                   lambda expert: few_shot_response(expert, question, fs_qs),
                   lambda expert: few_shot_response(expert, question, fs_qs, True)
                   ]
        critiques = [["Review your previous answer and find problems with your answer.", "Based on the problems you found, improve your answer."], ["Please provide feedback on the following incorrect answer.","Given this feedback, answer again."]]
        for expert in experts:
            print("Using expert", expert)
            question_output.append(expert)
            crit = True
            for prompt in prompts:
                prompt_response = prompt(expert) # calls fresh ChatCompletion.create
                prompt_grade = grade(course_name, question, solution, prompt_response) # GPT-4 auto-grading comparing answer to solution
                question_output += [prompt_response, prompt_grade]
                if correct(prompt_grade):
                    crit = False
                    break
            if crit:
                for critique in critiques:
                    crit_response = self_critique_response(expert, course_name, question, question_output[-2], critique) # calls fresh ChatCompletion.create
                    crit_grade = grade(course_name, question, solution, crit_response) # GPT-4 auto-grading comparing answer to solution
                    question_output += [crit_response, crit_grade]
                    if correct(crit_grade):
                        break

repeat_grading('MIT_test_set.csv', 'MIT_test_set_graded.csv')
The code reveals serious problems with the grading process: answers are checked by GPT-4 itself, which is handed a) the original question, b) the ground-truth solution, and c) GPT's own answer as parameters of the grading call.
In more technical fields, GPT is more likely to harbor subtle misunderstandings, so this kind of automatic grading is prone to "self-fooling" results.
Also, while chaining prompts in a cascade is a common technique in many recent GPT papers, there is a lot of potential for data leakage here: each stage not only provides binary right/wrong information derived from the ground truth, but the loop keeps re-prompting until the answer is graded as correct.
Although these prompts never reveal the actual answer, re-prompting until the grader says "correct" is enough, especially for multiple-choice questions, which make up 16% of the test set, where an unlimited number of tries (almost) guarantees that the correct answer will eventually appear.
This is like someone holding the answer sheet and telling a student taking the exam whether their answer is right or wrong, prompting them again and again until they get it right.
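To see why binary right/wrong feedback is enough, here is a toy illustration of the failure mode (not the paper's code): against a grader that only reports correct/incorrect, a "model" that knows nothing still reaches the right answer to a four-option multiple-choice question in at most four tries.

```python
import random

def grade(answer, solution):
    """Stand-in binary grader: reveals only whether the answer matches the solution."""
    return answer == solution

def guess_until_correct(choices, solution):
    """A 'model' with zero knowledge that simply answers again after every 'incorrect' grade."""
    remaining = list(choices)
    random.shuffle(remaining)
    for tries, answer in enumerate(remaining, start=1):
        if grade(answer, solution):
            return tries  # the loop stops on 'correct', just like the paper's re-prompting loop
    return None

# With four options, the right answer is always reached within four tries.
print(guess_until_correct(["A", "B", "C", "D"], solution="C"))
```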
Summary
At the end of the blog, the three wrote:
Reference link: