GPT-5’s Logic: Faster Than the Best Humans, Still Blind to Its Own Mistakes

Introduction

Two years ago, I published an article comparing GPT-3.5 and GPT-4 using a formal test based on my three decades of experience recruiting and evaluating software developers. The test measures logical reasoning, abstraction, and problem-solving skills — not IQ — and is designed to identify programmers who can excel in complex system design.

With the release of GPT-5, I have now repeated the test to see how it compares with its predecessors. The results are striking: GPT-5 has reached the performance of the very best programmers I have evaluated over the years — but with important caveats about its fundamental limitations.

“When GPT-5 makes a mistake, it often can’t recover — even when explicitly told it’s wrong.”


Test Structure

The evaluation consists of 8 sections with 10 questions each (80 total), covering skills such as:

  1. Vocabulary and semantic similarity
  2. Pattern recognition with number sequences
  3. Logical reasoning with word order in sentences
  4. Pattern completion using groups and numbers
  5. Letter sequence pattern recognition
  6. Analogical reasoning
  7. Logical word sequence reasoning
  8. Basic arithmetic, logical reasoning, and abstract thinking

Human candidates have 45 minutes to complete the test. GPT models, of course, respond instantly.

In my GPT-3.5 and GPT-4 evaluations, I used a clarified, text-only version of the test to remove ambiguity and avoid visual interpretation issues. For GPT-5, I first used the exact same PDF given to human candidates — including visual elements like underlines and tables embedded as images, not as text — then re-ran it with the clarified text-only version.

Finally, I performed a review phase: re-submitting wrong answers (both with and without explicitly stating they were wrong) to see if the model could correct itself.


Results

[Figure: GPT Total Score]

[Figure: GPT Scores by Section]

Key Observations

Visual Input Penalty

  • In the raw PDF test, GPT-5 scored only 1/10 in Sections 3 and 4, where the necessary information had to be extracted solely from visual elements, with no textual clarification.
  • With clarified text, these sections jumped to perfect 10/10 scores — confirming that prompt clarity and input format still matter.

Prompt Engineering Still Matters

Just as with GPT-4, rephrasing and clarifying instructions significantly boosted performance.

Review-Phase Finding

  • When given wrong answers without telling it they were wrong, GPT-5 fixed some mistakes, improving its total from 70 to 73.
  • When explicitly told certain answers were wrong, it corrected only 1 of 4 — repeating the same flawed reasoning in the others.

Second-Layer Reasoning Failures

Some errors reveal a deeper limitation. For example:

Sequence: ABBCBDEFBGHI

Correct approach: Identify “B” as a separator and notice that the groups between the Bs grow in size: A | BC | DEF | GHI? → the last group should grow to GHIJ, so the sequence continues with “J.”

GPT-5’s answer: “B” — it caught the separator pattern but missed the growing-group rule.

This type of multi-rule reasoning is where GPT-5 still struggles. Even with “deep reasoning mode,” it tends to stick with its first detected pattern rather than combining rules — a reflection of its statistical nature rather than true conceptual understanding.
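To make the two-rule structure concrete, here is a minimal sketch (my own illustration, not part of the original test materials) that applies both rules at once: greedily read groups of size 1, 2, 3, … and expect a “B” separator between them. The `parse_growing_groups` helper and its names are hypothetical.

```python
def parse_growing_groups(seq: str, sep: str = "B") -> list[str]:
    """Greedily parse seq as groups of size 1, 2, 3, ... separated by sep."""
    groups, i, size = [], 0, 1
    while i < len(seq):
        groups.append(seq[i:i + size])  # take the next group of `size` letters
        i += size
        if i < len(seq):
            assert seq[i] == sep, f"expected separator {sep!r} at index {i}"
            i += 1  # skip the separator
        size += 1  # rule 2: each group is one letter longer than the last
    return groups

groups = parse_growing_groups("ABBCBDEFBGHI")
print(groups)  # ['A', 'BC', 'DEF', 'GHI'] — last group is one letter short
next_letter = chr(ord(groups[-1][-1]) + 1)
print(next_letter)  # 'J', completing the four-letter group GHIJ
```

Applying only the separator rule (GPT-5’s answer) predicts another “B”; combining it with the growing-group rule predicts “J” instead.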

“Despite the hype, GPT-5 still needs a human expert to guide it, verify results, and correct inevitable errors.”


Reaching Top Programmer Scores — With a Caveat

Over decades of testing thousands of programmers, I’ve seen only a handful reach scores of 70 or higher. GPT-5, with clarified input and review, scored 73/80, matching those elite human results.

But here’s the critical difference:

  • Top human programmers can detect, understand, and adapt their reasoning when confronted with an error — often learning from the mistake on the spot.
  • GPT-5 cannot truly “learn” mid-test; it can revise an answer, but often repeats the same flawed logic.
  • This is not simply a matter of more training data or CPU power — it’s a structural limitation of current large language models.

Unexpected Behavior Note:

During the test, GPT-5 unexpectedly offered to format the answers into a ready-to-send answer sheet (PDF/Word) with a name, phone, and email header.

While this was unrelated to the evaluation itself, it highlights an important aspect of AI interaction: models sometimes shift context toward “helping” in ways that, if misunderstood or misused, could have ethical implications — such as masking the model’s role in producing the answers.


Conclusion

GPT-5 is faster than any human candidate I’ve ever tested, and its average accuracy matches the very best programmers in my career. For many business and technical tasks, it is a transformative tool.

However:

  • It is not a PhD-level autonomous thinker.
  • It cannot consistently recover from reasoning errors.
  • It still needs expert human oversight for critical decisions.

Some commercial and investment narratives tend to overstate AI’s current abilities, emphasizing its successes while underplaying its weaknesses. Decision-makers should temper such claims with an understanding of the model’s structural limits — and the fact that expert human guidance remains essential.

In short:

GPT-5 is amazing. It’s disruptive. It’s useful in almost every field. But the hype suggesting it can fully replace expert human judgment is misleading. The best results still come when a subject-matter expert guides the AI, identifies possible errors, and applies domain knowledge to verify or correct its output.


We’ll have a live session on August 20th to go over the article. Marc will explore:
– How GPT-5 compares to GPT-3.5 and GPT-4
– Where it dominates — and where it fails badly
– Why prompt engineering is still critical
– Why human oversight isn’t going anywhere

You’ll also hear the story behind the test, the ethical curveball GPT-5 threw in his experiment, and what this all means for the future of AI in high-stakes problem solving.

Bring your toughest AI questions — Marc will open the floor for live Q&A so you can dig into the details.

Join here: https://www.linkedin.com/feed/update/urn:li:activity:7360643254314328066