Where GPT-5 Shines — and Where It Still Fails

By Marc Taccolini, CEO & Founder, Tatsoft


Introduction

Two years ago, I published an article comparing GPT-3.5 and GPT-4 using a formal test based on my three decades of experience recruiting and evaluating software developers. The test measures logical reasoning, abstraction, and problem-solving skills — not IQ — and is designed to identify programmers who can excel in complex system design.

With the release of GPT-5, I have now repeated the test to see how it compares with its predecessors. The results are striking: GPT-5 has reached the performance of the very best programmers I have evaluated over the years — but with important caveats about its fundamental limitations.


Test Structure

The evaluation consists of 8 sections with 10 questions each (80 total), covering skills such as:

  1. Vocabulary and semantic similarity
  2. Pattern recognition with number sequences
  3. Logical reasoning with word order in sentences
  4. Pattern completion using groups and numbers
  5. Letter sequence pattern recognition
  6. Analogical reasoning
  7. Logical word sequence reasoning
  8. Problem solving (basic arithmetic, logical reasoning, and abstract contextualization)

Human candidates have 45 minutes to complete the test. GPT models, of course, respond far faster.

In my GPT-3.5 and GPT-4 evaluations, I used a clarified, text-only version of the test to remove ambiguity and avoid visual interpretation issues. For GPT-5, I first used the exact same PDF given to human candidates — including visual elements like underlines and tables — then re-ran it with the clarified text-only version.

Finally, I performed a review phase: re-submitting wrong answers (both with and without explicitly stating they were wrong) to see if the model could correct itself.


Results – Complete Set

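| Section                      | GPT-3.5 | GPT-4  | GPT-5 (PDF) | GPT-5 (Clarified) | GPT-5 After Review |
|------------------------------|---------|--------|-------------|-------------------|--------------------|
| Vocabulary & Semantic        | 7       | 8      | 8           | 8                 | 10                 |
| Pattern recognition, numbers | 9       | 9      | 9           | 9                 | 10                 |
| Word order in sentences      | 1       | 4      | 1           | 10                | 10                 |
| Completion using groups      | 7       | 8      | 1           | 10                | 10                 |
| Letter sequence pattern      | 6       | 6      | 8           | 8                 | 8                  |
| Analogical reasoning         | 7       | 7      | 9           | 9                 | 9                  |
| Logical word sequence        | 7       | 8      | 8           | 8                 | 8                  |
| Problem Solving              | 3       | 7      | 8           | 8                 | 8                  |
| Total (out of 80)            | 47      | 57     | 52          | 70                | 73                 |
| Percent                      | 58.75%  | 71.25% | 65%         | 87.5%             | 91.25%             |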
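As a quick arithmetic check on the table, the totals and percentages can be recomputed from the per-section scores. This is only a sketch: the dictionary keys and the `summarize` helper are names chosen here for illustration, and the numbers are copied from the table above.

```python
# Per-section scores copied from the results table (8 sections, 10 questions each).
scores = {
    "GPT-3.5":            [7, 9, 1, 7, 6, 7, 7, 3],
    "GPT-4":              [8, 9, 4, 8, 6, 7, 8, 7],
    "GPT-5 (PDF)":        [8, 9, 1, 1, 8, 9, 8, 8],
    "GPT-5 (Clarified)":  [8, 9, 10, 10, 8, 9, 8, 8],
    "GPT-5 After Review": [10, 10, 10, 10, 8, 9, 8, 8],
}
TOTAL_QUESTIONS = 80


def summarize(section_scores):
    """Return (total correct, percent of all 80 questions)."""
    total = sum(section_scores)
    return total, 100 * total / TOTAL_QUESTIONS


summary = {name: summarize(vals) for name, vals in scores.items()}

for name, (total, percent) in summary.items():
    print(f"{name}: {total}/{TOTAL_QUESTIONS} = {percent:.2f}%")
```

The recomputed values agree with the Total and Percent rows for all five columns.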

Observations on Review and Clarification

Visual Input Penalty

In the raw PDF test, GPT-5 scored only 1/10 in each of Sections 3 and 4, where the information had to be extracted from visual elements alone, with no text clarification. With the clarified text, both sections jumped to perfect 10/10 scores, confirming that prompt clarity and input format still matter.

Prompt Engineering Still Matters

Just as with GPT-4, rephrasing and clarifying instructions significantly boosted performance.

Internationalization (Localization) Still Matters

To allow an exact comparison with a large dataset, the questions were presented in Portuguese. Some errors in the Vocabulary & Semantic section, and in similar sections, seem related to subtle idiomatic nuances.

Review-Phase Findings

When wrong answers were re-submitted without being flagged as wrong, GPT-5 fixed some of them, improving its total from 70 to 73. When explicitly told that certain answers were wrong, it corrected only 1 of 4 and repeated the same flawed reasoning in the others.


The Critical Pattern Recognition Failure

Some errors reveal a deeper limitation that has profound implications for real-world applications.

The Sequence Problem

Consider the sequence: ABBCBDEFBGHI

The correct approach requires identifying two rules:

  1. “B” acts as a separator
  2. The groups between Bs grow in size and follow alphabetical order

With proper formatting, the pattern becomes clear: A (B) BC (B) DEF (B) GHI → The next letter should be J to complete “GHIJ”
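The two rules are simple enough to write down as a generator. The following is a minimal sketch, assuming the groups have sizes 1, 2, 3, ... and draw consecutive letters from the alphabet; the function names `build_sequence` and `next_letter` are illustrative and not part of the test itself.

```python
from string import ascii_uppercase


def build_sequence(groups, separator="B"):
    """Join alphabetical groups of size 1, 2, 3, ... with the separator letter."""
    parts, start = [], 0
    for size in range(1, groups + 1):
        parts.append(ascii_uppercase[start:start + size])
        start += size
    return separator.join(parts)


def next_letter(observed, separator="B", max_groups=6):
    """Return the letter that follows `observed` if it obeys the two rules."""
    # Six groups use 1+2+...+6 = 21 letters, which still fits in the alphabet.
    full = build_sequence(max_groups, separator)
    if not full.startswith(observed) or len(full) == len(observed):
        raise ValueError("cannot extend the observed sequence with these rules")
    return full[len(observed)]


print(build_sequence(4))            # ABBCBDEFBGHIJ
print(next_letter("ABBCBDEFBGHI"))  # J
```

A single rule set reproduces all twelve given letters and yields J as the thirteenth, which is what makes it the expected answer.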

GPT-5’s Inconsistent Responses

GPT-5’s responses to this sequence revealed fundamental flaws:

Attempt 1 (after 1m 38s thinking): Answer: B

  • Reasoning: “The sequence groups like this — A | BB | C | BDEF | BGHI | … From D onward it follows a pattern of B + three consecutive letters”
  • Problem: Created a rule that ignores half the dataset

Attempt 2 (after 42s thinking): Answer: J

  • Reasoning: “Read it as overlapping consecutive pairs hidden in the string: AB, BC, DE, EF, FG, GH, HI. The next pair would be IJ”
  • Problem: Correct answer but completely wrong reasoning that ignores the actual pattern

Attempt 3: Answer: B (different reasoning)

  • Created yet another flawed explanation for the same wrong answer

The critical insight: Even in “deep reasoning” mode, GPT-5 exhibits:

  • Rules that ignore large parts of the data
  • Rules that do not fit the supplied data
  • Rules that fit only a subset when a rule covering the entire dataset exists
  • A complete lack of consistency, with different logic each time for the same input
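A proposed rule can be checked mechanically against the data it claims to explain. As an illustration, this sketch tests the pairs quoted in Attempt 2 against the actual string; the variable names are chosen here for the example.

```python
from string import ascii_uppercase

SEQUENCE = "ABBCBDEFBGHI"

# Pairs quoted in Attempt 2's reasoning.
CLAIMED_PAIRS = ["AB", "BC", "DE", "EF", "FG", "GH", "HI"]

# The full chain of consecutive-letter pairs from AB through HI.
FULL_CHAIN = [ascii_uppercase[i:i + 2] for i in range(8)]

# A pair "appears" only if it occurs as two adjacent characters in the string.
missing_claimed = [pair for pair in CLAIMED_PAIRS if pair not in SEQUENCE]
missing_chain = [pair for pair in FULL_CHAIN if pair not in SEQUENCE]

print(missing_claimed)  # ['FG']
print(missing_chain)    # ['CD', 'FG']
```

One of the quoted pairs (FG) never occurs as adjacent letters in the string, and the chain also skips CD, so the explanation does not fit the supplied data even though it lands on the right letter.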

The Consistency Problem Extends Beyond GPT

This lack of consistency isn’t unique to GPT. When searching for my own article on reasoning tests using “GPT 5 reasoning tests Taccolini,” Google’s Gemini AI consistently finds the correct article but attributes it to different fictitious relatives each time:

  • First search: Credits “Giorgio Taccolini”
  • Second search: Credits “Francesco Taccolini”
  • Third search: Credits “Matteo Taccolini”
  • Fourth search: Credits “Luca Taccolini”

At least it maintained Italian heritage consistency!

(Screenshots of these attribution errors are shown at the end of this article)

For the general public, these examples might be classified as curiosities or even humorous. But in industrial environments dealing with real physical assets — potentially critical infrastructure — they represent fundamental risks.


Unexpected Behavior Note

During the test, GPT-5 unexpectedly offered to format the answers into a ready-to-send answer sheet (PDF/Word) with a name, phone, and email header. While unrelated to the evaluation itself, it highlights an important aspect of AI interaction: models sometimes shift context toward “helping” in ways that, if misunderstood or misused, could have ethical implications — such as masking the model’s role in producing the answers.


Reaching Top Programmer Scores — With Caveats

Over decades of testing thousands of programmers, I’ve seen only a handful score 70 or higher. GPT-5, with clarified input and review, scored 73/80, matching those elite human results.

But here’s the critical difference:

  • Top human programmers can detect, understand, and adapt their reasoning when confronted with an error — often learning from the mistake on the spot
  • GPT-5 cannot truly “learn” mid-test; it can revise an answer, but often repeats the same flawed logic
  • GPT-5, like LLMs in general, is a sophisticated statistical model for next-word prediction, with no built-in guarantee of consistency or correctness

This is not simply a matter of more training data or CPU power — it’s a structural limitation of current large language models.


Why These Limitations Cannot Be “Fixed”

It’s important to clarify that these fundamental flaws cannot be solved by adding training data or scaling computing resources. They stem from the statistical models at the core of LLM technology. Scaling data centers, training data, and alignment validation can reduce the error margin, but cannot cure the structural flaws that rule these models out for controlling critical assets.

The demonstrated lack of consistency, the lack of repeatability of results even with the same input, and the flaws in basic reasoning in some percentage of tests draw a clear line around where this type of technology can be applied in automation, now and in the future.


Industrial Implications

For industrial automation and critical systems, these findings have profound implications:

  1. Deterministic Requirements: Industrial control systems require the same input to always produce the same output. LLMs fundamentally cannot guarantee this.
  2. Error Recovery: When a compressor surge condition involves multiple interacting parameters, we need systems that can identify and correct errors systematically, not randomly generate new explanations.
  3. Audit Trails: The lack of consistency makes it impossible to create reliable audit trails for decision-making in regulated industries.
  4. Safety-Critical Applications: The “mostly right” nature of LLMs makes them unsuitable for autonomous control of safety-critical systems.
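The deterministic requirement in point 1 can be expressed as a simple repeatability check: run the same input many times and accept the decision logic only if every run agrees. The sketch below is illustrative only. The 8.0 bar limit, the function names, and the alternating stand-in are all assumptions made for the example; the stand-in is not a real model call, it merely mimics run-to-run variation.

```python
from collections import Counter
from itertools import cycle


def is_repeatable(decide, reading, runs=20):
    """Return (True, counts) only if every run yields the same output."""
    outcomes = Counter(decide(reading) for _ in range(runs))
    return len(outcomes) == 1, outcomes


def threshold_rule(pressure_bar):
    """Deterministic rule with a hypothetical trip limit of 8.0 bar."""
    return "TRIP" if pressure_bar > 8.0 else "RUN"


def make_varying_stand_in():
    """Stand-in for a non-repeatable decision source (NOT a real LLM call)."""
    answers = cycle(["TRIP", "RUN"])

    def decide(pressure_bar):
        # Ignores the input and alternates, to mimic inconsistent answers.
        return next(answers)

    return decide


rule_ok, rule_counts = is_repeatable(threshold_rule, 8.5)
model_ok, model_counts = is_repeatable(make_varying_stand_in(), 8.5)

print(rule_ok, dict(rule_counts))    # True {'TRIP': 20}
print(model_ok, dict(model_counts))  # False {'TRIP': 10, 'RUN': 10}
```

A check like this can only reject a decision source that disagrees with itself; passing it does not show that the answers are correct.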

Conclusion

GPT-5 is faster than any human candidate I’ve ever tested, and its average accuracy matches the very best programmers in my career. For many business and technical tasks, it is a transformative tool.

However:

  • It is not a PhD-level autonomous thinker
  • It cannot consistently recover from reasoning errors
  • It still needs expert human oversight for critical decisions
  • Its limitations are structural, not solvable through scale

Some commercial and investment narratives tend to overstate AI’s current abilities, emphasizing its successes while underplaying its weaknesses. Decision-makers should temper such claims with an understanding of the model’s structural limits — and the fact that expert human guidance remains essential.

In short: GPT-5 is amazing. It’s disruptive. It’s useful in almost every field. But the hype suggesting it can fully replace expert human judgment is misleading. The best results still come when an expert guides the AI, identifies possible errors, and applies domain knowledge to verify or correct its output.

Human oversight in the driver’s seat is still non-negotiable.


Visual Evidence

GPT-5’s Inconsistent Reasoning on the Letter Sequence

Error: Ignoring the first half of the dataset

Error: Correct letter for the wrong reason

Error: Failure to combine with the block-growing rule


Visual Evidence

Gemini: Multiple Author Credits on the Same Article

Giorgio Credit

Francesco Credit

Matteo Credit

Luca Credit


  • Article: “Beyond Prompt Engineering: The Entity Engineering Approach” – Blog Post
  • Article: “Beyond the Hype: Making AI Work in Industrial Automation” – Control Engineering
  • Video: Live Presentation on GPT-5 Evaluation – YouTube

Marc Taccolini is CEO & Founder of Tatsoft, bringing 30+ years of industrial software expertise from his work at Tatsoft and previously InduSoft (acquired by AVEVA). He conducts extensive testing on AI reasoning capabilities to understand their practical applications and limitations in industrial automation.