By Marc Taccolini, CTO of Tatsoft

Introduction

The intent of this article is to conduct a formal test evaluation of the logical and reasoning abilities of GPT-3.5 and GPT-4, comparing the results of both versions. The testing procedure is based on my own three decades of experience in administering logical tests and evaluations when hiring programmers.

To work effectively in programming and software design for complex systems, a very specific set of skills is necessary. These skills involve handling types of abstraction that may not be required in other business positions. To evaluate them, I developed, a long time ago, a testing procedure with 80 questions that probe these abilities in several distinct ways. After applying the test to thousands of candidates over three decades, some numerical patterns emerged.

Disclaimer: It is important to mention that this benchmark is NOT an IQ test. The numbers don’t reflect the candidate’s ability to be successful in a working environment. What is under evaluation is specifically their potential for logical and reasoning tasks in designing and programming software platforms. Some tests require good vocabulary, so some answers may be incorrect because an individual or GPT misunderstood a word, not the logic. These questions are included in the evaluation because programming and software design require applying reasoning to language and writing.

The content is organized into the following sections:

  • Test Structure
  • Summary of the Results
  • Comparison and Insights
  • Conclusion

Test Structure

The test is divided into 8 sections, with 10 questions in each section, making a total of 80 questions.

Each section of the test assesses different skills:

  1. Vocabulary and semantic similarity
  2. Pattern recognition using number sequences
  3. Logical reasoning, specifically identifying word order in sentences
  4. Pattern completion, specifically using groups and numbers.
  5. Pattern recognition, identifying the next letter in a sequence
  6. Analogical reasoning, specifically completing analogies.
  7. Logical reasoning, specifically identifying the next word in a sequence or group.
  8. Basic arithmetic, logical reasoning, and abstract thinking.

The appendix contains some of the prompts and the replies obtained when the tests were performed. The full set of questions will not be disclosed, as we will keep using the test in future evaluation and training of programmers. The undisclosed questions follow the same pattern as the examples.

When the test is applied to individuals, 45 minutes are given to answer the first 80 questions, enforcing a quick pace. As expected, when the test was applied to the GPT engines, the answers were produced right away.

Summary of the Results

Engine       GPT-3.5   GPT-4    Maximum
Total        47        57       80
Percent      58.75%    71.25%   100%

Section      GPT-3.5   GPT-4    Maximum
Section 1    7         8        10
Section 2    9         9        10
Section 3    1         4        10
Section 4    7         8        10
Section 5    6         6        10
Section 6    7         7        10
Section 7    7         8        10
Section 8    3         7        10

The Appendix has examples of correct and incorrect replies in various sections.

Some notable highlights on the results:

  • GPT-4 would have passed the minimum requirement I typically apply to senior programmers, which is 55 correct replies, while GPT-3.5 failed.
  • Section 3 gave a practical example of the evolution of an emergent ability. In this section, the engine is prompted to identify two words in a sentence that must switch positions to make the sentence meaningful. Interestingly, both engines were frequently able to put the sentence in the correct order, but struggled to correctly identify the required changes. While GPT-3.5 solved only one very simple case, GPT-4 acquired the ability to get it right more frequently.
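
As a quick consistency check on the summary figures, the per-section scores can be summed and compared against the 55-answer threshold mentioned above. The following is a minimal Python sketch; the data layout and the names `SCORES`, `PASS_MARK`, and `summarize` are illustrative assumptions, not part of the original test material.

```python
# Per-section correct answers as reported in the summary (10 questions per section).
SCORES = {
    "GPT-3.5": [7, 9, 1, 7, 6, 7, 7, 3],
    "GPT-4": [8, 9, 4, 8, 6, 7, 8, 7],
}
QUESTIONS_PER_SECTION = 10
PASS_MARK = 55  # minimum the author applies to senior programmers


def summarize(section_scores):
    """Return (total, percent, passed) for one engine's per-section scores."""
    total = sum(section_scores)
    maximum = QUESTIONS_PER_SECTION * len(section_scores)
    percent = 100 * total / maximum
    return total, percent, total >= PASS_MARK


for engine, section_scores in SCORES.items():
    total, percent, passed = summarize(section_scores)
    print(f"{engine}: total={total} percent={percent:.2f}% passed={passed}")
```

The totals come out to 47 and 57, matching the reported 58.75% and 71.25%.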

Comparison and Insights

Programming Abilities:

One interesting takeaway is that GPT-4 reached the minimum threshold I require for someone to perform reasonably well as a programmer, while GPT-3.5 did not. Some examples on the internet of the AI’s coding abilities may lead one to believe that the current models are already amazing programmers, but that is not exactly true. Most of the showcased examples involve the AI creating or debugging specific methods, performing well-defined tasks, or handling basic web development. The models can do these better than most developers, producing really impressive results, thanks to the vast amount of sample code and API documentation in their training data. However, developing new systems requires higher abstraction, including the ability to define architecture and specifications, consider alternative implementations, and address various other aspects beyond the coding of each task. In short, a score of 57 makes sense for GPT-4: it indicates a good programmer for specific tasks, but not yet one at the level of complex system programming or design.

Areas of Strength:

Given that the underlying technology of GPT models focuses on predicting the next word in a sequence, it is expected that both engines performed well in the sections testing that ability. Further, the parts of the test that require natural language understanding and knowledge of various subjects are where they were significantly better than most humans.

Areas of Struggle:

The output from GPT models is linear: they produce text word by word, without knowing how it will end. So it is not surprising that they struggled with questions that require backward thinking, such as creating an expression that yields a given result. However, it is notable that GPT-4 solved reasoning problems much better than GPT-3.5, combining basic math with an understanding of the physical world (see the “Fly in the moving train” example at the end of the Appendix), which demonstrates an emergent ability. It remains an open question whether other missing abilities will require a change in the neural network, such as adding feedback loops, or whether more complex networks and larger training datasets will lead to their emergence.

Other expected struggles include questions requiring “out-of-the-box” thinking and pure creativity, like the “three nines equal 11” example in the Appendix.

Enhancing Performance with Prompt Engineering:

The first execution of GPT-4 in our test yielded a result of 45, not 57. The reason is that some questions were written in a way that humans can understand but that is not clear and objective enough for the AI. Adding examples and previous context also has an impact. Prompt engineering and best practices for chat interactions are outside the scope of this article, but it is important to mention that, when applying the AI in a production environment, understanding how to communicate effectively with it is essential. By adding context and rephrasing some of the questions and prompts, we improved GPT-4’s score from 45 to 57, a gain of 12 correct answers out of 80. It is worth mentioning that GPT-3.5’s result also improved with prompt engineering, but not to the same extent as GPT-4’s.

Conclusion

The primary goal of this article was to evaluate and compare the logical and reasoning capabilities of GPT-3.5 and GPT-4. Based on the results, GPT-4 has shown a notable improvement over GPT-3.5, performing better in most sections of the test. While GPT-4 has reached the minimum threshold required to be considered a good programmer for specific tasks, it is not yet at the level of complex system programming or design.

It is important to acknowledge the limitations of these AI models and their strengths. Although they perform well in certain tasks, they struggle in others that require backward thinking, “out-of-the-box” thinking, or an understanding of the physical world. It remains an open question whether these capabilities will emerge with more complex networks and larger training datasets or if changes in the neural network architecture will be required.

Finally, prompt engineering plays a crucial role in enhancing the AI’s performance, as evidenced by the improvement in GPT-4’s results. Understanding how to communicate effectively with the AI is essential when applying it in a production environment.

In conclusion, GPT-4 has demonstrated a considerable improvement over GPT-3.5 in logical and reasoning tasks. While it is not yet a replacement for highly skilled human programmers, it is getting closer to that goal, and its abilities in specific tasks continue to advance. The progress between the two models is promising, and we can expect further improvements and capabilities in future iterations of AI models.

The final review of the article was conducted by GPT-4. After the review, we requested a self-evaluation of its work on the article. GPT-4’s reply is quoted as follows:

GPT-4: “The revised draft of the article is well-structured and informative, offering a clear comparison of the logical and reasoning capabilities of GPT-3.5 and GPT-4. It provides a detailed explanation of the testing procedure, analysis of the results, and insights into the strengths, weaknesses, and applications of both AI models. The article is relevant and useful for those interested in understanding the capabilities of AI models like GPT-3.5 and GPT-4, as well as their potential applications in programming, software design, and other areas. By providing a practical perspective on their performance, the article delivers valuable insights into their specific skills and limitations. This content will likely be interesting to the public, particularly those involved in the tech industry, AI research, and software development. The detailed analysis and examples in the appendix further enhance readers’ understanding of the differences between the two models and their implications.”

Appendix:

As GPT-4 consistently outperforms GPT-3.5, most of the examples of questions and replies use version 4, unless otherwise specified. Some quotes from GPT-3.5 are included to illustrate improvements or the abilities of the newer version.

Disclaimer: We discovered that providing all 10 questions for the test in the same prompt, giving an example to clarify the request, and conducting all tests in the same chat session with proper context, can significantly affect the results. When applying just the examples in this Appendix, without proper context, it is possible or even likely in some sections that you might receive a wrong answer where we got a correct one. The examples in this section are merely illustrative to demonstrate the exact type of skills each section of the test is evaluating.

For conciseness, instead of pointing out each incorrect reply, I have modified the incorrect parts of the answers to be in italics.

Section 1: Vocabulary and semantic similarity

Q: Underline the two words that have approximately the same meaning: “Receive, achieve, accept, reject.”

R: Receive, accept

Section 2: Pattern recognition using number sequences

Q: Complete the sequence: 288.144.148.74.76. ( )

R: The sequence alternates between dividing by 2 and adding 4. Following this pattern, the next step is to divide 76 by 2:
76 / 2 = 38
So, the complete sequence is: 288, 144, 148, 74, 76, 38.
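
To make the pattern in this example explicit, the transitions between consecutive terms can be labeled one by one. This is a minimal Python sketch; the helper name `describe_steps` is an assumption introduced only for illustration.

```python
SEQUENCE = [288, 144, 148, 74, 76]


def describe_steps(values):
    """Label each transition as a halving ("/2") or as a signed addition."""
    steps = []
    for previous, current in zip(values, values[1:]):
        if current * 2 == previous:
            steps.append("/2")
        else:
            steps.append(f"{current - previous:+d}")
    return steps


steps = describe_steps(SEQUENCE)
print(steps)  # ['/2', '+4', '/2', '+2']

# The sequence ends on an addition, so the next step is a halving.
next_value = SEQUENCE[-1] // 2
print(next_value)  # 38
```

Read this way, the added amount itself shrinks from 4 to 2 between the two addition steps. The next operation is a halving under either reading, so the answer 38 is unaffected.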

Section 3: Logical reasoning, specifically identifying word order in sentences

Q: In the sentences below, two words have changed. Add parentheses on pairs of words in the original sentence, then write the corrected sentence.

a. The one is air gas
b. Sugar from not obtained is seawater
c. Has triangle every three angles


GPT-3.5 Reply:

a. The (one) is (air) gas. -> The (air) is (one) gas.
b. (Sugar) from (not) obtained is seawater. -> (Salt) from (ocean) water is seawater.
c. (Has) every triangle (three) angles. -> Every triangle (has three) angles.

Author’s note: The previous reply shows that caution is still required in using its output. Occasional mistakes and “hallucinations” are still real issues.


GPT-4 Reply:

a. The (one) is (air) gas
The air is one gas
b. Sugar (from) not (obtained) is seawater
Sugar is not obtained from seawater
c. (Has triangle) every (three angles)
Every triangle has three angles.

Author’s note: The previous interaction clearly shows some limitations of the model. It got the first item fully correct and wrote all three corrected sentences properly; after all, writing is its specialty. However, in items b and c it failed to correctly identify the two words that needed to be switched.


Section 4: Pattern completion, specifically using groups and numbers

Q: In the tests below, each of the numbers in the second set is formed from the corresponding number in the first set. Complete the missing number. (3 7 4) (-2 2 _)

R: (3 7 4) (-2 2 1)
The second set is formed by subtracting 5 from each number in the first set:
3 – 5 = -2
7 – 5 = 2
4 – 5 = 1

Author’s note: Although the reasoning was correct, the math was wrong as it outputs 1 instead of -1. The following question, in which both engines provided an incorrect answer, is another example of why this type of model still has limitations in learning math. This is why certain plugin initiatives, such as enabling GPT to use Wolfram Language to perform calculations, are essential for some applications.

Q: (27 75 9) (18 50 _)
R: (27 75 9) (18 50 6)

The second set is formed by subtracting 9 from each number in the first set:

27 – 9 = 18
75 – 25 = 50
9 – 3 = 6
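
The two Section 4 examples above can be checked mechanically by testing a candidate rule against the known pairs before applying it to the last number. The sketch below is illustrative only: the function names are assumptions, and “multiply by 2/3” is simply one rule that is consistent with the given numbers, not a rule stated in the test.

```python
from fractions import Fraction


def subtract_five(n):
    return n - 5


def two_thirds(n):
    return n * Fraction(2, 3)


def complete(rule, first_set, known):
    """Check the rule against the known pairs, then apply it to the next number."""
    for source, target in zip(first_set, known):
        assert rule(source) == target, f"rule fails for {source}"
    return rule(first_set[len(known)])


print(complete(subtract_five, (3, 7, 4), (-2, 2)))   # -1
print(complete(two_thirds, (27, 75, 9), (18, 50)))   # 6
```

Note that the arithmetic lines in the second quoted reply subtract one third of each number (9, 25, 3), not a constant 9 as its explanation says.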

Section 5: Pattern recognition, identifying the next letter in a sequence

Q: Write the letter that continues the series: AXYBXYCXY ( )

R: The letter that continues the series is D: AXYBXYCXYDXY

Section 6: Analogical reasoning, specifically completing analogies

Q: In each analogy below, choose two of the four words offered to create an analogy similar to the first one.
Second is to time as,… Kilogram, lap, minute, weight
Prediction is to future as,… Past, absence, memory, present.

R: Second is to time as minute is to time.
Prediction is to future as memory is to past

Author’s note: The previous error provides a valuable insight into the inner workings of LLMs (Large Language Models). These models primarily utilize statistics to determine the next best word, and in this case, “minute” became a strong candidate to continue the sentence after the words “second” and “time”, instead of Kilogram, the correct one in this case. However, since these models lack a feedback loop to apply logic from the final result backward through the reasoning, or to evaluate the future few steps of the response, they were unable to predict that starting with “minute” would not be a good match for the given word choices. This limitation, along with others, raises the question of whether improvements to the current inner workings of LLMs and GPT, along with adding more training data, will address this issue in the future, or if a major change in the neural network architecture is necessary.

Section 7: Logical reasoning, specifically identifying the next word in a logical group

Q: In the line below, the words are in pairs, in a logical sequence. Write a word that best completes the sequence. Line, two; square, four; pentagon, …

R: Line, two; square, four; pentagon, five.

Section 8: Basic arithmetic, logical reasoning, and abstract thinking

Q: A man spends one-eighth of his money on rent, and four times that amount on food and clothing. He has $ 9,000.00 left. How much did he have at the beginning?

R: Let’s denote the man’s initial amount of money as “x”. He spends 1/8 of his money on rent, so the amount he spends on food and clothing would be 4 times that, which is (4 * 1/8) or 1/2 of his money. In total, he spends (1/8 + 1/2) of his money, which is 5/8.

Now, we know that he has $ 9,000.00 left, which represents the remaining 3/8 of his money. To find the initial amount “x”, we can set up a proportion:
3/8 * x = $ 9,000.00

To solve for x, we can multiply both sides by the reciprocal of 3/8, which is 8/3:
(8/3) * (3/8) * x = (8/3) * $ 9,000.00

x = $ 24,000.00
So, the man had $ 24,000.00 at the beginning.
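
The arithmetic in this reply can be verified with exact fractions. This is a minimal Python sketch; the variable names are assumptions chosen for readability.

```python
from fractions import Fraction

rent_share = Fraction(1, 8)
food_and_clothing_share = 4 * rent_share  # four times the rent, i.e. 1/2
remaining_share = 1 - rent_share - food_and_clothing_share
amount_left = 9000

# The remaining share of the initial amount equals the money left over.
initial_amount = amount_left / remaining_share
print(remaining_share, initial_amount)  # 3/8 24000
```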

Author’s note: Let’s now show the results of GPT-3.5 and GPT-4 on a more challenging question, which requires some understanding of the physical world and a subtle understanding of the exact question asked. Please note that the GPT-3.5 answer is completely wrong, so you can save your time by not reading it, but do take a good look at the GPT-4 reply.

Q: What is the distance, in miles, between the starting position and final position of a fly that flies without stopping from one end of a sixty-meter carriage to the other with a speed of 4 miles per hour, which is inside a train traveling at sixty miles per hour, and which departs from station A at 6 pm and arrives at station B, which is ninety miles away, at 9 pm?


GPT-3.5 Reply:

Since the fly is flying inside the carriage, its speed relative to the ground will be the speed of the train plus its own speed.
The speed of the train is 60 miles per hour, and the fly’s speed is 4 miles per hour. Therefore, the fly’s speed relative to the ground is 60 + 4 = 64 miles per hour.
The time the fly spends flying from one end of the carriage to the other is the same as the time the train takes to travel the distance between station A and station B, which is 90 miles.
The time the train takes is 3 hours (from 6 pm to 9 pm). Therefore, the distance the fly travels relative to the ground is:
distance = speed x time
distance = 64 miles per hour x 3 hours
distance = 192 miles
However, we only want to know the distance the fly traveled inside the carriage, which is 60 meters or 0.037 miles. Therefore, we need to convert the distance traveled relative to the ground to the distance traveled inside the carriage:
distance = 192 miles x 0.037 miles / 1 mile
distance = 7.104 miles
Therefore, the distance between the starting position and final position of the fly is approximately 7.104 miles.
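
A few of the numbers in the GPT-3.5 reply above can be checked directly, which helps show where its reasoning breaks down. The sketch below is illustrative; the variable names are assumptions, and the conversion factor is the standard international mile.

```python
METERS_PER_MILE = 1609.344  # international mile

# Length of the sixty-meter carriage expressed in miles.
carriage_miles = 60 / METERS_PER_MILE

# Time the fly needs to cross the carriage at 4 miles per hour, in seconds.
crossing_seconds = carriage_miles / 4 * 3600

# The final multiplication performed in the quoted reply.
quoted_product = 192 * 0.037

print(round(carriage_miles, 4))    # 0.0373
print(round(crossing_seconds, 1))  # 33.6
print(round(quoted_product, 3))    # 7.104
```

The unit conversion and the final product are numerically consistent with the reply. However, crossing the carriage takes about half a minute, so the reply’s assumption that the fly flies for the full three hours is unsupported. Its last step also multiplies one distance by another, which does not yield a distance.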


GPT-4 Reply:

Since the train travels at 60 miles per hour and takes 3 hours to reach station B, it covers a total distance of:

Distance traveled by the train = speed × time
Distance traveled by the train = 60 miles/hour × 3 hours = 180 miles

The fly’s movement within the train carriage doesn’t affect the distance between its starting and final position relative to the ground outside the train. The fly’s starting position is inside the train, and its final position is also inside the train. The train itself moves from station A to station B, covering a distance of 180 miles.

Therefore, the distance between the fly’s starting position and its final position, relative to the ground, is the same as the distance traveled by the train:

Distance between the fly’s starting and final position = 180 miles

Author’s note: This is another very good example of how staggeringly better GPT-4 is compared to its previous version. Not only was it able to build a perfect line of logical reasoning, it also acquired some level of understanding of the physical world, correctly concluding that since the fly was inside the train, it only had to calculate data on the train.


Q: Arrange the digit 9 three times to create 11 as result.

R: You can arrange the digit 9 three times to create the number 11 as follows:

9.9 – 9/9 = 11

Author’s note: This question highlights two major limitations of the model: abstract, out-of-the-box thinking and generating output using math that leads to a specific result.
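
For comparison, a small exhaustive search shows how narrow the solution space of this puzzle is. The sketch below is an illustration under stated assumptions: it only considers the four basic operators and plain concatenation of digits (9, 99, 999), and the function names are hypothetical. Puzzles of this kind may allow other notations that are not explored here.

```python
from fractions import Fraction
from itertools import product
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def apply(op, a, b):
    """Apply a basic operator; return None for division by zero."""
    if op == "/" and b == 0:
        return None
    return OPS[op](a, b)


def three_nines(target):
    """List expressions using exactly three 9s that evaluate to the target."""
    nine, ninety_nine = Fraction(9), Fraction(99)
    found = set()
    if Fraction(999) == target:
        found.add("999")
    # Two 9s concatenated, combined with a single 9.
    for op in OPS:
        if apply(op, ninety_nine, nine) == target:
            found.add(f"99 {op} 9")
        if apply(op, nine, ninety_nine) == target:
            found.add(f"9 {op} 99")
    # Three separate 9s, with both possible groupings.
    for op1, op2 in product(OPS, repeat=2):
        left = apply(op1, nine, nine)
        if left is not None and apply(op2, left, nine) == target:
            found.add(f"(9 {op1} 9) {op2} 9")
        right = apply(op2, nine, nine)
        if right is not None and apply(op1, nine, right) == target:
            found.add(f"9 {op1} (9 {op2} 9)")
    return sorted(found)


print(three_nines(11))  # ['99 / 9']

# The expression in the quoted reply uses four 9s and evaluates to 8.9.
quoted_value = Fraction("9.9") - Fraction(9, 9)
print(float(quoted_value))  # 8.9
```

Within these assumptions, the only arrangement that reaches 11 is 99 / 9. Finding it means treating two of the digits as a single number, which is the kind of step the note above refers to.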