We are going to test two of the finest recently launched models for coding and reasoning, Grok 3 (Elon's AI, which he calls the smartest AI on Earth) and Claude Opus 4 (billed as the best AI model for coding), head-to-head and see how they compare.
But before starting out, what even is the Claude Opus 4 model?
Source: Anthropic
Let me give you a quick brief. Claude Opus 4 launched on May 22, 2025. It is said to be the best AI model for coding, capable of working on coding tasks autonomously for hours. It has a 200K token context window and scores 72.5% on the SWE-bench benchmark. Now that you know just enough about this model, let's see how it stacks up against Elon's claim of the smartest AI model on Earth, Grok 3.
Let's find out if Claude Opus 4 really is the best at coding, or if Grok 3 takes this title as well!
Stick around to see how the two compare on coding (both build tasks and algorithm problems) as well as logic tests.
Prompt: Build a simple Arkanoid-style game with a paddle, ball, and bricks. There should be a ball bouncing around, breaking blocks at the top while you control a paddle at the bottom. The paddle needs to follow the arrow keys movements or WASD keys.
Here's the code it generated: Link
Frankly, this is more than I expected from such a blunt prompt. It built everything perfectly, from the arcade UI to the ball bouncing and the physics.
The code is a bit messy as it threw all the CSS, JS, and HTML into a single file, but it works, and that's what matters for the testing.
Here's the code it generated: Link
This one seems good as well, except that the paddle does not respond to the arrow or WASD keys and works solely with the mouse. For some reason, it didn't follow that part of the prompt and added mouse-based paddle control instead.
This seemed like a small issue that could be fixed easily with iteration, so I followed up with another prompt and easily got the paddle working with the arrow keys.
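For reference, the missing piece is usually just a small keyboard handler. Here's a rough sketch of arrow/WASD paddle control, written in Python with pygame for brevity (the generated games were single-file HTML/JS, and all the names and values below are my own placeholders, not either model's code):

```python
import pygame

# Minimal arrow/WASD paddle control sketch (illustrative values only).
pygame.init()
screen = pygame.display.set_mode((800, 600))
clock = pygame.time.Clock()

paddle = pygame.Rect(350, 570, 100, 15)  # x, y, width, height
speed = 8

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # Poll the keyboard every frame; left/right arrows and A/D both move the paddle.
    keys = pygame.key.get_pressed()
    if keys[pygame.K_LEFT] or keys[pygame.K_a]:
        paddle.x -= speed
    if keys[pygame.K_RIGHT] or keys[pygame.K_d]:
        paddle.x += speed
    paddle.clamp_ip(screen.get_rect())  # keep the paddle inside the window

    screen.fill((0, 0, 0))
    pygame.draw.rect(screen, (255, 255, 255), paddle)
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
```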
Either way, this is no big deal; both models performed pretty well on this one.
💬 Quick refresher: I don't know about you, but I used to play this game a lot as a kid; it was called DX Ball back then. :)
I've seen many devs testing AI models with this kind of question, so I decided to give a similar one a shot with these two models.
Prompt: Make a Tron-themed ping pong game with two players facing each other inside a glowing rectangular arena. Add particle trails, collision sparks, and realistic physics like angle-based bounces. Use neon colors, and smooth animations.
Here's the code it generated: Link
This was definitely a tougher one to implement, and this model couldn't get it right, even after a few follow-up prompts. The ball's motion works, but the actual ping-pong play is missing, and there's a lot to fix here.
Not the best you could expect from this model, but it's okay. At least it got something working.
Here's the code it generated: Link
This one came as a real surprise: Grok 3 did a great job on this question. The overall UI might not be as polished as what we got from Claude Opus 4, but the plus is that every feature works as expected, unlike Claude Opus 4's attempt.
I love the work on this question from this model.
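As a quick aside, the "angle-based bounces" part of the prompt usually means the ball's outgoing angle depends on where it strikes the paddle, rather than just flipping the vertical velocity. Here's a rough illustration of that idea in Python (my own sketch with made-up constants, not code from either model):

```python
import math

def bounce_off_paddle(ball_x, paddle_x, paddle_width, speed, max_angle_deg=60):
    """Return a new (vx, vy) for the ball based on where it hits the paddle.

    Hitting the centre sends the ball straight up; hitting an edge sends it
    out at up to max_angle_deg from vertical. Constants are illustrative.
    """
    # -1.0 at the paddle's left edge, 0.0 at the centre, +1.0 at the right edge
    offset = (ball_x - (paddle_x + paddle_width / 2)) / (paddle_width / 2)
    offset = max(-1.0, min(1.0, offset))

    angle = math.radians(offset * max_angle_deg)
    vx = speed * math.sin(angle)
    vy = -speed * math.cos(angle)  # negative y is "up" in screen coordinates
    return vx, vy

# Example: ball lands right of centre on a 100px-wide paddle starting at x=350
print(bounce_off_paddle(ball_x=430, paddle_x=350, paddle_width=100, speed=7))
```

This is what makes rallies feel "physical": the player can aim the ball by choosing where on the paddle to meet it.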
Codeforces problems rated above ~1600 are generally somewhat harder than a LeetCode medium, so let's see how well these models perform on this question.
Claude is doing really well so far; let's find out whether it can tackle this one too.
Here's the prompt that I used:
You would like to construct a string s, consisting of lowercase Latin letters, such that the following condition holds:
For every pair of indices i and j such that s_i = s_j, the difference of these indices is even, that is, |i − j| mod 2 = 0.
Constructing any string is too easy, so you will be given an array c of 26 numbers — the required number of occurrences of each individual letter in the string s. So, for every i ∈ [1, 26], the i-th letter of the Latin alphabet should occur exactly c_i times.
Your task is to count the number of distinct strings s that satisfy all these conditions. Since the answer can be huge, output it modulo 998244353.
## Input
Each test consists of several test cases. The first line contains a single integer t (1 ≤ t ≤ 10^4) — the number of test cases. The description of the test cases follows.
Each test case contains 26 integers c_i (0 ≤ c_i ≤ 5·10^5) — the elements of the array c.
Additional constraints on the input data:
The sum of c_i for every test case is positive;
The sum of c_i over all test cases does not exceed 5·10^5.
## Output
For each test case, print one integer — the number of suitable strings s, taken modulo 998244353.
## Example
Input:
5
2 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 233527 233827
Output:
4
960
0
1
789493841
Note
In the first test case, there are 4 suitable strings: "abak", "akab", "baka" and "kaba".
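In case you want to sanity-check the expected outputs yourself: since equal letters must sit at indices of the same parity, each letter has to live entirely on the odd positions or entirely on the even positions. So you count the subsets of non-zero letters whose counts sum to exactly the number of odd positions, then multiply by the multinomial arrangements inside each parity class. Here's a rough reference sketch of that idea in Python (my own outline, not either model's submission, and only checked by hand against the small samples above):

```python
import sys

MOD = 998244353

def solve(c):
    n = sum(c)
    odd = (n + 1) // 2   # number of odd positions 1, 3, 5, ... (1-indexed)
    even = n // 2        # number of even positions 2, 4, 6, ...

    # Count subsets of non-zero letters whose counts add up to exactly `odd`
    # (each letter must go entirely to one parity class).
    dp = [0] * (odd + 1)
    dp[0] = 1
    for cnt in c:
        if cnt == 0:
            continue
        for s in range(odd, cnt - 1, -1):
            dp[s] = (dp[s] + dp[s - cnt]) % MOD
    splits = dp[odd]

    # For a fixed split, arrangements = odd! * even! / prod(c_i!):
    # a multinomial on each parity class, whose denominators multiply
    # out to prod(c_i!) regardless of how the letters were split.
    fact = [1] * (n + 1)
    for i in range(1, n + 1):
        fact[i] = fact[i - 1] * i % MOD
    denom = 1
    for cnt in c:
        denom = denom * fact[cnt] % MOD
    return splits * fact[odd] % MOD * fact[even] % MOD * pow(denom, MOD - 2, MOD) % MOD

def main():
    data = sys.stdin.buffer.read().split()
    t = int(data[0])
    out = []
    for i in range(t):
        c = list(map(int, data[1 + 26 * i: 1 + 26 * (i + 1)]))
        out.append(str(solve(c)))
    print("\n".join(out))

if __name__ == "__main__":
    main()
```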
Here's the code it generated: Link
No issues for Claude Opus 4 on this one either. I'm shocked at how good this model is at coding; it has managed to solve almost every coding problem I've thrown at it so far, including this one.
Here's the code it generated: Link
It literally failed on the very first test case for this problem, and you can clearly see the difference between the expected output and what its code produced. This is really disappointing from this model.
For coding especially, you can't go wrong choosing Claude Opus 4 over Grok 3. There are times when Grok 3 produces a better response, but in my overall experience with both models, Claude Opus 4 has consistently performed better.
And the problem with Grok 3 in coding seems to be the same for other folks as well:
https://x.com/theo/status/1891736803796832298
Enough of the coding tests; let's now see how good these two models are at reasoning and logic problems. I am going to test them on some super tricky reasoning questions.
Prompt: Here's a chess position. black king on h8, white bishop on h7, white queen on f7, black knight on g6, and black rook on d7, how does white deliver a checkmate in one move?
The answer to this question is: Queen to g8 (Qg8#).
This is how it looks on board:
This is going to be a bit tricky for the AI models, since working out a chess position requires actually visualizing the board, and models that aren't specifically built for chess tend to fail at it very often.
Let's see how these two compare here:
It failed this one miserably and answered Queen to f8, which is wrong: the knight on g6 simply captures the queen on f8.
Even something as simple as finding a mate in one is apparently hard for these LLMs to calculate, and here's a quick proof of that.
Same story with this model. It gave the exact same answer, Queen to f8, which is just as wrong.
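If you want to double-check the position yourself, a couple of lines with the python-chess library confirm it (note that I add a white king on a1 just to make the FEN legal, since the prompt doesn't place one):

```python
import chess

# The puzzle position, plus a white king on a1 so the FEN is legal
# (the original prompt doesn't mention one).
FEN = "7k/3r1Q1B/6n1/8/8/8/8/K7 w - - 0 1"

board = chess.Board(FEN)
board.push_san("Qg8")          # the move both models should have found
print(board.is_checkmate())    # True

board = chess.Board(FEN)
board.push_san("Qf8")          # the move both models actually gave
print(board.is_checkmate())    # False: the knight on g6 just takes the queen
```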
Prompt: A famous hotel has seven floors. Five people are living on the ground floor and each floor has three more people on it than the previous one. Which floor calls the elevator the most?
This is a trickier one. The trap here is to get the LLM to calculate the number of people on each floor and guess the 6th floor, but the answer is the ground floor: everyone living on the upper floors has to call the elevator from the ground floor to get home, so it racks up far more calls than any single floor above it.
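For reference, here's the head-count arithmetic the question baits the model into doing (a quick sketch; the floor populations follow directly from the prompt):

```python
# 5 people on the ground floor, and each floor above has 3 more than the one below.
people = {0: 5}
for floor in range(1, 7):   # floors 1 through 6 of the seven-floor hotel
    people[floor] = people[floor - 1] + 3
print(people)  # {0: 5, 1: 8, 2: 11, 3: 14, 4: 17, 5: 20, 6: 23}
```

That 23 on the 6th floor is the "obvious" but wrong answer both models are being led towards.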
Here's the response it generated:
As expected, the model fell into the trap of calculating the elevator calls for the floors based on the people living there and got the 6th floor, which is incorrect.
Here's the response it generated:
The same answer as Claude Opus 4, but with a bit more reasoning on how it arrived at the 6th floor and its 23 residents. However, this is again incorrect.
We didn't really get a solid winner for this section; both models failed similarly. However, Grok 3 still seems to be a bit better when it comes to reasoning compared to Claude Opus 4 in my overall usage.
It is clear that there are still edge cases that LLMs can't handle properly, resulting in completely incorrect answers. The fact that you can steer an LLM towards a specific wrong answer just by adding a little twist to the prompt is what, I feel, calls the "intelligence" in Artificial Intelligence into question.
Did we get a clear winner here? Yes, absolutely.
Claude Opus 4 is far better than Grok 3 when it comes to coding, which is pretty much what I expected. That said, Grok 3 is quite strong on reasoning questions and is decent at coding too; it just falls a little short of Claude Opus 4.
💡 A bit of an aside, but I've found that Claude Opus 4 performs noticeably worse than Grok 3 on anything outside coding and reasoning, such as writing. That could be a deciding factor for some of you.
What do you think, and which of these two models would you pick? Let me know in the comments!