
Claude Opus 4 vs Grok 3

May 26, 2025

8 min read

TL;DR

We're going to pit two of the finest recently launched coding and reasoning models head-to-head: Grok 3 (Elon Musk's AI, which he has called the smartest AI on Earth) and Claude Opus 4 (billed as the best AI model for coding), and see how they compare.

But before starting out, what even is the Claude Opus 4 model?

Source: Anthropic

Let me give you a quick brief. Claude Opus 4 launched on May 22, 2025. It is billed as the best AI model for coding, capable of working autonomously on code for hours, with a 200K-token context window and a 72.5% score on the SWE-bench coding benchmark. Now that you know just enough about this model, let's see how it compares to Grok 3, the model Elon Musk claims is the smartest AI on Earth.

Let's find out if Claude Opus 4 really is the best at coding, or if Grok 3 takes this title as well!

Stick around to see how they compare on coding (build and algorithm problems) and on logic tests.

Coding Problems

1. Arkanoid Game

Prompt: Build a simple Arkanoid-style game with a paddle, ball, and bricks. There should be a ball bouncing around, breaking blocks at the top while you control a paddle at the bottom. The paddle needs to follow the arrow keys or WASD keys.

Output: Claude Opus 4

Here's the code it generated: Link

Frankly, this is more than I expected from such a blunt prompt. It built everything nicely, from the arcade-style UI to the ball bouncing, the physics, and all.

The code is a bit messy, as it threw all the CSS, JS, and HTML into a single file, but it works, and that's what matters for this test.

Output: Grok 3

Here's the code it generated: Link

This one seems good as well, except that the paddle doesn't respond to the arrow or WASD keys and works solely with the mouse. For some reason, it didn't follow that part of the prompt and added mouse-based paddle control instead.

This seemed like a small issue that could be fixed easily on iteration, so I followed up with another prompt and quickly got the paddle working with the arrow keys.
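For reference, wiring a paddle to the arrow and WASD keys takes only a handful of lines of vanilla JavaScript. Here's a rough sketch of the idea (my own illustration, assuming a `paddle` object with `x` and `width` and a `canvas`, not the code either model generated):

```javascript
// Minimal sketch of keyboard-driven paddle control (illustration only).
// Assumes `paddle` has x/width and `canvas` is the game's drawing surface.
const keys = {};
document.addEventListener('keydown', (e) => { keys[e.key.toLowerCase()] = true; });
document.addEventListener('keyup', (e) => { keys[e.key.toLowerCase()] = false; });

function updatePaddle(paddle, canvas, speed = 7) {
  if (keys['arrowleft'] || keys['a']) paddle.x -= speed;
  if (keys['arrowright'] || keys['d']) paddle.x += speed;
  // Clamp the paddle to the playfield.
  paddle.x = Math.max(0, Math.min(canvas.width - paddle.width, paddle.x));
}
```

Call `updatePaddle` from the game loop and both the arrow keys and WASD work.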

Either way, this is no big deal; both models performed pretty well on this one.

💬 Quick refresher: I don't know about you, but I used to play this game a lot when I was a kid, and it was then named DX Ball. :)

2. 3D Ping Pong Game

I've seen many devs testing AI models with this kind of question, so I decided to give a similar one a shot with these two models.

Prompt: Make a Tron-themed ping pong game with two players facing each other inside a glowing rectangular arena. Add particle trails, collision sparks, and realistic physics like angle-based bounces. Use neon colors, and smooth animations.

Output: Claude Opus 4

Here's the code it generated: Link

This was definitely a tougher one to implement, and the model couldn't get it right, even after a few follow-up prompts. The ball's motion seems to work, but the actual ping pong gameplay is missing, and there's a lot to fix here.

Not the best you could expect from this model, but it's okay. At least it got something working.

Output: Grok 3

Here's the code it generated: Link

This one came as a real surprise: Grok 3 did a great job with this question. The overall UI might not be as polished as what we got from Claude Opus 4, but every feature works as expected, which Claude's version couldn't manage.

I love the work on this question from this model.
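If you're wondering what "angle-based bounces" boils down to in code, the usual trick is to map where the ball strikes the paddle to the outgoing angle. Here's a rough sketch of that idea (my own illustration with assumed `ball` and `paddle` objects, not code from either model):

```javascript
// Sketch of an angle-based paddle bounce (illustration only).
// The farther from the paddle's center the ball hits, the steeper the rebound angle.
function bounceOffPaddle(ball, paddle, maxAngle = Math.PI / 3) {
  // -1 at the paddle's left edge, 0 at the center, +1 at the right edge.
  const hitOffset = (ball.x - (paddle.x + paddle.width / 2)) / (paddle.width / 2);
  const angle = hitOffset * maxAngle;
  const speed = Math.hypot(ball.vx, ball.vy); // keep the same speed, change direction

  ball.vx = speed * Math.sin(angle);
  ball.vy = -speed * Math.cos(angle); // send the ball back toward the other player
}
```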

3. Competitive Programming

Codeforces problems rated above ~1600 are generally somewhat harder than LeetCode mediums, so let's see how well these models perform on this one.

Claude has been doing really well so far; let's find out whether it can tackle this one too.

Here's the prompt that I used:

You would like to construct a string s, consisting of lowercase Latin letters, such that the following condition holds:

For every pair of indices i and j such that s_i = s_j, the difference of these indices is even, that is, |i − j| mod 2 = 0.

Constructing any string is too easy, so you will be given an array c of 26 numbers — the required number of occurrences of each individual letter in the string s. So, for every i ∈ [1, 26], the i-th letter of the Latin alphabet should occur exactly c_i times.

Your task is to count the number of distinct strings s that satisfy all these conditions. Since the answer can be huge, output it modulo 998244353.

## Input

Each test consists of several test cases. The first line contains a single integer t (1 ≤ t ≤ 10^4) — the number of test cases. The description of the test cases follows.

Each test case contains 26 integers c_i (0 ≤ c_i ≤ 5⋅10^5) — the elements of the array c.

Additional constraints on the input data:

The sum of c_i for every test case is positive;
The sum of c_i over all test cases does not exceed 5⋅10^5.

## Output

For each test case, print one integer — the number of suitable strings s, taken modulo 998244353.

## Example

Input:

5
2 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 233527 233827

Output:

4
960
0
1
789493841

Note

In the first test case, there are 4 suitable strings: "abak", "akab", "baka" and "kaba".
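Before looking at what each model produced, here's a quick sketch of the standard approach as I understand it (my own outline, not either model's submission): since equal letters must sit at indices of the same parity, each used letter goes entirely into the odd-indexed or the even-indexed slots. A subset-sum DP counts the ways to split the letters between the two halves, and the arrangements inside each half are counted by two factorials divided by the letter-count factorials.

```javascript
// Sketch of the parity-split + subset-sum idea (illustration, not either model's code).
const MOD = 998244353n;

function modpow(base, exp, mod) {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

function countStrings(c) {             // c: array of 26 letter counts
  const n = c.reduce((a, b) => a + b, 0);
  const odd = Math.ceil(n / 2);        // how many positions 1, 3, 5, ... there are
  const even = Math.floor(n / 2);      // how many positions 2, 4, 6, ... there are

  // Factorials and inverse factorials modulo 998244353.
  const fact = [1n];
  for (let i = 1; i <= n; i++) fact.push((fact[i - 1] * BigInt(i)) % MOD);
  const invFact = fact.map((f) => modpow(f, MOD - 2n, MOD));

  // dp[s] = number of ways to choose a set of letters whose counts sum to s;
  // that set fills the odd positions, the remaining letters fill the even ones.
  const dp = new Array(n + 1).fill(0n);
  dp[0] = 1n;
  for (const cnt of c) {
    if (cnt === 0) continue;           // unused letters don't change the string
    for (let s = n; s >= cnt; s--) dp[s] = (dp[s] + dp[s - cnt]) % MOD;
  }

  // Arrangements inside each half: odd! * even! / product(c_i!).
  let ways = ((dp[odd] * fact[odd]) % MOD) * fact[even] % MOD;
  for (const cnt of c) ways = (ways * invFact[cnt]) % MOD;
  return ways;
}

// Sanity check against the first sample (a=2, b=1, k=1): expects 4.
const sample = new Array(26).fill(0);
sample[0] = 2; sample[1] = 1; sample[10] = 1;
console.log(countStrings(sample).toString()); // 4
```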

Output: Claude Opus 4

Here's the code it generated: Link

No issues for Claude Opus 4 on this one either. I'm shocked at how good this model is at coding; it has managed to solve almost every coding problem I've thrown at it so far, including this one.

Claude Opus 4 success on competitive programming

Output: Grok 3

Here's the code it generated: Link

It failed on the very first test case for this problem, and you can clearly see the difference between the expected output and what the code produced. That's really disappointing from this model.

Grok 3 failure on competitive programming

Summary: Coding

For coding especially, you can hardly go wrong choosing Claude Opus 4 over Grok 3. There are times when Grok 3 gives the better response, but in my overall experience with both models, Claude Opus 4 has consistently performed better.

And the problems with Grok 3's coding seem to hit other folks as well:

https://x.com/theo/status/1891736803796832298


Reasoning Problems

Enough coding tests; let's now see how good these two models are at reasoning and logic problems. I'm going to test them on some deliberately tricky reasoning questions.

1. Find Checkmate in 1 Move

Prompt: Here's a chess position. black king on h8, white bishop on h7, white queen on f7, black knight on g6, and black rook on d7, how does white deliver a checkmate in one move?

The answer to this question is queen to g8 (Qg8#): the queen lands right next to the king, the bishop on h7 covers g8 so the king can't capture it, and none of Black's pieces can take the queen.

This is how it looks on the board:

Chess position for checkmate in 1 move

This is going to be a bit tricky for the AI models to calculate, as it requires a lot of thinking outside the box, and models that aren't specifically built for chess seem to fail at it very often.

Let's see how these two compare here:

Output: Claude Opus 4

It failed this one miserably and gave the checkmate move as queen to f8, which is wrong: the knight on g6 simply captures the queen on f8.

Even something as simple as finding a mate in one is pretty hard for these LLMs to calculate, and here's the proof we got.

Claude chess failure

Output: Grok 3

Same story from this model. It gave the same answer, and it's just as wrong.

Grok 3 chess failure

2. Most Elevator Calls

Prompt: A famous hotel has seven floors. Five people are living on the ground floor and each floor has three more people on it than the previous one. Which floor calls the elevator the most?

This one is trickier. The trap is to get the LLM to tally up the residents of each floor and guess the 6th floor, but the intended answer is the ground floor: everyone who lives on any upper floor also calls the elevator from the ground floor whenever they head back up, so the ground floor racks up the most calls.
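For the record, the per-floor headcounts the trap walks you toward are 5, 8, 11, 14, 17, 20, and 23, so a model that just follows the arithmetic ends up pointing at the top (6th) floor and its 23 residents.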

Output: Claude Opus 4

Here's the response it generated:

As expected, the model fell into the trap of tallying elevator calls by the number of people living on each floor and answered the 6th floor, which is incorrect.

Claude elevator reasoning failure

Output: Grok 3

Here's the response it generated:

Grok 3 elevator reasoning failure

The same answer as Claude Opus 4, but with a bit more reasoning about how it arrived at 23 people on the top floor. It's still incorrect.

Summary: Reasoning

We didn't really get a solid winner in this section; both models failed in similar ways. That said, in my overall usage, Grok 3 still seems a bit better at reasoning than Claude Opus 4.

It's clear that there are still edge cases these LLMs can't handle properly, which leads to completely incorrect answers. The fact that you can steer an LLM toward a specific wrong answer just by adding a small twist to the prompt is, to me, what still puts the "artificial" in Artificial Intelligence.


Conclusion

Did we get a clear winner here? Yes, absolutely.

Claude Opus 4 is far better than Grok 3 when it comes to coding, which is what I expected. That said, Grok 3 is quite strong on reasoning questions and is no slouch at coding either; it just falls a little short of Claude Opus 4.

💡 A bit off-topic, but I've found that Claude Opus 4 performs noticeably worse than Grok 3 on tasks outside coding and reasoning, like writing. That could be a deciding factor for some of you.

What do you think, and which of these two models would you pick? Let me know in the comments!
