What's the best AI code review tool? First impressions from trying 8 tools including Greptile, Copilot, CodeAnt, Ellipsis, Sourcery, CodeRabbit, CloudAEye and Kypso.
I’ve recently been building out our new educational platform at BlueDot. I wanted to see if AI could help review code changes - and if so, what was the best tool to use. Here I’ve documented my first impressions trying out all the AI code review tools I could find!
As of May 2025, I recommended:
Greptile for PR reviews (despite noise, it catches the most real issues)
CodeRabbit for PR summaries (excellent visualizations and overviews)
I think these tools:
can save engineer time by summarising code changes, catching basic bugs or reporting code smells
raise quite a few false positives, but are helpful overall
are likely better than human engineers at reviewing more repetitive or boring content, or knowing niche information (e.g. that a weird browser API works differently to how you might expect)
aren’t as good as human engineers at detecting more subtle bugs across files (none found the severe subtle bug in my test), as well as giving feedback on larger architectural/strategic decisions
This was a small test on just one example PR rather than comprehensive research. Different tools may perform very differently in your specific use case, and all of these tools are rapidly evolving. See the limitations section below for more.
Summary Table
| System | Overall | Useful Catches | False Alarms | Comment Quality | Reliability | Wait |
| --- | --- | --- | --- | --- | --- | --- |
| Greptile | ⭐⭐⭐ | High | High | Good | 100% | 335s |
| GitHub Copilot | ⭐⭐⭐ | Low | Low | Good | 100% | 30s |
| CodeAnt | ⭐⭐ | Moderate | Moderate | Fair | 50% | 518s |
| Ellipsis | ⭐⭐ | Low | Moderate | Fair | 100% | 184s |
| Sourcery | ⭐⭐ | Low | Low | Excellent | 100%* | 243s |
| CodeRabbit** | ⭐⭐ | Moderate | High | Excellent | 75% | 285s |
| CloudAEye | ❌ | N/A | N/A | N/A | 0% | N/A |
| Kypso | ❌ | N/A | N/A | N/A | 0% | N/A |
*May 11th update: Sourcery initially had 50% reliability, but after posting this they fixed the bug causing this.
**August 9th update: CodeRabbit’s original score may have been negatively affected because it suppressed duplicate comments which had already been raised by other tools. We have updated the review to account for this. More details in Appendix B.
Methodology
To evaluate these AI code review tools, I tested them against a substantial pull request that added a new React application to an existing repository, and deployed it to Kubernetes with Pulumi.
The application is a markdown editor that allows users to edit and save blog posts and job listings, including the ability to upload image content to an S3 bucket.
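For context, the infrastructure half of the PR looked roughly like the following. This is a hypothetical sketch rather than the actual code under review: the resource names, container image and ports are placeholders.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical sketch of the kind of Pulumi resources the test PR added.
// Names, image and ports are placeholders, not the real BlueDot code.
const appLabels = { app: "markdown-editor" };

const deployment = new k8s.apps.v1.Deployment("markdown-editor", {
  spec: {
    replicas: 1,
    selector: { matchLabels: appLabels },
    template: {
      metadata: { labels: appLabels },
      spec: {
        containers: [{
          name: "markdown-editor",
          image: "ghcr.io/example/markdown-editor:latest", // placeholder image
          ports: [{ containerPort: 3000 }],
        }],
      },
    },
  },
});

const service = new k8s.core.v1.Service("markdown-editor", {
  spec: {
    selector: appLabels,
    ports: [{ port: 80, targetPort: 3000 }],
  },
});

export const serviceName = service.metadata.name;
```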
Test Case Design
The implementation contained at least 3 bugs:
Severe bug: the S3 region was not specified in the SDK configuration, which would prevent file upload functionality from working (sketched after this list)
Moderate bug: The jobs API route incorrectly uses the key ‘blog’ instead of ‘job’ due to a copy/paste error
Minor bug: The app name wasn’t properly updated in the local development template files
Additionally, there were opportunities for improvement in areas such as simplifying file upload logic and implementing better error handling.
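To make the first two bugs concrete, here is a hedged reconstruction of what they might have looked like. The actual code in the PR differed; the bucket name, region and function names below are illustrative only.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Severe bug (illustrative): no region passed to the S3 client, so uploads
// fail unless a region happens to be provided by the environment.
const s3 = new S3Client({}); // should be: new S3Client({ region: "eu-west-2" })

export async function uploadImage(key: string, body: Buffer): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: "example-uploads-bucket", // placeholder bucket name
    Key: key,
    Body: body,
  }));
}

// Moderate bug (illustrative): the jobs route was copy/pasted from the blog
// route and still reads the 'blog' field instead of 'job'.
export function parseJobRequest(body: { blog?: unknown; job?: unknown }) {
  return body.blog; // should be: body.job
}
```

Neither sketch is the code the tools actually reviewed; they just illustrate the kind of mistakes the tools were being asked to find.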
Testing Process
For each AI review tool, I performed two tests:
Requesting a review on an existing PR
Requesting a review on a newly created PR
I then tracked several metrics to evaluate each tool’s performance:
Detection capabilities:
Great catches: Severe bugs/vulnerabilities identified
Good catches: Real but non-severe issues identified
Meh flags: Technically correct but low-value nitpicks
False alarms: Completely incorrect assessments
Quality metrics:
Explanation clarity (1-5): How clearly were issues explained
Solution quality (1-5): Whether actionable fixes were provided
Operational metrics:
Reliability: Percentage of successful review completions
I didn’t expect this to be a problem, but I started tracking this after I found several tools failed on the existing PR. This is a very small sample size (n = 2), so take it with a grain of salt!
Response time: Seconds between request and delivery
Limitations
This evaluation represents a snapshot in time (May 2025). These tools are rapidly evolving, and so their performance may change over time.
The subjective nature of some metrics means there’s room for interpretation, though I’ve tried to be consistent and fair.
Testing multiple tools on the same PR created interaction effects (like duplicate comment suppression) that could skew results.
These metrics also capture only some of each tool's value - other factors like security, standards compliance, customisation, ease of setup and pricing might matter to you!
Key Takeaways
After evaluating these code review tools, several patterns emerged:
None caught the severe issue: no tool flagged the S3 region bug, suggesting AI reviewers still struggle with subtle (but important!) problems.
Lack of architectural guidance: None of the tools offered better approaches to the overall problem, something an experienced human reviewer might have suggested or considered. (E.g. rather than building a custom editor, try ...)
Generally you’ll have to make a sensitivity/specificity trade-off: tools with higher detection rates (like Greptile) tended to produce more false positives, while less noisy tools (like GitHub Copilot) missed more issues. The trade-off isn’t uniform, though: for example, Greptile raised more real issues than CodeRabbit for the same number of false alarms.
Still valuable for sense-checking: Despite limitations, these tools can catch common mistakes and provide an additional layer of verification.
These tools will almost certainly continue to evolve, so the above takeaways won’t be the case forever. With all these AI tools it’s worth checking in every few months to see where they can speed up your workflow.
Shameless plug: Want to learn more about AI?
If you’re interested in where AI is headed, you’re in the right place! We’re an education non-profit that offers free online courses about potentially transformative technologies.
Our 2-hour Future of AI Course offers a good introduction to anyone looking to get ahead of advancements in AI.
Appendix A: Detailed Results
| System | Great Catches | Good Catches | Meh Flags | False Alarms | Explanation (1-5) | Solution Quality (1-5) | Reliability | Response Time (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GitHub Copilot | 0 | 0 | 1 | 0 | 3 | 4 | 100% | 30 |
| Ellipsis | 0 | 1 | 1 | 1 | 2 | 3 | 100% | 184 |
| CodeRabbit | 0 | 0 | 2 | 6 | 4 | 2 | 75% | 285 |
| CodeAnt | 0 | 4 | 3 | 3 | 2 | 4 | 50% | 518 |
| Sourcery | 0 | 1 | 2 | 1 | 4 | 4 | 50% | 243 |
| Greptile | 0 | 6 | 9 | 6 | 3 | 3 | 100% | 335 |
| CloudAEye | N/A | N/A | N/A | N/A | N/A | N/A | 0% | N/A |
| Kypso | N/A | N/A | N/A | N/A | N/A | N/A | 0% | N/A |
The above are the original scores and have not been adjusted for the Sourcery and CodeRabbit updates (see above).
Appendix B: Tool-Specific Observations
Greptile (⭐⭐⭐)
Strengths: Highest number of good catches (6), showing strong detection capability
Weaknesses: Too noisy with high false alarm rate (6) and many low-value nitpicks (9)
Tweaking configuration to reduce sensitivity might improve signal-to-noise ratio
Initial indexing time was high (929s) but subsequent reviews faster. I’ve taken the lower time in the table above as this will be more representative of reviews on an ongoing basis.
GitHub Copilot (⭐⭐⭐)
Strengths: Fastest response time (30s) and unobtrusive suggestions
Weaknesses: Low detection rate with no substantive issues caught
Quality of suppressed suggestions was good, but actual comments were minimal
CodeAnt AI (⭐⭐)
Strengths: Good number of useful catches (4) and comfortable making larger suggestions
Weaknesses: Slow response time (518s), some irrelevant issues, and only 50% reliability
Created new repo tags without permission, which felt a bit annoying. If it had correctly tagged the PR with existing tags this would have been cool.
Not clear how to request a review on an existing PR - could not get this to work.
Ellipsis (⭐⭐)
Strengths: Caught the moderate bug (jobs API route issue)
Weaknesses: Low detection capability and poor explanation quality
Some valuable suggestions were available but suppressed
Sourcery (⭐⭐)
Strengths: Excellent explanation clarity (4/5) when it did make comments
Weaknesses: Didn’t find many important problems. Also, the first time it appeared to acknowledge the review request on the existing PR but never submitted a review [Update: this behaviour is now fixed as a result of this review!]. The second time it provided a guide by default rather than a review [Update: this is also fixed/was user error. By default it reviews against the base branch, but my test was set up against a non-standard branch for testing].
CodeRabbit AI (⭐⭐)
Strengths: Excellent PR summary with good sequence diagrams and a fun poem
Weaknesses: Highest false alarm rate (6) with suggestions that could have broken the code. The good suggestions it did have were suppressed and buried in noise, making it ineffective as a review tool. Also seemed to break when asked to post comments on the existing PR, although it gave an overview (counting this as a half failure).
August 9th update: CodeRabbit informed us of a mistake we had made, noting that several comments had been suppressed as duplicates:
We [CodeRabbit] investigated deeper and found the analysis PR that led to incorrect conclusions and assumptions. TLDR - [...] CodeRabbit takes existing comments on the PR into its context and that prevents it from posting any duplicate comments. This is what happened:
CodeRabbit initially skipped the code review because by default, it does not review PRs that are set to merge with non-default branches.
In the meantime, other code review bots ran their analysis and posted feedback.
After their feedback was posted, you manually ran the CodeRabbit review using chat.
CodeRabbit triggered the review and considered the existing context on the PR, which also included comments made by other tools. CodeRabbit is designed to suppress any duplicate feedback.
CodeRabbit posted most of its findings as “Duplicate comments”. In addition, the lower quality of existing comments would have also influenced the quality of CodeRabbit reviews. For instance, our prompts ask the LLM to consider existing comments on PR as hard facts.
You can see that your methodology was quite flawed, as these tools behave differently when they are in the same arena as other tools and will influence each other. We believe that other tools that you evaluated might have been impacted for similar reasons.
As a result, we re-reviewed its performance, accounting for the initially suppressed duplicate comments. It had caught more issues than we initially believed, but it had also produced more false alarms than we initially thought. We increased its star rating, ‘number of useful catches’ and ‘false alarms’ in the summary above.
CodeRabbit then followed up, saying that our updates partially addressed their concerns and requesting that we retract the article:
We ask that you retract the current analysis and, if you choose to proceed, redo it under conditions where tools cannot influence each other’s results.
[... later email]
The limited testing doesn’t cover the nuances of a real-world code review - CodeRabbit is the only fully agentic system capable of identifying issues through code exploration using shell scripts
[...]
Additionally, we request that you or the Bluedot team disclose any potential conflicts of interest, including relationships or sponsorships with the products reviewed. The article gives the impression that Greptile may have sponsored the entire exercise; if that is the case, it should be clearly disclosed. If there were no sponsorship and your evaluations still indicate that Greptile performs better, then this exercise serves as yet another example of how evaluating agentic AI products can be difficult and often detached from their practical utility.
[... later email]
As written, it reads like a “first-impressions” blog post but is framed as comparative research. Those are very different standards.
I’ve spent time in academia at Penn, and when we published something that turned out to be incorrect, the norm was to retract quickly and follow with a transparent post-mortem explaining what went wrong. If academic integrity is the bar you want to meet here, the right next step is to acknowledge how nuanced it is to evaluate agentic AI systems and retract the report. Evals optimized for simple, single-prompt setups don’t translate to longer-horizon, multi-step agent behavior—especially when tools interact (e.g., duplicate-suppression, order effects, shared context).
Concretely, two paths:
Position it as “first impressions.” Reframe the article title and intro accordingly, remove rankings and research claims, and add clear disclaimers about scope and limitations. [...]
Treat it as research. Retract the current piece and publish a short follow-up explaining why agentic evals are hard (isolation vs. interaction, order/temporal effects, reproducibility, environment control, longer-horizon tasks). [...]
As a result, we:
added a section explaining that this exercise was not sponsored
tweaked the title (before: ‘What’s the best AI code review tool? An independent evaluation of 8 tools including Greptile, Copilot, CodeAnt, Ellipsis, Sourcery, CodeRabbit, CloudAEye and Kypso.’; after: ‘What’s the best AI code review tool? First impressions from trying 8 tools including Greptile, Copilot, CodeAnt, Ellipsis, Sourcery, CodeRabbit, CloudAEye and Kypso.’)
tweaked the introduction sentence slightly (before: ‘I’ve recently been building out our new educational platform at BlueDot. I wanted to see if AI could help review code changes - and if so, what was the best tool to use. Here I’ve documented my results trying out all the AI code review tools I could find so you don’t have to!’; after: ‘I’ve recently been building out our new educational platform at BlueDot. I wanted to see if AI could help review code changes - and if so, what was the best tool to use. Here I’ve documented my first impressions trying out all the AI code review tools I could find!’)
added a disclaimer to the introduction section (‘This is a small test on just one example PR rather than comprehensive research. Different tools may perform very differently in your specific use case, and all of these tools are rapidly evolving. See the limitations section below for more.’)
CloudAEye (❌)
Failure point: Connected to repository in admin portal but could not generate any PR reviews
Despite appearing to be properly set up, the tool did not function as expected. The docs in the admin portal were also broken.
Kypso (❌)
Failure point: Portal was broken and prevented account creation
Unable to even begin testing due to platform technical issues.
Appendix C: Independence and bias
This review was unsponsored and conducted in good faith using consistent methodology across all tools. We had no prior relationships with any tool providers except GitHub, who provided free repository hosting through their standard non-profits program (which they gave us years ago, unconnected to this review, and we do not think affected our judgement here).
Going in, we did not have strong expectations about which tools would perform better or worse.


