
Gemini CLI and the Art of Reviewing Large Codebases
Why large code reviews get slow
The hard part of reviewing a big repo is not reading code. It is deciding what to read first.
When a diff touches a few files, you can usually trace the path by hand. When a change crosses services, hooks, helpers, and tests, the review bottleneck becomes context switching. I keep running into the same questions:
- Which file actually owns the behavior?
- Is this helper shared anywhere else?
- Did the refactor change logic, or just move it around?
- What should I compare this against?
That is where Gemini CLI earns its keep. I would not treat it as the final answer, but it is good at shrinking the search space. It can summarize a directory, spot likely call chains, and help compare related files without opening twenty tabs.
What Gemini CLI is good at
The strongest use case is not “review my code for me.” It is “help me focus my review.”
Finding the right slice of code
On a large repo, the first task is often locating the seam where a bug lives. Gemini CLI can help by answering narrower questions than a full-codebase search:
- where a function is called
- which file owns a route or handler
- whether a utility is local or shared
- which tests exercise the same path
That matters because a bad review usually starts with the wrong boundary. If you only inspect the changed file, you miss the caller. If you only inspect the caller, you miss the shared helper.
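When the tool tells me a helper is only called in one place, I like to cross-check with a plain text scan. The sketch below is a rough regex search, not a parser: it assumes JavaScript or TypeScript sources, the function name you pass in is up to you, and it will miss dynamic calls while also matching the definition line and any mention inside comments.

```python
import re
from pathlib import Path


def find_call_sites(root, func_name, exts=(".js", ".ts")):
    """Return (path, line_number, line) for each line that appears to call func_name."""
    # \b stops "parseUser" from matching inside "reparseUser";
    # the trailing "\s*\(" requires an opening parenthesis after the name.
    pattern = re.compile(r"\b" + re.escape(func_name) + r"\s*\(")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files with other extensions.
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


# Example (hypothetical names): find_call_sites("src", "parseUser")
```

If the list is longer than the tool's answer, the helper is more widely shared than the summary implied, and the review boundary needs to move.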
Comparing related files without losing context
I also use it to compare files that should stay in sync. A common pattern is a production function and its test file, or a server handler and a validation schema.
A useful prompt is simple: ask what is different, then ask whether those differences matter. That keeps the tool from drifting into broad summaries.
| Comparison target | What you want to learn | Why it matters |
|---|---|---|
| handler vs test | whether the test covers the real branch | false confidence from shallow tests |
| old helper vs new helper | whether logic moved or changed | refactors that alter behavior |
| API schema vs client payload | whether fields still match | broken requests after deployment |
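The last row of the table is the easiest one to check mechanically. As a minimal sketch, assuming you have already pulled the field names out of the schema and the client payload by hand, a set difference shows which fields exist on only one side. The field names here are made up for illustration.

```python
def compare_fields(schema_fields, payload_fields):
    """Report fields that appear on only one side of the API boundary."""
    schema, payload = set(schema_fields), set(payload_fields)
    return {
        "missing_from_payload": sorted(schema - payload),
        "unknown_to_schema": sorted(payload - schema),
    }


# Hypothetical field names, for illustration only.
report = compare_fields(
    schema_fields=["email", "display_name", "role"],
    payload_fields=["email", "displayName", "role"],
)
print(report)
# {'missing_from_payload': ['display_name'], 'unknown_to_schema': ['displayName']}
```

A naming mismatch like this one is exactly the kind of difference that a broad summary tends to describe as "the fields match".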
A practical review workflow
Start with a narrow question
I usually begin with one question, not a full review request.
For example:
- “What changed in the auth flow between these two files?”
- “Which code path can still reach this function?”
- “What tests cover this branch?”
That is a better shape than “review everything.” Broad prompts produce broad answers, and broad answers are easy to trust too early.
Feed the tool enough surrounding code
The mistake I see most often is starving the model of context. If you only paste the changed lines, the answer will sound confident and incomplete.
Give it:
- the changed function
- the caller or route entry point
- the relevant test
- one neighboring helper if behavior is shared
That small amount of extra code usually reveals whether the change is structural or semantic.
If the review question depends on authorization, validation, or state transitions, include the code that enforces those checks. The bug is often one layer away from the diff.
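To make the four-part bundle above a habit, I find it helps to assemble it with a small script instead of copy and paste. This is a sketch under my own assumptions: the labels, the separator format, and the paths in the comment are all hypothetical, and nothing here is specific to Gemini CLI. It just produces one block of text you can hand to the tool.

```python
from pathlib import Path


def build_context(repo_root, labeled_paths):
    """Join several files into one text block, each under a label and its path."""
    sections = []
    for label, rel_path in labeled_paths:
        body = (Path(repo_root) / rel_path).read_text(encoding="utf-8")
        # The header line tells the reader (human or model) what role the file plays.
        sections.append(f"=== {label}: {rel_path} ===\n{body.rstrip()}\n")
    return "\n".join(sections)


# Example with hypothetical paths:
# build_context(".", [
#     ("changed function", "src/users/parse.js"),
#     ("caller", "src/routes/users.js"),
#     ("test", "test/users.test.js"),
#     ("shared helper", "src/lib/validate.js"),
# ])
```

Labeling each file by its role makes it harder to forget the caller or the test, which are the two pieces most often left out.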
Verify the output against the source
This part matters more than the model itself. Treat Gemini CLI as a fast assistant, then check the source code for every claim that would affect a merge decision.
I look for three things:
- Does the tool point at the correct file and function?
- Does the cited behavior actually exist in the source?
- Did it miss a side effect, test gap, or fallback path?
If any answer is shaky, I go back to the repo and confirm manually.
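The first two checks can be partly automated. The sketch below assumes the tool's answer quoted a file path and a code snippet, and it only confirms that the snippet really appears in that file. The function name and return labels are my own. The third check, missed side effects and fallback paths, still needs a human reading the source.

```python
from pathlib import Path


def _collapse(text):
    """Collapse all runs of whitespace so reformatted quotes still match."""
    return " ".join(text.split())


def check_claim(repo_root, rel_path, quoted_snippet):
    """Classify a claim as 'confirmed', 'snippet not found', or 'file not found'."""
    path = Path(repo_root) / rel_path
    if not path.is_file():
        return "file not found"
    snippet = _collapse(quoted_snippet)
    if not snippet:
        # An empty quote proves nothing, so never report it as confirmed.
        return "snippet not found"
    source = _collapse(path.read_text(encoding="utf-8", errors="ignore"))
    return "confirmed" if snippet in source else "snippet not found"
```

A result of "confirmed" only means the quoted code exists. It says nothing about whether the tool's interpretation of that code is right.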
Where it helps and where it fails
Good uses for refactoring and triage
Gemini CLI is most useful when the task is structural:
- identifying duplicate logic
- summarizing a large refactor
- tracing which tests are stale
- comparing old and new implementations
- triaging a bug report into likely files
That is because these tasks depend on pattern recognition across multiple files. The tool is good at that, but it is not good at assigning business meaning unless you supply it.
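For the "old vs new implementation" case, a crude independent check is to compare the two versions as bags of lines. This sketch is my own heuristic, not something Gemini CLI does: it ignores indentation, blank lines, and line order, so "moved" is a hint that the text was only rearranged, not proof that behavior is unchanged, because reordering statements can itself change behavior.

```python
from collections import Counter


def classify_change(old_source, new_source):
    """Return 'moved' if both versions hold the same lines in any order, else 'changed'."""

    def normalized_lines(source):
        # Ignore indentation and blank lines; count duplicates separately.
        return Counter(line.strip() for line in source.splitlines() if line.strip())

    if normalized_lines(old_source) == normalized_lines(new_source):
        return "moved"
    return "changed"


print(classify_change("a = 1\nb = 2\n", "    b = 2\n\n    a = 1\n"))  # moved
print(classify_change("if x > 0:\n", "if x >= 0:\n"))                # changed
```

When the result is "changed", that is the diff worth reading line by line. When it is "moved", the question shifts to whether the new location or order matters.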
Common mistakes when trusting summaries
The failure mode is not usually total hallucination. It is quieter than that.
Common mistakes include:
- assuming a summary covered every branch
- treating “looks equivalent” as proof
- missing async behavior or error handling
- accepting test coverage at face value
- forgetting that shared utilities have more callers than the current diff
In security reviews, this is especially dangerous. A summary can say a check exists when the check only runs in one code path. That is how a missing backend authorization check gets masked by a decent-looking frontend change.
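A stripped-down illustration of that failure, using an entirely hypothetical in-memory store and two made-up handlers: a summary of the first function would truthfully say "ownership is checked before deletion", while the second function reaches the same behavior with no check at all.

```python
# Hypothetical in-memory store and handlers, for illustration only.
DOCS = {1: {"owner_id": "alice", "body": "draft"}}


def delete_doc(user_id, doc_id):
    """Interactive path: checks ownership before deleting."""
    if DOCS[doc_id]["owner_id"] != user_id:
        raise PermissionError("not the owner")
    del DOCS[doc_id]


def bulk_delete(user_id, doc_ids):
    """Second path to the same behavior: user_id is accepted but never checked."""
    for doc_id in doc_ids:
        DOCS.pop(doc_id, None)
```

The review question to ask is not "is there a check?" but "can every path that deletes a document be shown to pass through it?"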
Testing the workflow on a real repo
Notes on prompts and repeatability
For repeatable reviews, I keep prompts short and consistent. A good pattern is:
- state the file names
- state the exact question
- ask for evidence, not conclusions
Example:
“Review these two files and tell me whether the new validation changes behavior or only moves code. Quote the relevant functions and note any untested branch.”
That produces more reliable output than asking for a general opinion. It also makes it easier to compare answers across revisions.
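One way to keep the three-part pattern consistent across reviews is to generate the prompt from a template. The wording below is my own assumption about what a reasonable template looks like, not a Gemini CLI feature, and the file names in the usage line are hypothetical.

```python
def build_review_prompt(files, question):
    """Build a prompt that names the files, asks one question, and requests evidence."""
    if not files:
        raise ValueError("name at least one file")
    file_list = "\n".join(f"- {name}" for name in files)
    return (
        f"Files:\n{file_list}\n\n"
        f"Question: {question.strip()}\n\n"
        "Quote the relevant functions as evidence and note any untested branch. "
        "Do not give an overall opinion."
    )


# Example with hypothetical file names:
print(build_review_prompt(
    ["src/validate_old.js", "src/validate_new.js"],
    "Does the new validation change behavior or only move code?",
))
```

Because only the file list and the question vary, two answers from different revisions can be compared side by side without wondering whether the prompt changed.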
Keeping review findings actionable
A review note is useful only if someone can act on it. I try to write findings in a way that points to a file, a behavior, and a fix.
Good review notes sound like this:
- `src/routes/upload.js` accepts the file type from the client, but the server never rechecks it before storage.
- `parseUser()` now returns `null` for malformed input, but `createUser()` still assumes an object.
- The new test covers the happy path, but not the retry branch added in the refactor.
That style keeps the review grounded in code instead of vague risk language.
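If you collect findings in a script or a review bot, a small record type can enforce the file, behavior, fix shape. This is a minimal sketch: the class name, field names, and the suggested fix text are my own illustration, built around the upload example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str      # where the problem lives
    behavior: str  # what the code does today
    fix: str       # what the author should change

    def render(self):
        """Format the finding as a single review comment."""
        return f"{self.file}: {self.behavior} Suggested fix: {self.fix}"


note = Finding(
    file="src/routes/upload.js",
    behavior="accepts the file type from the client without a server-side recheck.",
    fix="validate the file type on the server before storage.",
)
print(note.render())
```

Making all three fields required means a vague note like "this looks risky" cannot be constructed without first deciding which file and which behavior it refers to.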
Conclusion
Gemini CLI is useful when a codebase is too large to hold in your head at once. It helps you find the right files, compare related paths, and narrow a review to the part that actually changed.
The important discipline is still human: ask narrow questions, provide enough surrounding code, and verify the output against the source. If you do that, the tool speeds up review without replacing judgment. If you do not, it just gives you a faster way to be wrong.


