Gemini CLI and the Art of Reviewing Large Codebases

Why large code reviews get slow

The hard part of reviewing a big repo is not reading code. It is deciding what to read first.

When a diff touches a few files, you can usually trace the path by hand. When a change crosses services, hooks, helpers, and tests, the review bottleneck becomes context switching. I keep running into the same questions:

Which file actually owns the behavior?
Is this helper shared anywhere else?
Did the refactor change logic, or just move it around?
What should I compare this against?

That is where Gemini CLI earns its keep. I would not treat it as the final answer, but it is good at shrinking the search space. It can summarize a directory, spot likely call chains, and help compare related files without opening twenty tabs.

What Gemini CLI is good at

The strongest use case is not “review my code for me.” It is “help me focus my review.”

Finding the right slice of code

On a large repo, the first task is often locating the seam where a bug lives. Gemini CLI can help by answering narrower questions than a full-codebase search:

where a function is called
which file owns a route or handler
whether a utility is local or shared
which tests exercise the same path

That matters because a bad review usually starts with the wrong boundary. If you only inspect the changed file, you miss the caller. If you only inspect the caller, you miss the shared helper.
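Before asking the tool, I sometimes want a rough local answer to "where is this called, and is it shared?" The sketch below is one way to get it. It is a plain textual scan, not a real parser, and the names find_call_sites and is_shared are hypothetical helpers of mine, not part of Gemini CLI.

```python
import re
from pathlib import Path


def find_call_sites(root, func_name, suffixes=(".py", ".js", ".ts")):
    """Return (relative_path, line_number, line_text) for each textual call.

    This is a regex match, so it can be fooled by comments and strings.
    Lines that define the function (def / function) are skipped.
    """
    call = re.compile(r"\b" + re.escape(func_name) + r"\s*\(")
    definition = re.compile(r"\b(def|function)\s+" + re.escape(func_name) + r"\s*\(")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if call.search(line) and not definition.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def is_shared(hits):
    """Treat a helper as shared when calls appear in more than one file."""
    return len({path for path, _, _ in hits}) > 1
```

The list of hits gives you the files to hand to the tool next, which is usually a better starting boundary than the changed file alone.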

Comparing related files without losing context

I also use it to compare files that should stay in sync. A common pattern is a production function and its test file, or a server handler and a validation schema.

A useful prompt is simple: ask what is different, then ask whether those differences matter. That keeps the tool from drifting into broad summaries.

Comparison target	What you want to learn	Why it matters
handler vs test	whether the test covers the real branch	false confidence from shallow tests
old helper vs new helper	whether logic moved or changed	refactors that alter behavior
API schema vs client payload	whether fields still match	broken requests after deployment
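For the "old helper vs new helper" row, a cheap cross-check is to compare the two versions structurally rather than line by line. This is a minimal sketch that assumes the helpers are Python and uses the standard ast module. The function names are hypothetical, and a "moved" result only means the syntax tree is identical, not that every caller still behaves the same.

```python
import ast


def function_fingerprint(source, func_name):
    """Return a structural dump of the named function, or None if absent.

    ast.dump omits line and column numbers by default, so relocating a
    function or changing comments and whitespace does not alter the result.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            return ast.dump(node)
    return None


def moved_or_changed(old_source, new_source, func_name):
    """Classify a refactor of one function as 'moved', 'changed', or 'missing'."""
    old = function_fingerprint(old_source, func_name)
    new = function_fingerprint(new_source, func_name)
    if old is None or new is None:
        return "missing"
    return "moved" if old == new else "changed"
```

If this reports "changed" while a summary says the refactor only moved code, that disagreement is the part worth reading by hand.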

A practical review workflow

Start with a narrow question

I usually begin with one question, not a full review request.

For example:

“What changed in the auth flow between these two files?”
“Which code path can still reach this function?”
“What tests cover this branch?”

That is a better shape than “review everything.” Broad prompts produce broad answers, and broad answers are easy to trust too early.

Feed the tool enough surrounding code

The mistake I see most often is starving the model of context. If you only paste the changed lines, the answer will sound confident and still be incomplete.

Give it:

the changed function
the caller or route entry point
the relevant test
one neighboring helper if behavior is shared

That small amount of extra code usually reveals whether the change is structural or semantic.
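The checklist above can be turned into a small bundling step so you notice a missing piece before you ask anything. This is a sketch under my own assumptions: the role names ("changed", "caller", "test") and both function names are hypothetical, and how you hand the resulting text to the tool is up to you.

```python
from pathlib import Path

# Roles mirror the checklist above; the labels themselves are hypothetical.
REQUIRED_ROLES = ("changed", "caller", "test")


def missing_roles(sections):
    """Return the required roles that have no file in the bundle."""
    present = {role for role, _ in sections}
    return [role for role in REQUIRED_ROLES if role not in present]


def build_context(sections):
    """Concatenate (role, path) pairs into one labeled text block."""
    parts = []
    for role, path in sections:
        body = Path(path).read_text(encoding="utf-8").rstrip()
        parts.append(f"=== {role}: {Path(path).name} ===\n{body}\n")
    return "\n".join(parts)
```

If missing_roles returns anything, I add that file before prompting rather than letting the model guess at it.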

If the review question depends on authorization, validation, or state transitions, include the code that enforces those checks. The bug is often one layer away from the diff.

Verify the output against the source

This part matters more than the model itself. Treat Gemini CLI as a fast assistant, then check the source code for every claim that would affect a merge decision.

I look for three things:

Does the tool point at the correct file and function?
Does the cited behavior actually exist in the source?
Did it miss a side effect, test gap, or fallback path?

If any answer is shaky, I go back to the repo and confirm manually.
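The second check, whether the cited behavior actually exists, is easy to partly automate when you have asked the tool to quote code. The sketch below assumes you have copied each claim into a small dict with "file" and "quote" keys. That shape and the name check_claims are my own invention, not an output format of the tool.

```python
def _squash(text):
    """Collapse all whitespace so reformatted quotes still match."""
    return " ".join(text.split())


def check_claims(claims, sources):
    """Check each quoted snippet against the source text it cites.

    claims:  list of dicts with 'file' and 'quote' keys
    sources: dict mapping file name to that file's full text
    Returns a list of (file, status) pairs in the same order as claims.
    """
    results = []
    for claim in claims:
        source = sources.get(claim["file"])
        quote = _squash(claim["quote"])
        if source is None:
            status = "file not in review set"
        elif quote and quote in _squash(source):
            status = "quote found"
        else:
            status = "quote not found"
        results.append((claim["file"], status))
    return results
```

A "quote found" result only confirms the text exists. It says nothing about side effects, test gaps, or fallback paths, so the third question still needs a manual read.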

Where it helps and where it fails

Good uses for refactoring and triage

Gemini CLI is most useful when the task is structural:

identifying duplicate logic
summarizing a large refactor
tracing which tests are stale
comparing old and new implementations
triaging a bug report into likely files

That is because these tasks depend on pattern recognition across multiple files. The tool is good at that, but it is not good at assigning business meaning unless you supply it.

Common mistakes when trusting summaries

The failure mode is not usually total hallucination. It is quieter than that.

Common mistakes include:

assuming a summary covered every branch
treating “looks equivalent” as proof
missing async behavior or error handling
accepting test coverage at face value
forgetting that shared utilities have more callers than the current diff

In security reviews, this is especially dangerous. A summary can say a check exists when the check only runs in one code path. That is how a missing backend authorization check gets masked by a decent-looking frontend change.

Testing the workflow on a real repo

Notes on prompts and repeatability

For repeatable reviews, I keep prompts short and consistent. A good pattern is:

state the file names
state the exact question
ask for evidence, not conclusions

Example:

Review these two files and tell me whether the new validation changes behavior or only moves code. Quote the relevant functions and note any untested branch.

That produces more reliable output than asking for a general opinion. It also makes it easier to compare answers across revisions.
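To keep the wording identical from one revision to the next, the three-part pattern can live in a tiny template. This is a minimal sketch. The name render_prompt and the exact closing sentence are my own choices, and nothing here depends on a specific Gemini CLI feature.

```python
def render_prompt(files, question):
    """Render a short, consistent review prompt.

    File names are sorted so the same inputs always produce the same text,
    which makes answers easier to compare across revisions.
    """
    file_list = ", ".join(sorted(files))
    return (
        f"Files: {file_list}\n"
        f"Question: {question}\n"
        "Quote the relevant functions as evidence and note any untested branch."
    )


prompt = render_prompt(
    ["validation.py", "handlers.py"],  # hypothetical file names
    "Does the new validation change behavior or only move code?",
)
```

Because the template is fixed, a different answer on the next revision is more likely to reflect a code change than a prompt change.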

Keeping review findings actionable

A review note is useful only if someone can act on it. I try to write findings in a way that points to a file, a behavior, and a fix.

Good review notes sound like this:

src/routes/upload.js accepts the file type from the client, but the server never rechecks it before storage.
parseUser() now returns null for malformed input, but createUser() still assumes an object.
The new test covers the happy path, but not the retry branch added in the refactor.

That style keeps the review grounded in code instead of vague risk language.
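One way to hold yourself to the file, behavior, fix shape is to make it a record that refuses to count as done until all three are filled in. The Finding class below is a hypothetical sketch, and the suggested fix in the example is my own wording for the upload note above, not something taken from a real review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str      # where the problem lives
    behavior: str  # what the code does today
    fix: str       # what the author should change

    def is_actionable(self):
        """A finding is actionable only when all three parts are non-empty."""
        return all(part.strip() for part in (self.file, self.behavior, self.fix))

    def render(self):
        """Format the finding as a single review note."""
        return f"{self.file}: {self.behavior} Suggested fix: {self.fix}"


note = Finding(
    file="src/routes/upload.js",
    behavior="accepts the file type from the client without rechecking it.",
    fix="validate the file type on the server before storage.",  # illustrative only
)
```

A finding with an empty fix field is a signal that I have described a risk but not yet understood the code well enough to act on it.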

Conclusion

Gemini CLI is useful when a codebase is too large to hold in your head at once. It helps you find the right files, compare related paths, and narrow a review to the part that actually changed.

The important discipline is still human: ask narrow questions, provide enough surrounding code, and verify the output against the source. If you do that, the tool speeds up review without replacing judgment. If you do not, it just gives you a faster way to be wrong.