Replies: 12 comments
---
Hi @Veivel! Great to see another contributor interested in Idea 5 — your GPU profiling thesis background is really relevant here. I've also been exploring this area and wanted to share my findings to help move the discussion forward.

**About Me**

I'm Sundaram Mahajan (@SUNDRAM07), actively contributing to Gemini CLI:

**Codebase Analysis — What Already Exists**

You're right that a good chunk of instrumentation is already done. Here's what I found after a deep dive:
**Prototype Work**

I've built a working prototype locally that adds a

**Questions for Mentors**
---
Hi! I'm Manan, a student interested in GSoC Project #5 (Performance Monitoring and Optimization Dashboard). I've been exploring the Gemini CLI codebase, and the existing telemetry infrastructure is substantial.

My approach focuses on building a

I've already submitted contributions to the repo:

One question for the mentors: would you prefer the dashboard as a

Looking forward to feedback!
---
Hi Sehoon, I'm Manan, a final-year student at BITS Pilani. I'm really interested in GSoC Project #5 (Performance Monitoring and Optimization Dashboard). As someone who loves digging into performance bottlenecks and making things measurable, this project immediately stood out to me.

I've been exploring the Gemini CLI codebase, and I can see there's already a solid telemetry foundation in place — which makes me think the real challenge here is surfacing that data in a way that's actually useful to developers and CI pipelines. That's the part I'm most excited about. I've submitted 6 PRs to the repo (#20757, #20758, #20788, #20789, #20793, #20794) to get comfortable with the codebase and contribution process.

Would love to hear what you think the highest-priority aspect of this project is — the developer-facing dashboard, the CI regression detection, or something else? That would really help me focus my proposal.

Looking forward to your thoughts!

Manan
---
Quick update — I've pushed a working prototype to my fork on the
94 tests passing across 4 test suites. Built it to extend the existing

@sehoon38 would love your thoughts on whether this aligns with what you had in mind for the project; happy to adjust the approach based on feedback.
---
Hi everyone! Just to give a quick introduction: I'm a student at UMich currently working with Qualcomm through a university program (MDP), and I'll be interning at Oracle this summer. I've recently been taking a deeper dive into open source, and it has been incredibly fulfilling so far. It's awesome to see how much thought @Veivel, @SUNDRAM07, and @aishop-lab are putting into this dashboard and the different approaches being explored.

I've been tackling this from a slightly different architectural angle. To avoid building custom aggregators or manual rolling windows, I built a PoC that safely intercepts the CLI's existing OTel pipeline. I attached an

I plan to introduce a standalone

I've opened a Draft PR showing this backend plumbing. My next step is to build out that standalone command and wire the

cc: @sehoon38
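For anyone curious what that interception could look like, here is a rough sketch of the idea: a span processor that tees finished spans into an in-memory rolling window, leaving the normal exporters untouched. The interfaces below are trimmed local stand-ins for the real `@opentelemetry/sdk-trace-base` types so the snippet stands alone; the names are mine, not the PR's.

```typescript
// Stand-in for OTel JS types, so this sketch runs without the SDK installed.
interface ReadableSpan {
  name: string;
  startTime: [number, number]; // HrTime: [seconds, nanoseconds]
  endTime: [number, number];
}

interface SpanProcessorLike {
  onEnd(span: ReadableSpan): void;
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

// Records per-span-name durations in a capped rolling window.
class InMemorySpanTap implements SpanProcessorLike {
  private durationsMs = new Map<string, number[]>();
  constructor(private readonly capacity = 1000) {}

  onEnd(span: ReadableSpan): void {
    const start = span.startTime[0] * 1e3 + span.startTime[1] / 1e6;
    const end = span.endTime[0] * 1e3 + span.endTime[1] / 1e6;
    const bucket = this.durationsMs.get(span.name) ?? [];
    bucket.push(end - start);
    if (bucket.length > this.capacity) bucket.shift(); // rolling window
    this.durationsMs.set(span.name, bucket);
  }

  // A dashboard command would poll this snapshot for rendering.
  snapshot(): Record<string, number[]> {
    return Object.fromEntries(this.durationsMs);
  }

  forceFlush(): Promise<void> { return Promise.resolve(); }
  shutdown(): Promise<void> { return Promise.resolve(); }
}
```

In the real CLI this would be registered alongside the existing processors rather than replacing them, so nothing downstream changes.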
---
Great progress from everyone here! Quick update from my side, and some observations that might be useful for all of us.

**My Update**

I've opened a Draft PR #21262 with the backend plumbing for:

42 tests across 3 test suites, core build clean.

**Some observations that might help everyone**

After spending time in the telemetry codebase, a few things stood out:
@anthonychen000 — The OTel-based approach with

@aishop-lab — Your

Looking forward to seeing how everyone's approaches evolve! 🚀
---
Good question on baseline persistence — yes, the

The serialization itself is straightforward once you have the type guard — the more interesting design question is when to update the baseline. My thinking is it should only happen on merge to main, not on every PR, to avoid baseline drift from feature branches.

On the v8 heap limit point — agreed that

Cost estimation is an interesting addition — that's a real gap for users running long sessions on 2.5 Pro.
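To make the type-guard point concrete, here is a minimal sketch of baseline persistence: a guard that rejects malformed or stale files instead of crashing on them. The schema fields are placeholders, not the prototype's actual ones.

```typescript
import * as fs from "node:fs";

// Hypothetical baseline shape; the real schema would come from the
// prototype's metrics types.
interface PerfBaseline {
  commit: string;
  startupMs: number;
  p99LatencyMs: number;
}

// Type guard: anything that fails validation is treated as "no baseline".
function isPerfBaseline(value: unknown): value is PerfBaseline {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.commit === "string" &&
    typeof v.startupMs === "number" &&
    typeof v.p99LatencyMs === "number"
  );
}

function loadBaseline(path: string): PerfBaseline | undefined {
  if (!fs.existsSync(path)) return undefined;
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(path, "utf8"));
    return isPerfBaseline(parsed) ? parsed : undefined;
  } catch {
    return undefined; // corrupt file: fall back to "no baseline"
  }
}

function saveBaseline(path: string, baseline: PerfBaseline): void {
  // Intended to run only on merge to main, so feature branches never
  // shift the reference point.
  fs.writeFileSync(path, JSON.stringify(baseline, null, 2) + "\n");
}
```

The "update only on merge to main" policy then just means `saveBaseline` is gated behind a CI branch check.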
---
Hi all, great to see all these different approaches evolving!

@SUNDRAM07: Good catch on the OTel time mismatch! I handled it in my initial PR by calling

As for implementing custom

I wanted to prioritize keeping the core chat lightweight over maintaining this UI consistency. I just pushed an update (

Really enjoying both of your ideas about cost estimation and baselines. This is definitely something I'll take a deeper look into, and I'm excited to brainstorm with you both.
---
This thread is turning into a proper design review and I'm here for it 😄

On the timestamp fix — @anthonychen000 good catch with

On

On V8 heap monitoring — @aishop-lab your finding about

On baseline persistence — Where does baseline data live? If

On cost estimation — Per-request granularity with session-level rollups. Per-request lets the agent optimize mid-session ("switch to Flash for this query"); session-level gives the "$0.47 today" dashboard. Both layers, negligible overhead.

Really enjoying seeing everyone's approaches converge 🚀
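To make the two-layer cost idea concrete, a rough sketch of per-request estimates rolling up into a session total. The prices here are illustrative placeholders, not real Gemini pricing, and the names are hypothetical.

```typescript
interface RequestCost {
  model: string;
  inputTokens: number;
  outputTokens: number;
  usd: number;
}

// USD per one million tokens. Placeholder numbers for the sketch only;
// real values would be looked up from current pricing.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "gemini-2.5-pro": { input: 1.25, output: 10.0 },
  "gemini-2.5-flash": { input: 0.15, output: 0.6 },
};

// Per-request layer: lets the agent reason about cost mid-session.
function estimateRequestCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): RequestCost {
  const price = PRICE_PER_MTOK[model];
  if (!price) throw new Error(`unknown model: ${model}`);
  const usd =
    (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
  return { model, inputTokens, outputTokens, usd };
}

// Session layer: the "$0.47 today" rollup for the dashboard summary.
function sessionTotalUsd(requests: RequestCost[]): number {
  return requests.reduce((sum, r) => sum + r.usd, 0);
}
```

Both layers are pure arithmetic over token counts the telemetry already records, so the overhead really is negligible.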
---
From my point of view, extending the existing telemetry foundation is the right instinct, but the most valuable first version of this project is probably a small, trustworthy summary rather than a very dense dashboard. Startup timing, model latency percentiles, memory trend, and a short optimization summary already feel like a strong first cut. If those signals are reliable, the richer htop-style presentation can follow naturally.
---
@aishop-lab Good to know the

@aniruddhaadak80 Agreed — trustworthy and minimal beats flashy and noisy every time. Our prototype in PR #21262 follows exactly that philosophy: startup phases, P50/P90/P99 latency, V8 heap utilization against the actual limit, and a short list of auto-generated optimization suggestions. No fancy UI, just reliable numbers. If the data is solid, the visualization is the easy part.

One thing I'd add to the "V1 shortlist": cost visibility. Users running 2.5 Pro sessions have zero insight into token spend right now. Even a simple "this session used ~$0.35" in the summary would be a huge quality-of-life win — and it's cheap to compute (just multiply token counts by the known price-per-million).
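For the P50/P90/P99 side, the math is intentionally boring; a nearest-rank helper like this is all the summary needs. This is a simplified stand-in, not the prototype's actual code.

```typescript
// Nearest-rank percentile over a sorted copy of the samples.
// Simple and good enough for a dashboard summary.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function latencySummary(samples: number[]): {
  p50: number;
  p90: number;
  p99: number;
} {
  return {
    p50: percentile(samples, 50),
    p90: percentile(samples, 90),
    p99: percentile(samples, 99),
  };
}
```

Nearest-rank avoids interpolation edge cases and always returns a latency that actually occurred, which keeps the summary honest.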
---
Update — I've wired the prototype into the real telemetry pipeline. The
Also found and fixed a cache hit rate bug while wiring: the original formula used

The
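As a general reference point, a hit-rate computation with the usual edge cases guarded looks roughly like this. Field names are hypothetical, not the prototype's actual ones.

```typescript
// Hypothetical counter shape for the sketch.
interface CacheCounters {
  hits: number;
  misses: number;
}

// Hit rate = hits / (hits + misses), with the zero-lookup case guarded
// so a fresh session reports 0 instead of NaN.
function cacheHitRate({ hits, misses }: CacheCounters): number {
  const lookups = hits + misses;
  if (lookups === 0) return 0;
  return hits / lookups;
}
```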
---
Hi! I'm Givarrel Veivel, a software engineer from Indonesia. I'll introduce myself concisely:
I'm very new to open source & GSoC, so I'm honestly unsure of the big-picture flow & next steps. I have been tinkering locally, though: from what I understand (correct me if I'm wrong), a good chunk of the heavy lifting of instrumentation has been done with OpenTelemetry (`packages/core/src/telemetry/`), while the token/session statistics with the `/stats` command have already been implemented. My task will tie together the existing instrumentation with the CLI rendering for `/perf`.

My questions at the moment:

- For `/perf`, I'm thinking of a layout like Glances or htop, but is there any particular expected UI?

It goes without saying, but I'm looking forward to your feedback and input. Thank you very much for your time and attention, excited to learn :)
cc @bdmorgan @sehoon38