Feat/repo map#989
Conversation
Vasanthdev2004
left a comment
There was a problem hiding this comment.
Full review on current head 5a98f25.
Verdict: Needs changes
Blocking issue:
- The repo-map focused test suite fails on Windows because fallback directory walking preserves platform-native backslashes in rendered paths. In
src/context/repoMap/repoMap.test.ts, the nested directory test expectssrc/utils/helper.ts, but on Windows the rendered map containssrc\\utils\\helper.ts. This comes from the non-git fallback path inwalkDirectory()usingpath.relative()without normalizing to POSIX-style repo paths. Since this PR explicitly calls out Windows test normalization/portability, the repo-map path representation should be normalized consistently before rendering/caching/focus matching.
Local verification:
bun test src/context/repoMap/repoMap.test.ts src/context/repoMap/queries.test.ts src/tools/RepoMapTool/RepoMapTool.test.ts src/commands/repomap/repomap.test.ts src/context.repoMap.test.tsfails with the nested-directory path assertion above.bun run buildpasses.
Suggested fix: normalize relative repo file paths to / separators at the source, likely in walkDirectory() / getRepoFiles() or a shared repo-map path normalizer, so git-backed and fallback enumeration produce the same path format across platforms.
Happy to re-review once that is fixed.
3fb7b2e to
b749351
Compare
…tural summaries
Adds a new module that builds a structural map of the repository by parsing
source files with tree-sitter, building a cross-file reference graph weighted
by IDF, ranking files with PageRank, and rendering a token-budgeted summary
of the most important files and their signatures.
Surface:
- RepoMap tool the model can call on-demand, with focus_files / focus_symbols
- /repomap slash command with --tokens, --focus, --stats, --invalidate
- Auto-injection into session system context, gated by REPO_MAP=1 env var
(compile-time feature('REPO_MAP') flag stays off in scripts/build.ts)
How it works:
git ls-files → tree-sitter WASM parse → extract defs/refs →
IDF-weighted directed graph → PageRank → render top files until token budget
Files imported by many others rank highest. Common symbol names (get, set,
map, value) are down-weighted via IDF. Results cached to disk keyed by
(path, mtime, size) — only changed files are re-parsed.
Supported languages: TypeScript, JavaScript, Python.
Tree-sitter tag queries are inlined as string constants in queries.ts so
they ship inside dist/cli.mjs and work after npm install — the .scm source
files are kept for readability/Aider attribution but are not required at
runtime. A drift-guard test (queries.test.ts) asserts byte-equality between
the inlined strings and the .scm source files.
Dependencies added: web-tree-sitter, tree-sitter-wasms, graphology,
graphology-pagerank, graphology-operators, js-tiktoken.
- Memory Management: Fix WASM memory leaks by explicitly deleting Tree-sitter Tree and Query objects. - Performance: Eliminate redundant IO by consolidating file stating and reusing metadata across the build loop. - Scalability: Refactor build loop to share a single Tree-sitter parser instance and implement an iterative directory walk to prevent stack overflow. - Algorithm: Transition to true Personalized PageRank (PPR) for superior importance ranking of focus files and their context. - Portability: Normalize repository paths to POSIX-style forward slashes consistently across all platforms. - Reliability: Added comprehensive tests for edge cases (binary/empty files, nested paths, large repo batching).
|
Hello @Vasanthdev2004 Just took into account your remarks : Technical Update: |
gnanam1990
left a comment
There was a problem hiding this comment.
Looked at the new head (commits through 5a98f25 and the most recent "technical perfection audit" pushes).
Major coordination issue
This PR ships the same 37-file set as the existing open PR #966 (feat: Codebase Intelligence — repo map with PageRank), authored separately. Both PRs add:
src/context/repoMap/{cache,gitFiles,graph,pagerank,parser,renderer,symbolExtractor,tokenize,types}.tssrc/context/repoMap/queries/{javascript,python,typescript}-tags.scmsrc/commands/repomap/*src/tools/RepoMapTool/*docs/repo-map.md
We can't merge two parallel implementations of the same feature. Before this PR can move forward, we need a maintainer call on which implementation lands. @kevincodex1 — could you weigh in on the consolidation direction? Options:
- Land #966 first, close #989, and 3kin0x rebases differentiating improvements onto the merged result as follow-up PRs.
- Land #989, close #966, and the same in reverse.
- Co-author / merge the two branches into a single PR before review.
@3kin0x — if you weren't aware of #966, no harm done; this is purely a coordination thing. Please hold further pushes here until we resolve which PR is the canonical one.
Code-side review (independent of the duplicate-PR issue)
What's improved since Vasanthdev2004's review
The Windows-path issue Vasanthdev2004 flagged is addressed — walkDirectory() now applies replace(/\\/g, '/') after path.relative() (lines ~997, ~1017 of the diff), so rendered paths normalize to POSIX form on Windows. Good fix.
Still blocking on the code itself (regardless of which PR wins)
-
6 new top-level dependencies:
graphology,graphology-operators,graphology-pagerank,js-tiktoken,tree-sitter-wasms,web-tree-sitter. The PR body talks about what they enable but doesn't address bundle-size / startup-time impact, Windows WASM-loading reliability, or feature-flag-gating the import. At minimum, all repo-map work should be lazy-imported (the diff does have// Lazy import to avoid loading tree-sitter at startupat line 388 — good — but verify nothing on the hot path triggers it). -
js-tiktokenfor token budgeting in a provider-agnostic CLI is a misfit — it tokenizes for OpenAI/Anthropic encodings, not Gemini, Ollama, MiniMax, etc. The token budget will be wrong by 10-30% on those providers. Worth either dropping the precise budgeting (use a char-count heuristic) or scoping it to providers where the tokenizer matches. -
PR description language: "technical perfection audit", "true Personalized PageRank", "drastically reducing noise" — same hyperbolic pattern as #997. Please rewrite descriptions in concrete failing-path / fix / risk terms so reviewers can evaluate efficiently.
-
Scope (+2666 / 37 files) is past the §4 reviewability threshold. Even if the feature is correct, splitting helps:
- (a) repo-walk + symbol extraction (parser + tree-sitter wiring)
- (b) graph + PageRank
- (c) RepoMapTool surface + UI
- (d) docs
Red flags that are clean ✅
- No
tengu_*, noUSER_TYPE === 'ant' - No new
fetch/ network call (graph + parsing is fully local — good) - No provider routing changes
- No CI/workflow changes
Holding off on a formal verdict until we resolve the #966 vs #989 coordination question. Once that's decided, happy to do a full review on whichever survives.
Vasanthdev2004
left a comment
There was a problem hiding this comment.
Targeted re-review on current head ee18ff75207c9077a797a512b8985b50b775c77b.
Verdict: Blocked on maintainer decision
What I rechecked:
- The Windows path-normalization blocker I previously raised is fixed.
gitFiles.tsnow normalizespath.relative()/ git file paths withreplace(/\\/g, '/'), so repo-map output uses POSIX-style paths on Windows too. - Ran the focused repo-map suite locally on Windows:
bun test src/context/repoMap/repoMap.test.ts src/context/repoMap/queries.test.ts src/tools/RepoMapTool/RepoMapTool.test.ts src/commands/repomap/repomap.test.ts src/context.repoMap.test.ts- Result: 40/40 passed.
Remaining blocker:
- This PR still overlaps almost completely with #966. I do not think we should approve or merge this until maintainers decide which repo-map PR is canonical, or the two branches are consolidated into one reviewable path.
So from my side: the earlier Windows-specific blocker is resolved, but merge-readiness is still blocked on the #966 vs #989 coordination decision.
|
Ok thank you I close it. |
Summary
Impact
Testing
Notes