Skip to content

Feat/repo map#989

Closed
3kin0x wants to merge 2 commits intoGitlawb:mainfrom
3kin0x:feat/repo-map
Closed

Feat/repo map#989
3kin0x wants to merge 2 commits intoGitlawb:mainfrom
3kin0x:feat/repo-map

Conversation

@3kin0x
Copy link
Copy Markdown
Contributor

@3kin0x 3kin0x commented May 2, 2026

Summary

  • Rendered Cache Invalidation: Updated the hash function to include file metadata (mtime and size) in the key. This ensures the repository map is automatically rebuilt as soon as a source file is modified.
  • Windows Test Normalization: Added automatic line-ending conversion to LF in tests to prevent failures caused by Git's autocrlf settings on Windows.
  • Expanded Test Coverage: Added new tests for focus-file prioritization, nested directory structures, large repository batching (100+ files), and edge cases like binary or empty source files.
  • PR Refactor: This contribution directly addresses the blocking issues identified in PR feat: Codebase Intelligence — repo map with PageRank (queries bundled in dist) #966 based on the review of commit d469b76.
  • The rendered cache remained stale after file edits because the hash only tracked filenames and token budgets.
  • Windows contributors were experiencing test failures due to byte-for-byte comparison of .scm query files.

Impact

  • user-facing impact: Users now see an up-to-date repository map immediately after editing their code, with no manual action required.
  • developer/maintainer impact: CI and local tests are now portable and pass across all operating systems (Windows/Linux/macOS), resolving technical debt from commit d469b76.

Testing

  • bun run build
  • bun run smoke
  • focused tests:
    • src/context/repoMap/repoMap.test.ts (All tests passed)
    • src/context/repoMap/queries.test.ts (All tests passed)
    • Reviewer's stale-cache reproduction script (Verified: Cache correctly invalidates on edit)

Notes

@3kin0x
Copy link
Copy Markdown
Contributor Author

3kin0x commented May 2, 2026

Copy link
Copy Markdown
Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full review on current head 5a98f25.

Verdict: Needs changes

Blocking issue:

  1. The repo-map focused test suite fails on Windows because fallback directory walking preserves platform-native backslashes in rendered paths. In src/context/repoMap/repoMap.test.ts, the nested directory test expects src/utils/helper.ts, but on Windows the rendered map contains src\\utils\\helper.ts. This comes from the non-git fallback path in walkDirectory() using path.relative() without normalizing to POSIX-style repo paths. Since this PR explicitly calls out Windows test normalization/portability, the repo-map path representation should be normalized consistently before rendering/caching/focus matching.

Local verification:

  • bun test src/context/repoMap/repoMap.test.ts src/context/repoMap/queries.test.ts src/tools/RepoMapTool/RepoMapTool.test.ts src/commands/repomap/repomap.test.ts src/context.repoMap.test.ts fails with the nested-directory path assertion above.
  • bun run build passes.

Suggested fix: normalize relative repo file paths to / separators at the source, likely in walkDirectory() / getRepoFiles() or a shared repo-map path normalizer, so git-backed and fallback enumeration produce the same path format across platforms.

Happy to re-review once that is fixed.

@3kin0x 3kin0x force-pushed the feat/repo-map branch 2 times, most recently from 3fb7b2e to b749351 Compare May 3, 2026 18:18
gnanam1990 and others added 2 commits May 3, 2026 21:32
…tural summaries

Adds a new module that builds a structural map of the repository by parsing
source files with tree-sitter, building a cross-file reference graph weighted
by IDF, ranking files with PageRank, and rendering a token-budgeted summary
of the most important files and their signatures.

Surface:
- RepoMap tool the model can call on-demand, with focus_files / focus_symbols
- /repomap slash command with --tokens, --focus, --stats, --invalidate
- Auto-injection into session system context, gated by REPO_MAP=1 env var
  (compile-time feature('REPO_MAP') flag stays off in scripts/build.ts)

How it works:
  git ls-files → tree-sitter WASM parse → extract defs/refs →
  IDF-weighted directed graph → PageRank → render top files until token budget

Files imported by many others rank highest. Common symbol names (get, set,
map, value) are down-weighted via IDF. Results cached to disk keyed by
(path, mtime, size) — only changed files are re-parsed.

Supported languages: TypeScript, JavaScript, Python.

Tree-sitter tag queries are inlined as string constants in queries.ts so
they ship inside dist/cli.mjs and work after npm install — the .scm source
files are kept for readability/Aider attribution but are not required at
runtime. A drift-guard test (queries.test.ts) asserts byte-equality between
the inlined strings and the .scm source files.

Dependencies added: web-tree-sitter, tree-sitter-wasms, graphology,
graphology-pagerank, graphology-operators, js-tiktoken.
- Memory Management: Fix WASM memory leaks by explicitly deleting Tree-sitter Tree and Query objects.
- Performance: Eliminate redundant IO by consolidating file stating and reusing metadata across the build loop.
- Scalability: Refactor build loop to share a single Tree-sitter parser instance and implement an iterative directory walk to prevent stack overflow.
- Algorithm: Transition to true Personalized PageRank (PPR) for superior importance ranking of focus files and their context.
- Portability: Normalize repository paths to POSIX-style forward slashes consistently across all platforms.
- Reliability: Added comprehensive tests for edge cases (binary/empty files, nested paths, large repo batching).
@3kin0x 3kin0x requested a review from Vasanthdev2004 May 3, 2026 19:38
@3kin0x
Copy link
Copy Markdown
Contributor Author

3kin0x commented May 3, 2026

Hello @Vasanthdev2004

Just took into account your remarks :

Technical Update:
- Transitioned to true Personalized PageRank (PPR). Importance now flows naturally from focus files to their dependencies, providing a much higher signal-to-noise ratio in the repository map.
- Improve robustness
- Systematically fixed potential WASM memory leaks by explicitly deleting Tree-sitter Tree and Query objects.
- Refactored the build loop to reuse a single parser instance and implemented an iterative stack-based directory walk (Stack-safe for extremely deep repos).
- Consolidated file metadata collection. Every file in the repo is now stated exactly once per build operation, reaching the theoretical minimum IO overhead.
- Normalized all repository paths to POSIX-style forward slashes at the source.
- Expanded the test suite to cover 18+ scenarios (Binary files, empty files, nested paths, large-repo batching, and focus-file prioritization).

Copy link
Copy Markdown
Collaborator

@gnanam1990 gnanam1990 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at the new head (commits through 5a98f25 and the most recent "technical perfection audit" pushes).

Major coordination issue

This PR ships the same 37-file set as the existing open PR #966 (feat: Codebase Intelligence — repo map with PageRank), authored separately. Both PRs add:

  • src/context/repoMap/{cache,gitFiles,graph,pagerank,parser,renderer,symbolExtractor,tokenize,types}.ts
  • src/context/repoMap/queries/{javascript,python,typescript}-tags.scm
  • src/commands/repomap/*
  • src/tools/RepoMapTool/*
  • docs/repo-map.md

We can't merge two parallel implementations of the same feature. Before this PR can move forward, we need a maintainer call on which implementation lands. @kevincodex1 — could you weigh in on the consolidation direction? Options:

  1. Land #966 first, close #989, and 3kin0x rebases differentiating improvements onto the merged result as follow-up PRs.
  2. Land #989, close #966, and the same in reverse.
  3. Co-author / merge the two branches into a single PR before review.

@3kin0x — if you weren't aware of #966, no harm done; this is purely a coordination thing. Please hold further pushes here until we resolve which PR is the canonical one.

Code-side review (independent of the duplicate-PR issue)

What's improved since Vasanthdev2004's review

The Windows-path issue Vasanthdev2004 flagged is addressed — walkDirectory() now applies replace(/\\/g, '/') after path.relative() (lines ~997, ~1017 of the diff), so rendered paths normalize to POSIX form on Windows. Good fix.

Still blocking on the code itself (regardless of which PR wins)

  1. 6 new top-level dependencies: graphology, graphology-operators, graphology-pagerank, js-tiktoken, tree-sitter-wasms, web-tree-sitter. The PR body talks about what they enable but doesn't address bundle-size / startup-time impact, Windows WASM-loading reliability, or feature-flag-gating the import. At minimum, all repo-map work should be lazy-imported (the diff does have // Lazy import to avoid loading tree-sitter at startup at line 388 — good — but verify nothing on the hot path triggers it).

  2. js-tiktoken for token budgeting in a provider-agnostic CLI is a misfit — it tokenizes for OpenAI/Anthropic encodings, not Gemini, Ollama, MiniMax, etc. The token budget will be wrong by 10-30% on those providers. Worth either dropping the precise budgeting (use a char-count heuristic) or scoping it to providers where the tokenizer matches.

  3. PR description language: "technical perfection audit", "true Personalized PageRank", "drastically reducing noise" — same hyperbolic pattern as #997. Please rewrite descriptions in concrete failing-path / fix / risk terms so reviewers can evaluate efficiently.

  4. Scope (+2666 / 37 files) is past the §4 reviewability threshold. Even if the feature is correct, splitting helps:

    • (a) repo-walk + symbol extraction (parser + tree-sitter wiring)
    • (b) graph + PageRank
    • (c) RepoMapTool surface + UI
    • (d) docs

Red flags that are clean

  • No tengu_*, no USER_TYPE === 'ant'
  • No new fetch / network call (graph + parsing is fully local — good)
  • No provider routing changes
  • No CI/workflow changes

Holding off on a formal verdict until we resolve the #966 vs #989 coordination question. Once that's decided, happy to do a full review on whichever survives.

Copy link
Copy Markdown
Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targeted re-review on current head ee18ff75207c9077a797a512b8985b50b775c77b.

Verdict: Blocked on maintainer decision

What I rechecked:

  • The Windows path-normalization blocker I previously raised is fixed. gitFiles.ts now normalizes path.relative() / git file paths with replace(/\\/g, '/'), so repo-map output uses POSIX-style paths on Windows too.
  • Ran the focused repo-map suite locally on Windows:
    • bun test src/context/repoMap/repoMap.test.ts src/context/repoMap/queries.test.ts src/tools/RepoMapTool/RepoMapTool.test.ts src/commands/repomap/repomap.test.ts src/context.repoMap.test.ts
    • Result: 40/40 passed.

Remaining blocker:

  • This PR still overlaps almost completely with #966. I do not think we should approve or merge this until maintainers decide which repo-map PR is canonical, or the two branches are consolidated into one reviewable path.

So from my side: the earlier Windows-specific blocker is resolved, but merge-readiness is still blocked on the #966 vs #989 coordination decision.

@3kin0x
Copy link
Copy Markdown
Contributor Author

3kin0x commented May 4, 2026

Ok thank you I close it.

@3kin0x 3kin0x closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants