feat: Codebase Intelligence — repo map with PageRank (queries bundled in dist)#966
gnanam1990 wants to merge 1 commit into main
…tural summaries
Adds a new module that builds a structural map of the repository by parsing
source files with tree-sitter, building a cross-file reference graph weighted
by IDF, ranking files with PageRank, and rendering a token-budgeted summary
of the most important files and their signatures.
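To illustrate the "token-budgeted summary" step, here is a minimal sketch, not the module's actual code. The `RankedFile` type, `estimateTokens`, and `renderWithinBudget` are hypothetical names, and the 4-characters-per-token estimate is a stand-in for a real tokenizer such as js-tiktoken.

```typescript
// Hypothetical shape for a ranked file; not taken from the PR.
type RankedFile = { path: string; signatures: string[]; rank: number };

// Assumption: a crude estimate of roughly 4 characters per token,
// used here in place of a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Emit files in descending rank order and stop at the first file
// that would push the output past the token budget.
function renderWithinBudget(files: RankedFile[], maxTokens: number): string {
  const sorted = [...files].sort((a, b) => b.rank - a.rank);
  const sections: string[] = [];
  let used = 0;
  for (const file of sorted) {
    const section = [file.path + ":", ...file.signatures.map((s) => "  " + s)].join("\n");
    const cost = estimateTokens(section + "\n");
    if (used + cost > maxTokens) break;
    sections.push(section);
    used += cost;
  }
  return sections.join("\n");
}
```

Stopping at the first file that does not fit keeps the output strictly rank-ordered; a variant could skip that file and keep trying smaller ones to fill the budget more tightly.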
Surface:
- RepoMap tool the model can call on-demand, with focus_files / focus_symbols
- /repomap slash command with --tokens, --focus, --stats, --invalidate
- Auto-injection into session system context, gated by REPO_MAP=1 env var
(compile-time feature('REPO_MAP') flag stays off in scripts/build.ts)
How it works:
git ls-files → tree-sitter WASM parse → extract defs/refs →
IDF-weighted directed graph → PageRank → render top files until token budget
Files imported by many others rank highest. Common symbol names (get, set,
map, value) are down-weighted via IDF. Results cached to disk keyed by
(path, mtime, size) — only changed files are re-parsed.
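The ranking idea can be sketched without any dependencies. This is a rough illustration under stated assumptions, not the PR's code (which uses graphology): the `Edge` type, the `idf` formula, and the `pageRank` helper are all hypothetical. Edges point from the referencing file to the defining file, so heavily imported files accumulate rank.

```typescript
// Hypothetical edge type: `from` references a symbol defined in `to`.
type Edge = { from: string; to: string; weight: number };

// Assumed IDF form: symbols that appear in many files get a smaller weight.
function idf(totalFiles: number, filesWithSymbol: number): number {
  return Math.log(1 + totalFiles / filesWithSymbol);
}

// Weighted PageRank by power iteration.
function pageRank(
  nodes: string[],
  edges: Edge[],
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  const n = nodes.length;
  // Total outgoing weight per node, used to normalize each edge's share.
  const outWeight = new Map<string, number>();
  for (const e of edges) outWeight.set(e.from, (outWeight.get(e.from) ?? 0) + e.weight);

  let rank = new Map(nodes.map((v) => [v, 1 / n] as [string, number]));
  for (let i = 0; i < iterations; i++) {
    // Rank held by nodes with no outgoing edges is spread evenly over all nodes.
    let dangling = 0;
    for (const v of nodes) if (!outWeight.has(v)) dangling += rank.get(v)!;
    const base = (1 - damping) / n + (damping * dangling) / n;
    const next = new Map(nodes.map((v) => [v, base] as [string, number]));
    for (const e of edges) {
      const share = (rank.get(e.from)! * e.weight) / outWeight.get(e.from)!;
      next.set(e.to, next.get(e.to)! + damping * share);
    }
    rank = next;
  }
  return rank;
}
```

With three files that each reference one shared `util.ts`, `util.ts` ends up with the highest score, which matches the "files imported by many others rank highest" behavior described above.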
Supported languages: TypeScript, JavaScript, Python.
Tree-sitter tag queries are inlined as string constants in queries.ts so
they ship inside dist/cli.mjs and work after npm install — the .scm source
files are kept for readability/Aider attribution but are not required at
runtime. A drift-guard test (queries.test.ts) asserts byte-equality between
the inlined strings and the .scm source files.
Dependencies added: web-tree-sitter, tree-sitter-wasms, graphology,
graphology-pagerank, graphology-operators, js-tiktoken.
Vasanthdev2004
left a comment
Thanks for reopening this cleanly after #543. The packaging fix direction is good: inlining the .scm queries into queries.ts does address the npm tarball/runtime asset issue I previously blocked on. I did a current-head review at d469b76 and found two blockers before this should merge.
Verdict: Needs changes
Blocking issues:
- The rendered repo-map cache can return stale maps after file edits.
`computeMapHash()` only includes the file list, token budget, and focus files, and `buildRepoMap()` checks the rendered `__rendered__${mapHash}` entry before validating per-file mtimes/sizes. That means if a source file changes but the file list stays the same, `/repomap` can keep returning the previous rendered map indefinitely, until a manual `--invalidate`.
Minimal repro on current head:
bun --eval "import { mkdtempSync, writeFileSync, rmSync } from 'fs'; import { tmpdir } from 'os'; import { join } from 'path'; import { buildRepoMap, invalidateCache } from './src/context/repoMap/index.ts'; const root = mkdtempSync(join(tmpdir(), 'repomap-stale-')); try { writeFileSync(join(root, 'main.ts'), 'export function oldName(): void {}\n'); invalidateCache(root); const first = await buildRepoMap({ root, maxTokens: 1024 }); writeFileSync(join(root, 'main.ts'), 'export function newName(): void {}\n'); const second = await buildRepoMap({ root, maxTokens: 1024 }); console.log(JSON.stringify({ firstCacheHit: first.cacheHit, secondCacheHit: second.cacheHit, secondHasOld: second.map.includes('oldName'), secondHasNew: second.map.includes('newName') }, null, 2)); } finally { invalidateCache(root); rmSync(root, { recursive: true, force: true }); }"
Current output:
{
"firstCacheHit": false,
"secondCacheHit": true,
"secondHasOld": true,
"secondHasNew": false
}
The rendered-cache key needs to include a source fingerprint/metadata fingerprint, or the rendered cache should be validated after per-file cache checks rather than before them. Please add a regression test that edits a file and confirms the second map reflects the new symbol without requiring manual invalidation.
- `src/context/repoMap/queries.test.ts` fails on Windows because the byte-for-byte drift guard is line-ending sensitive. On my Windows checkout, the `.scm` files are read with CRLF while the inlined constants are LF, so all three language drift checks fail even though the visible content is the same.
Local result:
bun test src/context/repoMap/queries.test.ts
# 1 pass / 3 fail
Please normalize line endings in the test before comparison, or enforce LF for the .scm query files via .gitattributes. Since OpenClaude has active Windows users, the drift guard should pass on Windows checkouts too.
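For the second blocker, the normalization could look roughly like the sketch below. The helper name `normalizeLineEndings` and the sample query text are illustrative assumptions, not the contents of `queries.test.ts` or of the real `.scm` files.

```typescript
// Hypothetical helper: convert CRLF to LF so the comparison ignores
// how the checkout stored line endings.
function normalizeLineEndings(text: string): string {
  return text.replace(/\r\n/g, "\n");
}

// Made-up sample standing in for an inlined query constant (LF)
// and the same content read from a Windows checkout (CRLF).
const inlinedQuery = "(identifier) @name\n(comment) @doc\n";
const scmFromDisk = "(identifier) @name\r\n(comment) @doc\r\n";

console.log(inlinedQuery === scmFromDisk); // false
console.log(inlinedQuery === normalizeLineEndings(scmFromDisk)); // true
```

The alternative mentioned above would be a `.gitattributes` rule along the lines of `*.scm text eol=lf`, which keeps the strict byte-for-byte comparison but forces LF on every checkout.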
What I checked:
- Current head d469b76
- `parser.ts` / `queries.ts` / `queries.test.ts` packaging fix
- `buildRepoMap()` cache path and rendered-cache keying
- `/repomap` command and `RepoMapTool` surfaces
- `bun test src/context.repoMap.test.ts` passed 4/4 isolated
- `bun test src/context/repoMap/queries.test.ts` failed 3/4 on Windows as described
Happy to re-review once those two are fixed. The overall feature shape still looks useful; these are correctness/test-portability issues rather than objections to the direction.
…est line endings
- Update computeMapHash to include file mtime and size in the hash key. This ensures that editing a file invalidates the rendered repo-map cache even if the file list remains the same.
- Normalize line endings (\r\n -> \n) in queries.test.ts before comparison to ensure drift guards pass on Windows checkouts.
Addresses reviewer blockers for PR Gitlawb#966.
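A rendered-cache key that folds in per-file metadata might look like the following sketch. It is an assumption-laden illustration, not the actual `computeMapHash`: the `FileMeta` type, the `computeRenderedKey` name, and the choice of SHA-256 are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical per-file metadata, e.g. gathered from fs.statSync.
type FileMeta = { path: string; mtimeMs: number; size: number };

// Build a key from the token budget, the focus files, and every file's
// (path, mtime, size), so an edit changes the key even when the file
// list itself is unchanged.
function computeRenderedKey(
  files: FileMeta[],
  maxTokens: number,
  focusFiles: string[],
): string {
  const hash = createHash("sha256");
  hash.update(`tokens:${maxTokens}\n`);
  for (const f of [...focusFiles].sort()) hash.update(`focus:${f}\n`);
  // Sort by path so the key does not depend on listing order.
  const sorted = [...files].sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
  for (const f of sorted) hash.update(`file:${f.path}:${f.mtimeMs}:${f.size}\n`);
  return hash.digest("hex");
}
```

In the reviewer's repro, rewriting `main.ts` would change its mtime, so the second `buildRepoMap()` call would compute a different key and miss the stale rendered entry.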
Hello, I just helped you there: #989. Best regards,
Summary
Re-opens the repo-map feature from #543 with the npm-package shipping fix that surfaced in @Vasanthdev2004's last review.
What changed vs #543
The blocker on #543 was that `src/context/repoMap/parser.ts` read tree-sitter tag queries via `readFileSync('./queries/*-tags.scm')` at runtime, but `package.json`'s `files` allowlist only ships `bin/`, `dist/cli.mjs`, and `README.md`. `npm pack --dry-run` confirmed the `.scm` files were missing from the tarball, so symbol extraction would silently return empty results after `npm install -g @gitlawb/openclaude`.
Fix: the queries are now inlined as string constants in `src/context/repoMap/queries.ts`, and `loadQuery()` reads from those constants instead of the filesystem. The `.scm` files remain in the repo as the canonical source of truth (preserving the Aider MIT attribution and keeping them readable as standalone tree-sitter queries), and a drift-guard test (`queries.test.ts`) asserts byte-for-byte equality between the inlined strings and the `.scm` source files. If anyone edits a `.scm` file and forgets to mirror the change, that test fails.
Verified the queries now ship inside the bundle (see the `grep -c` check in the test plan).
No `.scm` files are required at runtime. The `readFileSync`/`existsSync` imports and the `getQueryPath()` helper are removed from `parser.ts`.
Why a new PR instead of pushing to #543
#543's branch carried a stale-merge concern in an earlier review and the iteration history was getting hard to follow. Cleaner to land this as a fresh branch off current `main` with a single squashed commit. Closing #543 in favor of this once it's reviewed.
Surface (unchanged from #543)
- `src/context/repoMap/` (13 files incl. `queries.ts`)
- `queries/*.scm` (canonical) + `queries.ts` (inlined)
- `__fixtures__/mini-repo/` (5 files)
- `src/tools/RepoMapTool/` (4 files)
- `src/tools.ts`
- `src/commands/repomap/` (3 files): `/repomap`, `--tokens`, `--focus`, `--stats`, `--invalidate`
- `src/context.ts`: `getRepoMapContext()` memoized; gated by `feature('REPO_MAP')` OR `process.env.REPO_MAP` truthy
- `scripts/build.ts`: `REPO_MAP: false`, compile-time off; users opt in with `REPO_MAP=1 openclaude`
- `docs/repo-map.md`, `README.md`
How it works
Files imported by many others rank highest. Common symbol names (`get`, `set`, `map`, `value`) are down-weighted via IDF. Results are cached to disk keyed by `(path, mtime, size)`, so only changed files are re-parsed.
Configuration
Dependencies added
`web-tree-sitter`, `tree-sitter-wasms`, `graphology`, `graphology-pagerank`, `graphology-operators`, `js-tiktoken` (~80MB in node_modules; only `dist/cli.mjs` ships).
Test plan
- `bun install`: clean
- `bun test src/context/repoMap src/tools/RepoMapTool src/commands/repomap src/context.repoMap.test.ts`: 36 pass / 0 fail
- `bun test src/context/repoMap/queries.test.ts`: 4 pass (drift guard verifies inlined strings match `.scm` files byte-for-byte)
- `bun test` (full suite): 1749 pass / 2 fail; the 2 failures (`detectProvider — modelOverride from --model flag`) reproduce on `main` and are unrelated to this PR
- `bun run build`: success
- `npm pack --dry-run`: confirmed only `dist/cli.mjs` ships, no `.scm` files needed
- `grep -c 'function_signature\|class_definition\|generator_function' dist/cli.mjs` returns 4
- `/repomap`, `--tokens`, `--focus`, `--stats`, `--invalidate`
Supported languages
TypeScript, JavaScript, Python. Additional grammars in a follow-up.
Known limitations
`feature('REPO_MAP')` defaults to off; users opt in via `REPO_MAP=1`.
Closes
Supersedes #543 (will close that one once this is reviewed).