fix(merge-tracker): prevent false-positive duplicates from fuzzy match and num collisions#538
Conversation
…h and num collisions Two related issues caused valid TSV additions to be silently dropped or to overwrite unrelated existing entries: 1. roleFuzzyMatch ignored seniority and specialty modifiers. "Junior Electrical QA Engineer" was treated as a duplicate of "Electrical Engineer" because they shared two long words. With score N/A < existing score, the new entry was silently skipped. Fix: gate fuzzy match on seniority (junior/senior/lead/principal/ staff/head/chief) and specialty (qa/qc/quality/test/assurance/safety/ security) modifier agreement before checking word overlap. Add a reference-ID short-circuit (#148182, job=148031, req. 58035, ref ABC-123-S) so entries with distinct portal IDs cannot be considered the same role regardless of title similarity. 2. exact-num match treated bare number collisions as duplicates. When a TSV's num matched an existing entry of a different company/role, the merge silently used that as duplicate target. With higher existing score, the new TSV was dropped. Fix: only treat exact num match as duplicate when company AND role also fuzzy-match. Bare num collisions fall through to the "new entry" path, which auto-assigns the next available number. Both fixes verified against the real cases that triggered them: - Junior Electrical QA Engineer not flagged as dup of Electrical Engineer - Same num + different company falls through to next-num path - Same num + same company + matching role still treated as legit re-eval
|
Welcome to career-ops, @Mutykov! Thanks for your first PR. A few things to know:
We'll review your PR soon. Join our Discord if you have questions. |
📝 WalkthroughWalkthroughRole deduplication logic in merge-tracker.mjs was enhanced with stricter matching rules. The Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@merge-tracker.mjs`:
- Around line 78-84: findModifier and SPECIALTY_MODIFIERS currently return the
first literal token (e.g., "qa" vs "quality") causing semantically identical
specialties to be treated as different; update SPECIALTY_MODIFIERS into a
synonyms map (or add a separate CANONICAL_SPECIALTY map) that maps variants like
"qa", "qc", "quality", "test", "assurance" -> a single canonical key (e.g.,
"quality_assurance"), then change findModifier to lowercase the text, search for
any synonym via the map, and return the canonical value (not the raw matched
token); update any downstream comparisons (where findModifier is used) to
compare canonical specialty keys.
- Around line 298-308: The exact-num duplicate check (the existingApps.find
callback that sets duplicate) needs a distinct-ref short-circuit: inside the
predicate (where it compares app.num, normalizeCompany(app.company) vs
normalizeCompany(addition.company), and roleFuzzyMatch(addition.role, app.role))
return false early if both records have different refs (e.g. addition.ref and
app.ref exist and addition.ref !== app.ref) so that rows with the same
num/company/role but different portal refs are not treated as duplicates; add
that check before calling roleFuzzyMatch to mirror the ref-aware branch further
down.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c6e8c84b-ebfb-4cbf-849b-4f77505afb17
📒 Files selected for processing (1)
merge-tracker.mjs
| // Specialty modifiers — same logic for role specialty (QA / quality / test / safety / security). | ||
| const SPECIALTY_MODIFIERS = ['qa', 'qc', 'quality', 'test', 'assurance', 'safety', 'security']; | ||
|
|
||
| function findModifier(text, list) { | ||
| const low = text.toLowerCase(); | ||
| return list.find(m => new RegExp(`\\b${m}\\b`).test(low)); | ||
| } |
There was a problem hiding this comment.
Normalize specialty synonyms before comparing them.
findModifier() returns the first literal token, so equivalent titles like QA Engineer and Quality Assurance Engineer resolve to qa vs quality; Line 114 then rejects them as different roles. That will create duplicate tracker rows when the same posting is worded differently across sources.
Suggested fix
-const SPECIALTY_MODIFIERS = ['qa', 'qc', 'quality', 'test', 'assurance', 'safety', 'security'];
+const SPECIALTY_GROUPS = [
+ ['quality', ['qa', 'qc', 'quality', 'test', 'assurance']],
+ ['safety', ['safety']],
+ ['security', ['security']],
+];
function findModifier(text, list) {
const low = text.toLowerCase();
return list.find(m => new RegExp(`\\b${m}\\b`).test(low));
}
+
+function findModifierGroup(text, groups) {
+ const low = text.toLowerCase();
+ for (const [group, terms] of groups) {
+ if (terms.some(term => new RegExp(`\\b${term}\\b`).test(low))) return group;
+ }
+ return undefined;
+}
...
- const specA = findModifier(a, SPECIALTY_MODIFIERS);
- const specB = findModifier(b, SPECIALTY_MODIFIERS);
+ const specA = findModifierGroup(a, SPECIALTY_GROUPS);
+ const specB = findModifierGroup(b, SPECIALTY_GROUPS);Also applies to: 111-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@merge-tracker.mjs` around lines 78 - 84, findModifier and SPECIALTY_MODIFIERS
currently return the first literal token (e.g., "qa" vs "quality") causing
semantically identical specialties to be treated as different; update
SPECIALTY_MODIFIERS into a synonyms map (or add a separate CANONICAL_SPECIALTY
map) that maps variants like "qa", "qc", "quality", "test", "assurance" -> a
single canonical key (e.g., "quality_assurance"), then change findModifier to
lowercase the text, search for any synonym via the map, and return the canonical
value (not the raw matched token); update any downstream comparisons (where
findModifier is used) to compare canonical specialty keys.
| // Exact entry number match — only treated as a duplicate if company+role | ||
| // also fuzzy-match. A bare num collision (e.g. two unrelated TSVs both | ||
| // numbered 076) is NOT a duplicate; the new entry should get the next | ||
| // available number instead. (Bug fix 2026-04-29: prior logic would | ||
| // silently drop valid entries when an unrelated entry already used the | ||
| // same num.) | ||
| duplicate = existingApps.find(app => { | ||
| if (app.num !== addition.num) return false; | ||
| if (normalizeCompany(app.company) !== normalizeCompany(addition.company)) return false; | ||
| return roleFuzzyMatch(addition.role, app.role); | ||
| }); |
There was a problem hiding this comment.
Apply the distinct-ref short-circuit in the exact-num path too.
Line 307 still treats same num + same company + fuzzy role as a duplicate even when the two rows carry different portal IDs. A collision like req. 58035 vs req. 58036 will still skip or overwrite a valid addition before the ref-aware branch at Lines 312-324 runs.
Suggested fix
if (!duplicate) {
// Exact entry number match — only treated as a duplicate if company+role
// also fuzzy-match. A bare num collision (e.g. two unrelated TSVs both
// numbered 076) is NOT a duplicate; the new entry should get the next
// available number instead. (Bug fix 2026-04-29: prior logic would
// silently drop valid entries when an unrelated entry already used the
// same num.)
+ const newRefs = extractRefIds(`${addition.role} ${addition.notes || ''}`);
duplicate = existingApps.find(app => {
if (app.num !== addition.num) return false;
if (normalizeCompany(app.company) !== normalizeCompany(addition.company)) return false;
+ const oldRefs = extractRefIds(`${app.role} ${app.notes || ''}`);
+ if (newRefs.length > 0 && oldRefs.length > 0 && !newRefs.some(r => oldRefs.includes(r))) {
+ return false;
+ }
return roleFuzzyMatch(addition.role, app.role);
});
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@merge-tracker.mjs` around lines 298 - 308, The exact-num duplicate check (the
existingApps.find callback that sets duplicate) needs a distinct-ref
short-circuit: inside the predicate (where it compares app.num,
normalizeCompany(app.company) vs normalizeCompany(addition.company), and
roleFuzzyMatch(addition.role, app.role)) return false early if both records have
different refs (e.g. addition.ref and app.ref exist and addition.ref !==
app.ref) so that rows with the same num/company/role but different portal refs
are not treated as duplicates; add that check before calling roleFuzzyMatch to
mirror the ref-aware branch further down.
Summary
Two related issues in
merge-tracker.mjsthat cause valid TSV additions to be silently dropped or to overwrite unrelated existing entries.Bug 1 —
roleFuzzyMatchignored seniority and specialty modifiers"Junior Electrical QA Engineer"was treated as a duplicate of"Electrical Engineer"because they shared two long words (electrical,engineer). With scoreN/A < existing score, the new entry was silently skipped — so a real rejection (or any new entry on a sub-variant role at the same company) never made it into the tracker.Fix: gate fuzzy match on seniority and specialty modifier agreement before checking word overlap. Also add a reference-ID short-circuit (
#148182,job=148031,req. 58035,ref ABC-123-S) so entries with distinct portal IDs cannot be considered the same role regardless of title similarity.Bug 2 — exact-num match treated bare number collisions as duplicates
When a TSV's
numhappened to match an existing entry of a different company/role, the merge used that existing entry as a duplicate target. With higher existing score, the new TSV was dropped — silently rewriting the wrong row.Fix: only treat exact-num match as duplicate when company AND role also fuzzy-match. Bare num collisions fall through to the "new entry" path, which auto-assigns the next available number via the existing logic.
Why this matters
Both bugs failed silently: TSVs were marked as merged (moved to
merged/), the script reported success, but the tracker either lost the row or had the wrong row updated. Recovery required manual editing ofapplications.md— which the data contract discourages.Test plan
roleFuzzyMatchunit tests (7/7 passed):Junior Electrical QA Engineer≠Electrical Engineer – Düsseldorf(bug fix)Senior Electrical Engineer – Mons=Senior Electrical Engineer – Dublin(legit dup)Electrical Engineer – Frankfurt (3)=Electrical Engineer – Frankfurt (4)(legit numbered variants)Senior Electrical Engineer≠Electrical Engineer(seniority mismatch)Lead Commissioning Engineer≠Commissioning Engineer(seniority mismatch)Commissioning Manager=Commissioning Engineerwhen no seniority modifier on either side (allowed)Electrical QA Engineer≠Electrical Engineer(specialty mismatch)extractRefIdsrecognises four ID patterns (#NNN, job=NNN, req. NNN, ref ABC-...).🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes