Skip to content

fix(merge-tracker): prevent false-positive duplicates from fuzzy match and num collisions#538

Open
Mutykov wants to merge 1 commit intosantifer:mainfrom
Mutykov:fix/merge-tracker-fuzzy-and-num-collisions
Open

fix(merge-tracker): prevent false-positive duplicates from fuzzy match and num collisions#538
Mutykov wants to merge 1 commit intosantifer:mainfrom
Mutykov:fix/merge-tracker-fuzzy-and-num-collisions

Conversation

@Mutykov
Copy link
Copy Markdown

@Mutykov Mutykov commented Apr 29, 2026

Summary

Two related issues in merge-tracker.mjs that cause valid TSV additions to be silently dropped or to overwrite unrelated existing entries.

Bug 1 — roleFuzzyMatch ignored seniority and specialty modifiers

"Junior Electrical QA Engineer" was treated as a duplicate of "Electrical Engineer" because they shared two long words (electrical, engineer). With score N/A < existing score, the new entry was silently skipped — so a real rejection (or any new entry on a sub-variant role at the same company) never made it into the tracker.

Fix: gate fuzzy match on seniority and specialty modifier agreement before checking word overlap. Also add a reference-ID short-circuit (#148182, job=148031, req. 58035, ref ABC-123-S) so entries with distinct portal IDs cannot be considered the same role regardless of title similarity.

Bug 2 — exact-num match treated bare number collisions as duplicates

When a TSV's num happened to match an existing entry of a different company/role, the merge used that existing entry as a duplicate target. With higher existing score, the new TSV was dropped — silently rewriting the wrong row.

Fix: only treat exact-num match as duplicate when company AND role also fuzzy-match. Bare num collisions fall through to the "new entry" path, which auto-assigns the next available number via the existing logic.

Why this matters

Both bugs failed silently: TSVs were marked as merged (moved to merged/), the script reported success, but the tracker either lost the row or had the wrong row updated. Recovery required manual editing of applications.md — which the data contract discourages.

Test plan

  • roleFuzzyMatch unit tests (7/7 passed):
    • Junior Electrical QA EngineerElectrical Engineer – Düsseldorf (bug fix)
    • Senior Electrical Engineer – Mons = Senior Electrical Engineer – Dublin (legit dup)
    • Electrical Engineer – Frankfurt (3) = Electrical Engineer – Frankfurt (4) (legit numbered variants)
    • Senior Electrical EngineerElectrical Engineer (seniority mismatch)
    • Lead Commissioning EngineerCommissioning Engineer (seniority mismatch)
    • Commissioning Manager = Commissioning Engineer when no seniority modifier on either side (allowed)
    • Electrical QA EngineerElectrical Engineer (specialty mismatch)
  • Exact-num collision tests (3/3 passed):
    • Same num + different company → not duplicate (bug fix)
    • Same num + same company + different role → not duplicate (bug fix)
    • Same num + same company + matching role → still duplicate (legit re-eval)
  • extractRefIds recognises four ID patterns (#NNN, job=NNN, req. NNN, ref ABC-...).
  • End-to-end run on a real session with 14 new TSVs — no false-positive skips, no overwrites.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved accuracy of duplicate role detection to prevent incorrectly merging unrelated entries.
    • Enhanced role comparison to better distinguish roles with different seniority levels and specialties (e.g., QA, security-related roles).
    • Added reference ID matching to ensure roles from different sources are treated as distinct entries.

…h and num collisions

Two related issues caused valid TSV additions to be silently dropped or to
overwrite unrelated existing entries:

1. roleFuzzyMatch ignored seniority and specialty modifiers.
   "Junior Electrical QA Engineer" was treated as a duplicate of
   "Electrical Engineer" because they shared two long words. With score
   N/A < existing score, the new entry was silently skipped.

   Fix: gate fuzzy match on seniority (junior/senior/lead/principal/
   staff/head/chief) and specialty (qa/qc/quality/test/assurance/safety/
   security) modifier agreement before checking word overlap. Add a
   reference-ID short-circuit (#148182, job=148031, req. 58035, ref
   ABC-123-S) so entries with distinct portal IDs cannot be considered
   the same role regardless of title similarity.

2. exact-num match treated bare number collisions as duplicates.
   When a TSV's num matched an existing entry of a different
   company/role, the merge silently used that as duplicate target. With
   higher existing score, the new TSV was dropped.

   Fix: only treat exact num match as duplicate when company AND role
   also fuzzy-match. Bare num collisions fall through to the "new
   entry" path, which auto-assigns the next available number.

Both fixes verified against the real cases that triggered them:
- Junior Electrical QA Engineer not flagged as dup of Electrical Engineer
- Same num + different company falls through to next-num path
- Same num + same company + matching role still treated as legit re-eval
@github-actions
Copy link
Copy Markdown
Contributor

Welcome to career-ops, @Mutykov! Thanks for your first PR.

A few things to know:

  • Tests will run automatically — check the status below
  • Make sure you've linked a related issue (required for features)
  • Read CONTRIBUTING.md if you haven't

We'll review your PR soon. Join our Discord if you have questions.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

Role deduplication logic in merge-tracker.mjs was enhanced with stricter matching rules. The roleFuzzyMatch() function now checks for seniority and specialty modifier differences, and duplicate detection now requires normalized company equality, positive fuzzy match results, and overlapping reference IDs when both entries have them.

Changes

Cohort / File(s) Summary
Role Deduplication Enhancement
merge-tracker.mjs
Expanded roleFuzzyMatch() to return false on seniority/specialty modifier mismatches; refined duplicate detection to require company normalization match and positive fuzzy match alongside numeric collision; added portal/job reference ID extraction and comparison to treat entries as distinct when IDs don't overlap.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

🔧 scripts

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately reflects the main change: preventing false-positive duplicates caused by fuzzy matching and numeric collisions in merge-tracker.mjs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/merge-tracker-fuzzy-and-num-collisions

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@merge-tracker.mjs`:
- Around line 78-84: findModifier and SPECIALTY_MODIFIERS currently return the
first literal token (e.g., "qa" vs "quality") causing semantically identical
specialties to be treated as different; update SPECIALTY_MODIFIERS into a
synonyms map (or add a separate CANONICAL_SPECIALTY map) that maps variants like
"qa", "qc", "quality", "test", "assurance" -> a single canonical key (e.g.,
"quality_assurance"), then change findModifier to lowercase the text, search for
any synonym via the map, and return the canonical value (not the raw matched
token); update any downstream comparisons (where findModifier is used) to
compare canonical specialty keys.
- Around line 298-308: The exact-num duplicate check (the existingApps.find
callback that sets duplicate) needs a distinct-ref short-circuit: inside the
predicate (where it compares app.num, normalizeCompany(app.company) vs
normalizeCompany(addition.company), and roleFuzzyMatch(addition.role, app.role))
return false early if both records have different refs (e.g. addition.ref and
app.ref exist and addition.ref !== app.ref) so that rows with the same
num/company/role but different portal refs are not treated as duplicates; add
that check before calling roleFuzzyMatch to mirror the ref-aware branch further
down.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6e8c84b-ebfb-4cbf-849b-4f77505afb17

📥 Commits

Reviewing files that changed from the base of the PR and between 62b767d and f124d99.

📒 Files selected for processing (1)
  • merge-tracker.mjs

Comment thread merge-tracker.mjs
Comment on lines +78 to +84
// Specialty modifiers — same logic for role specialty (QA / quality / test / safety / security).
const SPECIALTY_MODIFIERS = ['qa', 'qc', 'quality', 'test', 'assurance', 'safety', 'security'];

function findModifier(text, list) {
const low = text.toLowerCase();
return list.find(m => new RegExp(`\\b${m}\\b`).test(low));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize specialty synonyms before comparing them.

findModifier() returns the first literal token, so equivalent titles like QA Engineer and Quality Assurance Engineer resolve to qa vs quality; Line 114 then rejects them as different roles. That will create duplicate tracker rows when the same posting is worded differently across sources.

Suggested fix
-const SPECIALTY_MODIFIERS = ['qa', 'qc', 'quality', 'test', 'assurance', 'safety', 'security'];
+const SPECIALTY_GROUPS = [
+  ['quality', ['qa', 'qc', 'quality', 'test', 'assurance']],
+  ['safety', ['safety']],
+  ['security', ['security']],
+];

 function findModifier(text, list) {
   const low = text.toLowerCase();
   return list.find(m => new RegExp(`\\b${m}\\b`).test(low));
 }
+
+function findModifierGroup(text, groups) {
+  const low = text.toLowerCase();
+  for (const [group, terms] of groups) {
+    if (terms.some(term => new RegExp(`\\b${term}\\b`).test(low))) return group;
+  }
+  return undefined;
+}
...
-  const specA = findModifier(a, SPECIALTY_MODIFIERS);
-  const specB = findModifier(b, SPECIALTY_MODIFIERS);
+  const specA = findModifierGroup(a, SPECIALTY_GROUPS);
+  const specB = findModifierGroup(b, SPECIALTY_GROUPS);

Also applies to: 111-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@merge-tracker.mjs` around lines 78 - 84, findModifier and SPECIALTY_MODIFIERS
currently return the first literal token (e.g., "qa" vs "quality") causing
semantically identical specialties to be treated as different; update
SPECIALTY_MODIFIERS into a synonyms map (or add a separate CANONICAL_SPECIALTY
map) that maps variants like "qa", "qc", "quality", "test", "assurance" -> a
single canonical key (e.g., "quality_assurance"), then change findModifier to
lowercase the text, search for any synonym via the map, and return the canonical
value (not the raw matched token); update any downstream comparisons (where
findModifier is used) to compare canonical specialty keys.

Comment thread merge-tracker.mjs
Comment on lines +298 to +308
// Exact entry number match — only treated as a duplicate if company+role
// also fuzzy-match. A bare num collision (e.g. two unrelated TSVs both
// numbered 076) is NOT a duplicate; the new entry should get the next
// available number instead. (Bug fix 2026-04-29: prior logic would
// silently drop valid entries when an unrelated entry already used the
// same num.)
duplicate = existingApps.find(app => {
if (app.num !== addition.num) return false;
if (normalizeCompany(app.company) !== normalizeCompany(addition.company)) return false;
return roleFuzzyMatch(addition.role, app.role);
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Apply the distinct-ref short-circuit in the exact-num path too.

Line 307 still treats same num + same company + fuzzy role as a duplicate even when the two rows carry different portal IDs. A collision like req. 58035 vs req. 58036 will still skip or overwrite a valid addition before the ref-aware branch at Lines 312-324 runs.

Suggested fix
   if (!duplicate) {
     // Exact entry number match — only treated as a duplicate if company+role
     // also fuzzy-match. A bare num collision (e.g. two unrelated TSVs both
     // numbered 076) is NOT a duplicate; the new entry should get the next
     // available number instead. (Bug fix 2026-04-29: prior logic would
     // silently drop valid entries when an unrelated entry already used the
     // same num.)
+    const newRefs = extractRefIds(`${addition.role} ${addition.notes || ''}`);
     duplicate = existingApps.find(app => {
       if (app.num !== addition.num) return false;
       if (normalizeCompany(app.company) !== normalizeCompany(addition.company)) return false;
+      const oldRefs = extractRefIds(`${app.role} ${app.notes || ''}`);
+      if (newRefs.length > 0 && oldRefs.length > 0 && !newRefs.some(r => oldRefs.includes(r))) {
+        return false;
+      }
       return roleFuzzyMatch(addition.role, app.role);
     });
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@merge-tracker.mjs` around lines 298 - 308, The exact-num duplicate check (the
existingApps.find callback that sets duplicate) needs a distinct-ref
short-circuit: inside the predicate (where it compares app.num,
normalizeCompany(app.company) vs normalizeCompany(addition.company), and
roleFuzzyMatch(addition.role, app.role)) return false early if both records have
different refs (e.g. addition.ref and app.ref exist and addition.ref !==
app.ref) so that rows with the same num/company/role but different portal refs
are not treated as duplicates; add that check before calling roleFuzzyMatch to
mirror the ref-aware branch further down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant