
feat(scan): optional location_filter in portals.yml + persist location to scan-history#570

Open
mstine wants to merge 1 commit into santifer:main from mstine:feat/location-filter

Conversation


@mstine mstine commented May 4, 2026

Closes #569.

Summary

Adds opt-in location filtering to scan.mjs. Backwards compatible — if the new location_filter block is absent from portals.yml, behavior is identical to before.

Changes

  • scan.mjs (~25 lines net):
    • New buildLocationFilter() helper alongside buildTitleFilter()
    • Applied in the filter chain after title, before dedup
    • New Filtered by location counter in the summary output
    • Location persisted as 7th column in scan-history.tsv (appended — existing 6-column files still parse fine since loadSeenUrls only reads column 0)
  • templates/portals.example.yml: documented commented-out example block above title_filter
  • modes/scan.md: brief addition to the filter section (step 6b)
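For orientation, the documented `location_filter` block in `templates/portals.example.yml` has roughly this shape (the keyword values below are illustrative, not the shipped defaults, and the template ships the block commented out):

```yaml
# Optional — omit this block entirely to keep current behavior (allow all).
location_filter:
  allow:            # job passes if its location contains at least one of these
    - united states
    - atlanta
    - new york
  block:            # takes precedence over allow
    - germany
```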

Filter semantics

  • Empty job location string → pass (don't penalize missing data)
  • Any block keyword present → reject (precedence over allow)
  • allow empty → pass once block is cleared
  • allow non-empty → must match at least one
  • All matches case-insensitive substring
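These semantics can be sketched as a small predicate builder. This is a hypothetical illustration of the rules above, not the `scan.mjs` source verbatim:

```javascript
// Sketch of the documented filter semantics (illustrative, not the shipped code).
function buildLocationFilter(locationFilter) {
  if (!locationFilter) return () => true; // no block in portals.yml → allow all
  const allow = (locationFilter.allow || []).map(k => k.toLowerCase());
  const block = (locationFilter.block || []).map(k => k.toLowerCase());
  return (location) => {
    if (!location) return true;                           // empty location → pass
    const lower = location.toLowerCase();
    if (block.some(k => lower.includes(k))) return false; // block wins over allow
    if (allow.length === 0) return true;                  // allow empty → pass
    return allow.some(k => lower.includes(k));            // must match at least one
  };
}

const pass = buildLocationFilter({ allow: ['united states', 'atlanta'], block: ['germany'] });
console.log(pass('Atlanta, GA, United States')); // true
console.log(pass('Berlin, Germany'));            // false
console.log(pass(''));                           // true (missing data passes)
```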

Verification

Local re-scan against 13 enabled companies on a real portals.yml with the new filter configured for US-only + ATL/NYC metros:

| Metric | Before | After |
| --- | --- | --- |
| Total jobs found | 3,144 | 3,144 |
| Filtered by title | 2,730 | 2,730 |
| Filtered by location | n/a | 308 |
| New offers added | 273 | 99 |

npm run doctor passes. Manually verified scan-history.tsv rows now include location column for audit.
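With the new column, an appended scan-history.tsv row looks something like this (tab-separated; URL, company, and values are hypothetical):

```text
url	first_seen	portal	title	company	status	location
https://example.com/jobs/123	2026-05-04	lever	Solutions Engineer	ExampleCo	added	Atlanta, GA, United States
```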

Test plan

  • npm run doctor passes
  • Real scan with location_filter configured filters foreign roles correctly
  • Scan with no location_filter block produces identical behavior to pre-change (backwards compat)
  • scan-history.tsv 7th column populates with the job's location string
  • Maintainer review of API design choices (substring vs regex, allow/block precedence)

Non-goals

  • No regex support — substring-only keeps the code surface tiny
  • No per-company location overrides — companies stay enabled/disabled at the company level
  • Doesn't change pipeline.md line format — avoids breaking downstream parsers

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added optional location-based filtering to the scan workflow with case-insensitive matching and configurable allow/block rules.
    • Location information is now recorded in scan history for audit purposes.
    • Enhanced console output to separately report location-filtered job removals.
  • Documentation

    • Added comprehensive configuration documentation and examples for the new location filtering feature.

feat(scan): optional location_filter in portals.yml + persist location to scan-history

Adds opt-in location filtering to scan.mjs. Backwards compatible — absent
location_filter block means "allow all" (current behavior).

Semantics:
  - Empty job location → pass (don't penalize missing data)
  - block keyword present → reject (precedence over allow)
  - allow empty → pass (cleared block)
  - allow non-empty → must match at least one keyword
  - All matches case-insensitive substring

Also persists location as 7th column in scan-history.tsv. Appended at the
end so existing 6-column files continue to parse (loadSeenUrls only reads
column 0).

- scan.mjs: buildLocationFilter() + filter step + counter + tsv column
- templates/portals.example.yml: documented commented-out example block
- modes/scan.md: brief addition to the existing filter section (step 6b)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

A new optional location filtering stage is added to the job scanner. Jobs are filtered by case-insensitive substring matching against configured allow/block lists, with block taking precedence. Job location is now persisted as a 7th column in the scan history for audit purposes. All changes are backward-compatible.

Changes

Location Filtering Feature

| Layer / File(s) | Summary |
| --- | --- |
| Configuration schema & documentation: `templates/portals.example.yml`, `modes/scan.md` | New location_filter block documented with allow/block semantics (case-insensitive substring matching, block precedence, empty allow passes, non-empty allow requires match). Example configuration provided and marked as optional/commented out. |
| Filtering predicate: `scan.mjs` (lines 137–158) | `buildLocationFilter()` creates a predicate that gates jobs by `job.location`: missing/empty values pass; block matches reject with precedence; empty allow passes; non-empty allow requires at least one substring match (case-insensitive). |
| Scan loop integration: `scan.mjs` (lines 313–334) | Main scan loop extended to apply both title and location filter predicates; new counters `totalFilteredTitle` and `totalFilteredLocation` track removals separately; jobs failing either filter increment the appropriate counter and skip further processing. |
| Filter instantiation: `scan.mjs` (line 292) | `locationFilter` predicate is built from `config.location_filter` once at startup in `main()`. |
| History persistence: `scan.mjs` (lines 243–253) | `appendToScanHistory()` updated to write a 7-column header including location when the file is created; each appended row now includes `o.location` as the final column. Backward-compatible: existing 6-column files continue to parse since `loadSeenUrls` reads only column 0. |
| Reporting: `scan.mjs` (lines 368–369) | Console summary output extended to report title-filtered and location-filtered removal counts separately (replacing the previous combined title-filter report). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding optional location filtering and persisting location data to scan history. |
| Linked issues check | ✅ Passed | The code changes fully implement the requirements from issue #569: optional location_filter with proper semantics, persistence to scan-history.tsv, and documentation. |
| Out-of-scope changes check | ✅ Passed | All changes directly support the location filtering feature. No unrelated modifications or scope creep detected. |
| Docstring coverage | ✅ Passed | Docstring coverage is 83.33%, which meets the required threshold of 80.00%. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modes/scan.md`:
- Around line 112-119: The Scan History examples and Step 8 still show the old
6-column schema; update all example rows and Step 8 to include the new 7th
column (location) and ensure the description mentions that location is persisted
as the 7th column in scan-history.tsv; search for references to location_filter,
portals.yml, scan-history.tsv, "Scan History" and Step 8 in this file and modify
the sample TSV rows and any explanatory text to use seven tab-separated columns
(with the new location column as column 7) and note that matching is
case-insensitive substring per the location_filter rules.

In `@scan.mjs`:
- Around line 146-156: The buildLocationFilter function currently calls .map()
on locationFilter.allow and locationFilter.block assuming arrays; guard and
normalize these fields first so non-array or string values don't crash the scan:
check Array.isArray(locationFilter.allow) and
Array.isArray(locationFilter.block) (or coerce strings into single-element
arrays) before mapping, default to [] when missing or invalid, then proceed to
create the lower-cased allow and block lists and return the existing predicate.
Update the logic around the allow/block creation in buildLocationFilter to
perform this normalization.
- Around line 242-248: The current appendToScanHistory only creates a new file
with the 7-column header but does not upgrade existing legacy files with a
6-column header; update appendToScanHistory to detect and migrate an old header:
when SCAN_HISTORY_PATH exists, read the first line (utf-8), split on '\t' and if
it lacks the 'location' column or has fewer than 7 columns, replace the header
line with 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n' and
rewrite the file preserving all subsequent lines; keep the existing behavior of
creating the file when missing and ensure migration is a one-time header-only
change so no rows are shifted or data lost.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 35d3210d-f4f4-4d3f-8754-48e9880ec451

📥 Commits

Reviewing files that changed from the base of the PR and between 13b4fef and a74d6af.

📒 Files selected for processing (3)
  • modes/scan.md
  • scan.mjs
  • templates/portals.example.yml

Comment thread modes/scan.md
Comment on lines +112 to +119
6b. **Filter by location (optional)** using `location_filter` from `portals.yml`:
- If the `location_filter` block is absent, all locations pass (default behavior)
- Empty location on an offer → pass (don't penalize missing data)
- Any `block` keyword present → reject (takes precedence over allow)
- `allow` empty → pass (block already cleared)
- `allow` non-empty → must match at least one keyword
- All matches are case-insensitive substring
- Location is persisted as the 7th column in `scan-history.tsv` for later audit

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the later scan-history.tsv examples to 7 columns as well.

This section introduces location as column 7, but Step 8 and the Scan History sample below still show 6-field rows. Leaving both schemas in one doc will cause manual scan workflows to keep writing the old format.


Comment thread scan.mjs
Comment on lines +146 to +156
```js
function buildLocationFilter(locationFilter) {
  if (!locationFilter) return () => true;
  const allow = (locationFilter.allow || []).map(k => k.toLowerCase());
  const block = (locationFilter.block || []).map(k => k.toLowerCase());

  return (location) => {
    if (!location) return true;
    const lower = location.toLowerCase();
    if (block.length > 0 && block.some(k => lower.includes(k))) return false;
    if (allow.length === 0) return true;
    return allow.some(k => lower.includes(k));
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard location_filter.allow and block before calling .map().

Lines 148-149 assume both YAML values are arrays. A common config typo like allow: "Remote" will throw here and abort the entire scan before any company is processed. Normalize or coerce these fields first so the optional filter degrades gracefully instead of crashing.

🛠️ Suggested hardening
```diff
 function buildLocationFilter(locationFilter) {
   if (!locationFilter) return () => true;
-  const allow = (locationFilter.allow || []).map(k => k.toLowerCase());
-  const block = (locationFilter.block || []).map(k => k.toLowerCase());
+  const toKeywords = (value) =>
+    (Array.isArray(value) ? value : typeof value === 'string' ? [value] : [])
+      .filter(v => typeof v === 'string' && v.trim() !== '')
+      .map(v => v.toLowerCase());
+  const allow = toKeywords(locationFilter.allow);
+  const block = toKeywords(locationFilter.block);

   return (location) => {
-    if (!location) return true;
-    const lower = location.toLowerCase();
+    const lower = String(location ?? '').trim().toLowerCase();
+    if (!lower) return true;
     if (block.length > 0 && block.some(k => lower.includes(k))) return false;
     if (allow.length === 0) return true;
     return allow.some(k => lower.includes(k));
   };
 }
```
As per coding guidelines, "Ensure scripts handle missing data/ directories gracefully."
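With that normalization, a string-valued `allow` degrades gracefully instead of aborting the scan. A standalone sketch of the hardened helper (same logic as the diff above, extracted so it can be run in isolation; not the committed code):

```javascript
// Normalize allow/block to lowercase keyword arrays; invalid values become [].
const toKeywords = (value) =>
  (Array.isArray(value) ? value : typeof value === 'string' ? [value] : [])
    .filter(v => typeof v === 'string' && v.trim() !== '')
    .map(v => v.toLowerCase());

function buildLocationFilter(locationFilter) {
  if (!locationFilter) return () => true;
  const allow = toKeywords(locationFilter.allow);
  const block = toKeywords(locationFilter.block);
  return (location) => {
    const lower = String(location ?? '').trim().toLowerCase();
    if (!lower) return true;
    if (block.some(k => lower.includes(k))) return false;
    if (allow.length === 0) return true;
    return allow.some(k => lower.includes(k));
  };
}

// A config typo like `allow: "Remote"` no longer crashes the scan:
console.log(buildLocationFilter({ allow: 'Remote' })('Remote, US')); // true
console.log(buildLocationFilter({ allow: 42 })('Berlin'));           // true (invalid allow → allow all)
```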

Comment thread scan.mjs
Comment on lines 242 to 248
```diff
 function appendToScanHistory(offers, date) {
-  // Ensure file + header exist
+  // Ensure file + header exist. Location appended as 7th column for non-breaking
+  // backward compat — older scan-history.tsv files with 6 columns still parse fine
+  // since loadSeenUrls only reads column 0.
   if (!existsSync(SCAN_HISTORY_PATH)) {
-    writeFileSync(SCAN_HISTORY_PATH, 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\n', 'utf-8');
+    writeFileSync(SCAN_HISTORY_PATH, 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n', 'utf-8');
   }
```
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Migrate legacy scan-history.tsv headers before appending location rows.

Line 247 only fixes brand-new files. Existing repositories keep the old 6-column header forever, so the new location field lands under an unlabeled column. URL dedup still works, but the scan-history audit/export format stays inconsistent until the header is upgraded once.

🧾 Suggested one-time header upgrade

```diff
 function appendToScanHistory(offers, date) {
+  const header6 = 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus';
+  const header7 = `${header6}\tlocation`;
+
   // Ensure file + header exist. Location appended as 7th column for non-breaking
   // backward compat — older scan-history.tsv files with 6 columns still parse fine
   // since loadSeenUrls only reads column 0.
   if (!existsSync(SCAN_HISTORY_PATH)) {
-    writeFileSync(SCAN_HISTORY_PATH, 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n', 'utf-8');
+    writeFileSync(SCAN_HISTORY_PATH, `${header7}\n`, 'utf-8');
+  } else {
+    const text = readFileSync(SCAN_HISTORY_PATH, 'utf-8');
+    if (text.startsWith(`${header6}\n`) && !text.startsWith(`${header7}\n`)) {
+      writeFileSync(
+        SCAN_HISTORY_PATH,
+        text.replace(`${header6}\n`, `${header7}\n`),
+        'utf-8',
+      );
+    }
   }

   const lines = offers.map(o =>
     `${o.url}\t${date}\t${o.source}\t${o.title}\t${o.company}\tadded\t${o.location || ''}`
```
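The header upgrade can be exercised in isolation. A minimal runnable sketch under the assumption that only the header line changes and all data rows are preserved (the file path and helper name here are hypothetical):

```javascript
import { writeFileSync, readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

const HEADER6 = 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus';
const HEADER7 = `${HEADER6}\tlocation`;

// One-time, idempotent upgrade: rewrite only a legacy 6-column header line.
function migrateHeader(path) {
  const text = readFileSync(path, 'utf-8');
  if (text.startsWith(`${HEADER6}\n`) && !text.startsWith(`${HEADER7}\n`)) {
    writeFileSync(path, `${HEADER7}\n` + text.slice(HEADER6.length + 1), 'utf-8');
  }
}

// Demo against a throwaway legacy file with one 6-column row.
const file = join(tmpdir(), 'scan-history-demo.tsv');
writeFileSync(
  file,
  `${HEADER6}\nhttps://example.com/j/1\t2026-05-01\tlever\tSE\tExampleCo\tadded\n`,
  'utf-8',
);
migrateHeader(file);
console.log(readFileSync(file, 'utf-8').split('\n')[0]); // now the 7-column header
```

Running it twice is safe: the second call sees the 7-column header and leaves the file untouched.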

deepak-glitch pushed a commit to deepak-glitch/career-ops that referenced this pull request May 6, 2026
…Fs, archived 3 below-threshold)

- scan.mjs added 2 new offers (ElevenLabs Enterprise Solutions Engineer NA + EU-German)
- Level 3 WebSearch added 4 Lever FDE candidates (Research Innovations [closed], JetBridge, zaimler, Symmetry Systems Associate L3)
- 5 A-G evaluations: santifer#566 ElevenLabs NA (3.2/5, PDF), santifer#567 ElevenLabs EU German (1.5/5 SKIP), santifer#568 JetBridge LATAM (2.4/5 SKIP), santifer#569 zaimler NYC (2.7/5 SKIP), santifer#570 Symmetry L3 (3.0/5, PDF — fuzzy-merged into existing santifer#105)
- Research Innovations TC2259 marked closed (Lever 404; defense/EW + US Secret + 8-12 yrs hard veto)
- All reports include Location in header per Shortlist Display Policy
- 0 errors / 0 warnings on verify-pipeline.mjs

https://claude.ai/code/session_overnight-2026-05-06T04:26Z

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: optional location_filter in portals.yml + persist location to scan-history.tsv

1 participant