
fix(copyright): drop legal-prose false positives #825

Merged
mstykow merged 6 commits into main from fix/copyright-legal-prose-false-positives on Apr 30, 2026

mstykow (Owner) commented Apr 29, 2026

Summary

  • filter generic legal-prose copyright and holder fragments in the shared copyright refiner instead of special-casing the local compare target
  • normalize smart quotes during refiner punctuation stripping; after narrowing the earlier over-broad refiner pass, add preservation coverage for legitimate notice-shaped copyrights and holders
  • add detector and refiner regressions for the Meta SDK-style legal prose, and update one to_improve golden fixture where a bare "copyright" holder is now correctly treated as junk holder noise
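
The smart-quote normalization mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not the refiner code from this PR: the function names normalize_smart_quotes and strip_edge_punctuation, and the exact set of stripped characters, are assumptions.

```rust
/// Hypothetical helper (not the actual refiner API): map typographic
/// quotes to their ASCII equivalents so later stripping sees one form.
fn normalize_smart_quotes(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'', // left/right single quotes
            '\u{201C}' | '\u{201D}' => '"',  // left/right double quotes
            other => other,
        })
        .collect()
}

/// Hypothetical helper: normalize quotes first, then trim whitespace and a
/// small, assumed set of punctuation from both ends of the candidate text.
fn strip_edge_punctuation(s: &str) -> String {
    let normalized = normalize_smart_quotes(s);
    normalized
        .trim_matches(|c: char| c.is_whitespace() || matches!(c, '"' | '\'' | ',' | ';' | ':'))
        .to_string()
}
```

Normalizing before stripping means a holder wrapped in curly quotes is trimmed the same way as one wrapped in straight quotes, while quotes in the middle of the text (for example a possessive apostrophe) are kept.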

Issues

  • Covers:
  • Closes:

Scope and exclusions

  • Included:
    • shared refiner heuristics for copyright and holder junk filtering
    • preservation tests for "(c) the European Community 2007" and "Copyright (c) 1988, 1993"
    • regression tests for the observed legal-prose false-positive class
    • one justified golden-fixture expectation update for misco4/to_improve/junk-copyright-224.txt.yml
    • rerun validation with "cargo test --features golden-tests --test copyright_golden test_golden_copyrights" and with compare-outputs on .provenant/LICENSE using profile common
  • Explicit exclusions:

Follow-up work

  • Created or intentionally deferred:
    • investigate whether the remaining URL trailing-slash mismatch should be normalized in Provenant output, normalized in compare reduction, or left as-is

Expected-output fixture changes

  • Files changed: testdata/copyright-golden/copyrights/misco4/to_improve/junk-copyright-224.txt.yml
  • Why the new expected output is correct: the fixture input is only "copyright _copyright", so a bare "copyright" holder is junk noise rather than a real holder identity; the narrowed refiner now preserves legitimate copyright-prefixed notice holders while dropping this singleton junk case
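
A minimal sketch of the singleton-junk rule described above, assuming a holder counts as junk only when nothing but the word "copyright" remains after trimming surrounding non-alphanumeric characters. The names is_junk_holder and refine_holder are hypothetical and do not come from the Provenant code base.

```rust
/// Hypothetical check: a holder is junk when, after trimming surrounding
/// non-alphanumeric characters, it is empty or just the word "copyright".
fn is_junk_holder(holder: &str) -> bool {
    let core = holder.trim_matches(|c: char| !c.is_alphanumeric());
    core.is_empty() || core.eq_ignore_ascii_case("copyright")
}

/// Hypothetical refiner step: drop junk holders, keep everything else.
fn refine_holder(holder: &str) -> Option<String> {
    if is_junk_holder(holder) {
        None
    } else {
        Some(holder.trim().to_string())
    }
}
```

Because the comparison is against the whole trimmed string, a holder that merely starts with "Copyright" is kept; only the bare singleton is dropped.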

mstykow and others added 6 commits April 29, 2026 23:18
Each of the 6 commits carries the trailer:
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>

Three of the commits additionally carry:
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@mstykow mstykow merged commit 80139ca into main Apr 30, 2026
15 checks passed
@mstykow mstykow deleted the fix/copyright-legal-prose-false-positives branch April 30, 2026 09:32