fix(benchmark): verify apache hadoop target#851

Merged
mstykow merged 2 commits into main from verify/hadoop
May 6, 2026

@mstykow (Owner) commented May 6, 2026

Summary

  • verify apache/hadoop at dbcc7cd797100e6b32cd84f85b53a5193a5f9af0 with a benchmark-grade compare-outputs run and record the result in docs/BENCHMARKS.md
  • fix generic sourcemap, copyright, metadata-cleanup, tree-walk, and comment-author paths exposed by Hadoop parity review
  • regenerate docs/scan-duration-vs-files.svg after adding the Hadoop timing row (278.63s vs 4575.96s, 16.42× faster)

Issues

  • Covers: Hadoop benchmark verification for https://github.com/apache/hadoop at dbcc7cd797100e6b32cd84f85b53a5193a5f9af0
  • Closes:

Scope and exclusions

  • Included:
    • generic sourcemap license/party handling improvements
    • generic copyright/holder extraction and metadata-cleanup fixes exposed by Hadoop
    • Hadoop benchmark entry and regenerated benchmark chart
  • Explicit exclusions:
    • no parser .expected.json or golden fixture updates
    • no attempt to chase reviewed JDiff/CHANGELOG/NOTICE noise that did not justify another general fix

Intentional differences from Python

  • keep reviewed differences where Provenant is deliberately cleaner or more specific, including de-duplicating repeated MIT sourcemap detections and rejecting ScanCode's Apache-2.0 AND LGPL-2.1-only over-assertion on plain Apache notice headers
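
To illustrate the de-duplication idea only, here is a minimal sketch. It assumes a simplified, hypothetical `Detection` record and a hypothetical `collapse_repeated_detections` helper; neither is taken from the Provenant codebase, and the real logic may key on different fields.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a license detection record.
/// The field names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Detection {
    pub license_expression: String,
    pub source_path: String,
}

/// Keep the first occurrence of each identical detection and drop
/// later repeats, preserving the original order.
pub fn collapse_repeated_detections(detections: Vec<Detection>) -> Vec<Detection> {
    let mut seen: HashSet<Detection> = HashSet::new();
    let mut collapsed = Vec::new();
    for detection in detections {
        // `insert` returns false when an equal value is already present.
        if seen.insert(detection.clone()) {
            collapsed.push(detection);
        }
    }
    collapsed
}
```

Under these assumptions, two identical `mit` detections reported for the same sourcemap collapse into one, while a distinct expression on the same file is kept.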

Follow-up work

  • Created or intentionally deferred:
    • deferred low-value NOTICE/author cleanup that did not justify another robust generic fix after the Hadoop review

Validation

  • cargo test test_collapse_repeated_sourcemap_license_detections_combines_concrete_detections
  • cargo test test_detect_multiline_x_editable_actual_line5_shape_survives_postprocess_boundary
  • cargo test test_detect_multiline_x_editable_actual_line5_shape_end_to_end
  • cargo test test_collect_trailing_orphan_tokens_absorbs_name_before_legal_tail
  • cargo test test_derive_holder_from_simple_copyright_string_strips_and_onwards_prefix
  • cargo test test_extract_comment_author_supplements_handles_html_comment_by_line
  • npm run check:docs
  • cargo run --manifest-path xtask/Cargo.toml --bin compare-outputs -- --repo-url https://github.com/apache/hadoop.git --repo-ref 7b92f0bc7f8fb8f243635493da76e8f59475ef87 --profile common

Benchmark evidence

  • Target snapshot: apache/hadoop @ dbcc7cd797100e6b32cd84f85b53a5193a5f9af0
  • Compare artifacts: .provenant/compare-runs/20260506T074747Z-hadoop-28100
  • Files: 16,370
  • Timing: Provenant 278.63s; ScanCode 4575.96s (16.42× faster)
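
The speedup figure follows from dividing the ScanCode duration by the Provenant duration. The sketch below reproduces that arithmetic; the function names `speedup` and `format_speedup` are hypothetical and are not part of the `compare-outputs` tooling.

```rust
/// Ratio of the baseline duration to the candidate duration.
/// A value above 1.0 means the candidate is faster.
pub fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Format the ratio with two decimals, in the style used above.
pub fn format_speedup(baseline_secs: f64, candidate_secs: f64) -> String {
    format!("{:.2}×", speedup(baseline_secs, candidate_secs))
}
```

Calling `format_speedup(4575.96, 278.63)` yields `16.42×`, matching the reported timing row.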

Expected-output fixture changes

  • Files changed: None
  • Why the new expected output is correct: no golden or .expected fixtures changed; this branch updates runtime behavior and benchmark docs only

mstykow added 2 commits May 6, 2026 10:05
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
@mstykow merged commit 71b1de6 into main May 6, 2026
15 checks passed
@mstykow deleted the verify/hadoop branch May 6, 2026 10:11