Optimize diff algorithm's time complexity, part 1 #4

Open
QZGao wants to merge 4 commits into wikimedia:master from QZGao:optimization-without-differ

QZGao commented Jun 6, 2026

This PR partially addresses T342805 by removing several avoidable slow paths in WikiWho's revision analysis while keeping the existing difflib.Differ-based word matcher.

The main fix is in analyse_words_in_sentences(). The old word phase had quadratic behavior for large revisions:

  • The full Differ.compare(...) output was materialized with list(...).
  • For each current token, the code rescanned the diff list from the beginning.
  • On matched/deleted tokens, it then rescanned unmatched_words_prev to find the first unused previous Word object with the same value.
  • Diff entries were consumed by replacing them with '', but every token still started scanning from position 0.

For a worst-case Google Play revision with about 22k current tokens and 22k previous tokens, this word phase alone took about 36 seconds.
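The cost pattern described above can be illustrated with a small, simplified sketch. It is not the original WikiWho code: `rescan_from_start` is a hypothetical helper that only counts how many diff rows get inspected when every current token restarts its scan at row 0 and consumed rows are blanked rather than removed.

```python
from difflib import Differ


def rescan_from_start(prev_tokens, curr_tokens):
    """Count diff rows inspected when each current token rescans from row 0.

    Hypothetical illustration of the quadratic pattern, not the WikiWho code.
    """
    # The whole diff is materialized up front.
    diff = list(Differ().compare(prev_tokens, curr_tokens))
    rows_inspected = 0
    for token in curr_tokens:
        for i, row in enumerate(diff):
            rows_inspected += 1
            # Rows that were already consumed are '' and are skipped,
            # but they still cost one inspection each.
            if row and row[0] in " +" and row[2:] == token:
                diff[i] = ""  # consume the row; the list never shrinks
                break
    return rows_inspected


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(50)]
    print(rescan_from_start(tokens, tokens))
```

With two identical lists of n distinct tokens, token k has to step over k consumed rows before it finds its own, so the count grows as n(n+1)/2. The additional lookup in unmatched_words_prev described above is not modeled here and would add to that cost.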

This PR changes that phase to consume Differ().compare(text_prev, text_curr) once, in order, with prev_index and curr_index cursors. Differ already emits alignment-ordered rows, so the post-diff assignment can be handled directly:

| Diff tag | Action |
| --- | --- |
| ' ' | Reuse the next unmatched previous Word for the next current token slot |
| '-' | Mark the next previous Word as deleted/outbound |
| '+' | Create a new Word for the next current token slot |
| '?' | Skip the Differ hint line |
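A minimal sketch of the single-pass idea, assuming plain string tokens instead of WikiWho's Word objects. The function name `assign_tokens` and the action tuples are hypothetical; only `difflib.Differ` and the `prev_index`/`curr_index` cursors come from the description above.

```python
from difflib import Differ


def assign_tokens(prev_tokens, curr_tokens):
    """Walk the Differ output once, advancing one cursor per side.

    Returns (action, prev_index, curr_index) tuples; a sketch only.
    """
    actions = []
    prev_index = 0
    curr_index = 0
    for row in Differ().compare(prev_tokens, curr_tokens):
        tag = row[0]
        if tag == " ":
            # Same token on both sides: reuse the previous word.
            actions.append(("reuse", prev_index, curr_index))
            prev_index += 1
            curr_index += 1
        elif tag == "-":
            # Token exists only in the previous revision: mark it deleted.
            actions.append(("delete", prev_index, None))
            prev_index += 1
        elif tag == "+":
            # Token exists only in the current revision: create a new word.
            actions.append(("create", None, curr_index))
            curr_index += 1
        # '?' rows are Differ hint lines and carry no token, so they are skipped.
    return actions


if __name__ == "__main__":
    for action in assign_tokens(["the", "quick", "fox"], ["the", "slow", "fox"]):
        print(action)
```

Because each row advances at most one cursor per side, the post-diff assignment is linear in the number of diff rows; the cost of Differ itself is unchanged.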

To make that possible, the code now builds curr_slots, a flat ordered list of (sentence_curr, word_value) pairs, while building text_curr. That preserves the destination sentence for each current token without later rediscovering it by looping back through unmatched_sentences_curr.

The pure-addition case also uses curr_slots, removing a duplicated traversal over current sentences and split tokens.
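The curr_slots construction might look roughly like the sketch below. Sentences are modeled as plain dicts with a "tokens" list, which is an assumption made for illustration; the real code works with WikiWho's sentence objects, and `build_curr_slots` is a hypothetical name.

```python
def build_curr_slots(sentences):
    """Flatten sentences into text_curr plus a parallel list of slots.

    Each slot is a (sentence, word_value) pair, so the destination sentence
    of the i-th current token is available as curr_slots[i][0].
    """
    text_curr = []
    curr_slots = []
    for sentence in sentences:
        for word_value in sentence["tokens"]:
            text_curr.append(word_value)
            curr_slots.append((sentence, word_value))
    return text_curr, curr_slots


if __name__ == "__main__":
    first = {"id": 0, "tokens": ["a", "b"]}
    second = {"id": 1, "tokens": ["c"]}
    text_curr, curr_slots = build_curr_slots([first, second])
    # Pure-addition case: every slot receives a new word, in order.
    for sentence, word_value in curr_slots:
        print(sentence["id"], word_value)
```

Keeping the two lists index-aligned is what lets a curr_index cursor from the diff pass address the right sentence directly.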

Additional performance fixes:

  • Add spam_hashes_set for O(1) spam hash membership checks while preserving the existing spam_hashes list.
  • Centralize spam revision recording in _add_spam_revision(...).
  • Make JSON revision hash calculation lazy: calculate_hash(text) now only runs when the API response does not include sha1.
  • Replace self.temp.append(...); self.temp.count(...) duplicate tracking with local counters for repeated paragraph and sentence hashes.
  • Move TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module scope so the long symbol list and formatted replacements are built once instead of on every tokenization call.
  • Combine empty-token filtering and pipe placeholder restoration into one list comprehension in split_into_tokens().
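Three of the smaller fixes above can be sketched as follows. All names except spam_hashes, spam_hashes_set and sha1 are hypothetical, and hashlib.sha1 is only a stand-in for WikiWho's calculate_hash, whose actual hash function is not stated here.

```python
import hashlib
from collections import Counter


class SpamRegistry:
    """Keeps the ordered list and mirrors it in a set for O(1) lookups."""

    def __init__(self):
        self.spam_hashes = []         # existing list, order preserved
        self.spam_hashes_set = set()  # mirror used only for membership tests

    def add(self, rev_hash):
        self.spam_hashes.append(rev_hash)
        self.spam_hashes_set.add(rev_hash)

    def __contains__(self, rev_hash):
        return rev_hash in self.spam_hashes_set


def revision_hash(revision, text):
    """Hash the text only when the API response carries no sha1."""
    sha1 = revision.get("sha1")
    if sha1:
        return sha1
    # Stand-in for calculate_hash(text); assumed, not the real function.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def occurrence_indexes(hashes):
    """For each hash, return how many times it has been seen so far.

    A local Counter replaces the append-then-count pattern, which rescans
    the whole list on every call.
    """
    seen = Counter()
    result = []
    for h in hashes:
        seen[h] += 1
        result.append(seen[h])
    return result


if __name__ == "__main__":
    registry = SpamRegistry()
    registry.add("h1")
    print("h1" in registry, occurrence_indexes(["p", "q", "p"]))
```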

Testing:

  • Tested against revision 1296988276 of the Google Play article, with about 22,189 current tokens and 22,444 previous tokens.
  • The optimized word phase took about 0.43 seconds, compared to about 36 seconds before.
  • Correctness checks on that revision matched between the old and new implementations for:
    • matched_prev count
    • vandalism flag
    • token_id_delta
    • tokens_delta
    • per-sentence word totals

I also tested the full Google Play page locally against already-fetched revision history. The WikiWho package processing itself still takes roughly 60 seconds for that page, which reflects the cost of the current one-request-calculates-all design. Further improvement for very large pages such as Google Play or Barack Obama would require larger architectural changes in #2.
