Optimize diff algorithm's time complexity, part 1 by QZGao · Pull Request #4 · wikimedia/WikiWho

QZGao · 2026-06-06T00:48:52Z

This PR partially addresses T342805 by removing several avoidable slow paths in WikiWho's revision analysis while keeping the existing difflib.Differ-based word matcher.

The main fix is in analyse_words_in_sentences(). The old word phase had quadratic behavior for large revisions:

- The full Differ.compare(...) output was materialized with list(...).
- For each current token, the code rescanned the diff list from the beginning.
- On matched/deleted tokens, it then rescanned unmatched_words_prev to find the first unused previous Word object with the same value.
- Diff entries were consumed by replacing them with '', but every token still started scanning from position 0.

For a worst-case Google Play revision with about 22k current tokens and 22k previous tokens, this word phase alone took about 36 seconds.

This PR changes that phase to consume Differ().compare(text_prev, text_curr) once, in order, with prev_index and curr_index cursors. Differ already emits alignment-ordered rows, so the post-diff assignment can be handled directly:

| diff tag | action |
| --- | --- |
| `' '` | reuse the next unmatched previous `Word` for the next current token slot |
| `'-'` | mark the next previous `Word` as deleted/outbound |
| `'+'` | create a new `Word` for the next current token slot |
| `'?'` | skip the `Differ` hint line |
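
The tag handling above can be sketched as a single ordered pass. This is a minimal illustration, not the code in this PR: `assign_words` and the `("reused", i)` / `("new", None)` tuples are hypothetical stand-ins for the real `Word` bookkeeping, and `text_prev` / `text_curr` are assumed to be plain lists of token strings.

```python
from difflib import Differ


def assign_words(text_prev, text_curr):
    """Walk the Differ output once, advancing two cursors.

    Returns (assignments, deleted):
      assignments: one entry per current token, in order, either
                   ("reused", index_into_text_prev) or ("new", None)
      deleted:     indexes into text_prev that have no current counterpart
    """
    prev_index = 0  # next previous token that has not been consumed yet
    curr_index = 0  # next current token slot to fill
    assignments = []
    deleted = []

    # compare() is a generator, so rows are consumed lazily and in order.
    for row in Differ().compare(text_prev, text_curr):
        tag = row[0]
        if tag == ' ':
            # Token present on both sides: reuse the previous word.
            assignments.append(("reused", prev_index))
            prev_index += 1
            curr_index += 1
        elif tag == '-':
            # Token only in the previous revision: mark it deleted.
            deleted.append(prev_index)
            prev_index += 1
        elif tag == '+':
            # Token only in the current revision: a new word is needed.
            assignments.append(("new", None))
            curr_index += 1
        # '?' rows are intraline hints and carry no token, so they are skipped.

    # Every token on both sides should have been visited exactly once.
    assert prev_index == len(text_prev)
    assert curr_index == len(text_curr)
    return assignments, deleted


assignments, deleted = assign_words(
    ["the", "quick", "brown", "fox"],
    ["the", "slow", "brown", "fox", "jumps"],
)
```

Because each diff row advances at least one cursor and no row is revisited, the post-diff assignment is linear in the number of rows, in contrast to restarting the scan from position 0 for every current token.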

To make that possible, the code now builds curr_slots, a flat ordered list of (sentence_curr, word_value) pairs, while building text_curr. That preserves the destination sentence for each current token without later rediscovering it by looping back through unmatched_sentences_curr.

The pure-addition case also uses curr_slots, removing a duplicated traversal over current sentences and split tokens.
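
A rough sketch of the `curr_slots` idea, under stated assumptions: sentences are modelled here as plain dicts with hypothetical `"value"` and `"words"` keys, `tokenize` stands in for `split_into_tokens()`, and `add_all_as_new` is an invented helper name for the pure-addition path. The real classes and fields in WikiWho may differ.

```python
def build_curr_slots(unmatched_sentences_curr, tokenize):
    """Build the flat token list and remember each token's destination sentence."""
    curr_slots = []  # ordered (sentence_curr, word_value) pairs
    for sentence_curr in unmatched_sentences_curr:
        for word_value in tokenize(sentence_curr["value"]):
            curr_slots.append((sentence_curr, word_value))
    # text_curr is what gets handed to the differ; slot i describes token i.
    text_curr = [word_value for _, word_value in curr_slots]
    return curr_slots, text_curr


def add_all_as_new(curr_slots):
    """Pure-addition case: with no previous words to match, every slot is new."""
    for sentence_curr, word_value in curr_slots:
        sentence_curr["words"].append({"value": word_value, "is_new": True})


sentences = [
    {"value": "a b", "words": []},
    {"value": "c", "words": []},
]
curr_slots, text_curr = build_curr_slots(sentences, str.split)
add_all_as_new(curr_slots)
```

Since slot `i` lines up with `text_curr[i]`, a current-side cursor is enough to find the destination sentence for any diff row, with no second loop over the sentences.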

Additional performance fixes:

- Add spam_hashes_set for O(1) spam hash membership checks while preserving the existing spam_hashes list.
- Centralize spam revision recording in _add_spam_revision(...).
- Make JSON revision hash calculation lazy: calculate_hash(text) now only runs when the API response does not include sha1.
- Replace self.temp.append(...); self.temp.count(...) duplicate tracking with local counters for repeated paragraph and sentence hashes.
- Move TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module scope so the long symbol list and formatted replacements are built once instead of on every tokenization call.
- Combine empty-token filtering and pipe placeholder restoration into one list comprehension in split_into_tokens().
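
Three of these patterns can be illustrated in isolation. The sketch below is hypothetical and only mirrors the ideas: `calculate_hash` here is a stand-in (the project's own hash function may differ), and `revision_hash`, `occurrence_numbers`, and `add_spam_hash` are invented names rather than functions from this PR.

```python
import hashlib


def calculate_hash(text):
    # Stand-in for the project's hash helper; assumed for this sketch only.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def revision_hash(revision, text):
    """Lazy hashing: only hash the text when the API response lacks 'sha1'.

    Writing revision.get("sha1", calculate_hash(text)) would evaluate the
    default argument eagerly, hashing the text even when 'sha1' is present.
    """
    sha1 = revision.get("sha1")
    return sha1 if sha1 is not None else calculate_hash(text)


def occurrence_numbers(hashes):
    """For each hash, report how many times it has been seen so far.

    A dict counter replaces the list.append(...) + list.count(...) pattern,
    which rescans the whole list on every call.
    """
    seen = {}
    result = []
    for h in hashes:
        seen[h] = seen.get(h, 0) + 1
        result.append(seen[h])
    return result


# A set kept alongside the list: the list preserves order for existing
# callers, the set answers membership checks in O(1) on average.
spam_hashes = []
spam_hashes_set = set()


def add_spam_hash(h):
    spam_hashes.append(h)
    spam_hashes_set.add(h)
```

Keeping both containers in sync from one helper is the same motivation as centralizing spam recording: a single write path means the list and the set cannot drift apart.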

Testing:

Tested against revision 1296988276 of the Google Play article, with about 22,189 current tokens and 22,444 previous tokens.
The optimized word phase took about 0.43 seconds, compared to about 36 seconds before.
Correctness checks on that revision matched between the old and new implementations for:
- matched_prev count
- vandalism flag
- token_id_delta
- tokens_delta
- per-sentence word totals

I also tested the full Google Play page locally against already-fetched revision history. The WikiWho package processing itself still takes roughly 60 seconds for that page, which reflects the cost of the current one-request-calculates-all design. Further improvement for very large pages such as Google Play or Barack Obama would require larger architectural changes in #2.

QZGao added 4 commits April 24, 2026 01:56

- Optimize WikiWho diff algorithm (52b1642)
- Merge remote-tracking branch 'origin' into optimization (1ad7ee0)
- Multiple small fixes (359337c)
- Move tokenizer to global (89ca645)

QZGao mentioned this pull request Jun 6, 2026

Optimize diff algorithm's time complexity, part 2 #2

Open

QZGao wants to merge 4 commits into wikimedia:master from QZGao:optimization-without-differ
