Optimize diff algorithm's time complexity, part 1 #4
Open
QZGao wants to merge 4 commits into
This PR partially addresses T342805 by removing several avoidable slow paths in WikiWho's revision analysis while keeping the existing `difflib.Differ`-based word matcher.

The main fix is in `analyse_words_in_sentences()`. The old word phase had quadratic behavior for large revisions:

- `Differ.compare(...)` output was materialized with `list(...)`.
- […] `unmatched_words_prev` to find the first unused previous `Word` object with the same value.
- […] `''`, but every token still started scanning from position 0.

For a worst-case Google Play revision with about 22k current tokens and 22k previous tokens, this word phase alone took about 36 seconds.
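To illustrate why a lookup that restarts at position 0 for every token is quadratic, here is a rough sketch. It is not the project's code: `first_unused_index` and `match_by_rescanning` are hypothetical names, and plain strings stand in for `Word` objects. The only details taken from the description above are the scan from position 0 and the `''` placeholder for used entries.

```python
def first_unused_index(values, target):
    """Linear scan from position 0 for the first entry equal to target."""
    for i, value in enumerate(values):
        if value == target:
            return i
    return None


def match_by_rescanning(prev_values, curr_values):
    """Pair each current token with the first unused previous token of equal value.

    Used entries are blanked to '' but stay in the list, so the list never
    shrinks and every lookup walks it again from the start. With n previous
    and m current tokens this costs O(n * m) comparisons in the worst case.
    Tokens are assumed to be non-empty strings.
    """
    remaining = list(prev_values)
    matches = []
    for curr_pos, value in enumerate(curr_values):
        prev_pos = first_unused_index(remaining, value)
        if prev_pos is not None:
            remaining[prev_pos] = ''  # mark as used without removing it
            matches.append((prev_pos, curr_pos))
    return matches
```

At roughly 22k tokens on each side, a worst case of this shape is on the order of 22,000 × 22,000 ≈ 4.8 × 10^8 comparisons, which is consistent with a word phase measured in tens of seconds.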
This PR changes that phase to consume `Differ().compare(text_prev, text_curr)` once, in order, with `prev_index` and `curr_index` cursors. `Differ` already emits alignment-ordered rows, so the post-diff assignment can be handled directly:

- `' '` […] `Word` for the next current token slot
- `'-'` […] `Word` as deleted/outbound
- `'+'` […] `Word` for the next current token slot
- `'?'` […] `Differ` hint line

To make that possible, the code now builds `curr_slots`, a flat ordered list of `(sentence_curr, word_value)` pairs, while building `text_curr`. That preserves the destination sentence for each current token without later rediscovering it by looping back through `unmatched_sentences_curr`.

The pure-addition case also uses `curr_slots`, removing a duplicated traversal over current sentences and split tokens.

Additional performance fixes:
- […] `spam_hashes_set` for O(1) spam hash membership checks while preserving the existing `spam_hashes` list.
- […] `_add_spam_revision(...)`.
- `calculate_hash(text)` now only runs when the API response does not include `sha1`.
- […] `self.temp.append(...); self.temp.count(...)` duplicate tracking with local counters for repeated paragraph and sentence hashes.
- […] `TOKEN_SYMBOLS` and `TOKEN_SYMBOL_REPLACEMENTS` to module scope so the long symbol list and formatted replacements are built once instead of on every tokenization call.
- […] `split_into_tokens()`.

Testing:
- […] `matched_prev` count, `token_id_delta`, `tokens_delta`

I also tested the full Google Play page locally against already-fetched revision history. The WikiWho package processing itself still takes roughly 60 seconds for that page, which reflects the cost of the current one-request-calculates-all design. Further improvement for very large pages such as Google Play or Barack Obama would require larger architectural changes in #2.
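For reviewers who want a concrete picture of the single-pass word phase described earlier, here is a minimal sketch of the cursor idea. It is an illustration under assumptions, not the code in this PR: `assign_words` is a hypothetical function, sentences are represented by plain labels, and the outputs are index tuples rather than `Word` objects. Only `Differ().compare(...)`, the row prefixes, and the names `prev_index`, `curr_index`, and `curr_slots` come from the description.

```python
from difflib import Differ


def assign_words(text_prev, curr_slots):
    """Walk Differ rows once, advancing one cursor per side.

    text_prev  -- list of previous token values
    curr_slots -- ordered list of (sentence, word_value) pairs for the
                  current revision (sentence is any label in this sketch)
    """
    text_curr = [value for _, value in curr_slots]
    prev_index = 0
    curr_index = 0
    reused = []   # (sentence, prev_index, curr_index) for unchanged tokens
    deleted = []  # prev_index of tokens that no longer appear
    added = []    # (sentence, curr_index) for newly introduced tokens

    for row in Differ().compare(text_prev, text_curr):
        tag = row[0]
        if tag == ' ':
            sentence, _ = curr_slots[curr_index]
            reused.append((sentence, prev_index, curr_index))
            prev_index += 1
            curr_index += 1
        elif tag == '-':
            deleted.append(prev_index)
            prev_index += 1
        elif tag == '+':
            sentence, _ = curr_slots[curr_index]
            added.append((sentence, curr_index))
            curr_index += 1
        # Rows starting with '?' are Differ hint lines and move no cursor.

    return reused, deleted, added


if __name__ == "__main__":
    slots = [("s1", "the"), ("s1", "fox"), ("s2", "jumps")]
    print(assign_words(["the", "quick", "fox"], slots))
```

Because each row advances at most one cursor per side, the work after the diff is linear in the number of rows, and the destination sentence is read directly from `curr_slots[curr_index]` instead of being searched for.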