🐛 fix(search): align API result count with web UI#4937
🐛 fix(search): align API result count with web UI#4937gaborbernat wants to merge 6 commits intooracle:masterfrom
Conversation
|
@vladak is now ready |
|
Needs rebase. |
The REST API search endpoint returned hits.length (capped at hitsPerPage * cachePages, defaulting to 125) instead of Lucene's totalHits. In project-less mode this meant the API always reported at most 125 results regardless of actual matches, while the web UI correctly showed the true count. Additionally, the Lucene totalHitsThreshold was Short.MAX_VALUE (32,767), causing approximate counts for large result sets. The web UI already uses Integer.MAX_VALUE implicitly via the 3-arg searcher.search() overload, so this aligns the API to match. The endDocument field in the API response was derived from the grouped map's key count (unique file paths) rather than the actual document page size, making it incorrect when multiple results came from the same file. Fixes oracle#3239, fixes oracle#3170.
Cover the three bugs fixed in the previous commit: Bug 1 had testSearchReturnsTotalHitsNotCachedCount already. Bug 2 gets testTotalHitsIsExactForFullRetrieval which verifies that results() can retrieve every document using the count from search(), catching any inaccuracy from a low totalHitsThreshold. Bug 3 gets testResultCountAndEndDocument which hits the REST endpoint and asserts endDocument is derived from the document page size rather than from the grouped map key count.
|
@vladak done, thanks! |
|
The webapp is still returning different total hit count on each search when performed via UI, so it is hard to evaluate whether API search is equivalent to UI search. For example with my favorite AOSP test (indexer run with |
|
The reason for the non-stable hit count in the UI is in how search totals are produced in web search (thanks Codex): There is a way how to make the total hit count stable for the UI similarly to how this is proposed for the API: diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
index f8177ef2d..ee9a7bd18 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
@@ -61,7 +61,9 @@ import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TopFieldCollectorManager;
import org.apache.lucene.search.TopFieldDocs;
+import org.apache.lucene.search.TopScoreDocCollectorManager;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.search.spell.SuggestWord;
@@ -475,9 +477,17 @@ public class SearchHelper {
return this;
}
try {
- TopFieldDocs fdocs = searcher.search(query, start + maxItems, sort);
- totalHits = fdocs.totalHits.value;
- hits = fdocs.scoreDocs;
+ int resultWindow = start + maxItems;
+ TopDocs topDocs;
+ if (Sort.RELEVANCE.equals(sort)) {
+ topDocs = searcher.search(query,
+ new TopScoreDocCollectorManager(resultWindow, Integer.MAX_VALUE));
+ } else {
+ topDocs = searcher.search(query,
+ new TopFieldCollectorManager(sort, resultWindow, Integer.MAX_VALUE));
+ }
+ totalHits = topDocs.totalHits.value;
+ hits = topDocs.scoreDocs;
/*
* Determine if possibly a single-result redirect to xref isFor my test case that leads to stable 200483 hits in the UI, versus 200531 via the API which is pretty close. To fix #3239 this patch would be needed. The cons being:
Personally I find the varying result count in the UI confusing. |
|
Yeah, I agree, I'd consider it a bug that now is unreliable, it communicates the wrong information to the end user. |
|
Also, benchmarking the UI search latency before and after the change, the original was actually slighly slower (like 15%) for some reason. |
The search REST endpoint declared "maxresults" and "maxhitsperfile" as inline string literals, so test code had to duplicate the same literals to construct requests. Callers drifting out of sync with the controller would only surface as a silent default-value fallback at runtime. Promote both to QueryParameters so the controller and its tests share a single source of truth, following the convention already established for the other search parameters.
The web UI reported a fluctuating "of N" total when searches matched more than about a thousand documents, so the same query could show 20k-45k hits on reload while the REST API reported the true count. The drift came from SearchHelper.executeQuery using the searcher.search(Query, int, Sort) overload, whose hardcoded totalHitsThreshold of 1000 lets Lucene early-terminate via block-max WAND and surface TotalHits as a lower-bound estimate. Switch to explicit collector managers with Integer.MAX_VALUE as the threshold so totalHits is always exact, mirroring the pattern already used by SearchEngine for the REST path. Exact total counting does add work for very common terms, but consistency between UI and API wins over the saved scoring work on broad queries. The regression test uses a synthetic 1500-doc corpus with one doc carrying extreme term frequency, which is what block-max WAND needs to actually skip; with a flat corpus the default threshold does not trigger the bug even with more than a thousand hits.
|
@vladak pushed fixes, we should be good now. |
|
Also, when we merge this can we get a release cut? Thanks! |
That's my plan. Will take another look at the changes. |
Add SearchHelperStableCountTest to checkstyle Header suppressions since the file was not authored within Oracle's realm.
Reviewer feedback flagged the references to totalHitsThreshold and block-max WAND as both unclear (without the constructor in view) and likely to age poorly as Lucene evolves. Replace with descriptions of the observable behavior — approximate lower-bound count, score variance needed to reproduce the bug — without naming the underlying mechanism.
|
I have addressed the change requests. |
The REST API search endpoint (
/api/v1/search) reports different result counts than the web UI for identical queries. In project-less mode the API always capsresultCountathitsPerPage * cachePages(defaulting to 125) becauseSearchEngine.search()returns the collectedhits.lengthrather than Lucene'stotalHits. 🔍 This is the root cause of both #3239 and #3170.A second source of inaccuracy comes from using
Short.MAX_VALUE(32,767) as the LucenetotalHitsThresholdin all collector constructions. For queries exceeding that threshold,totalHits.valuebecomes an approximate lower bound. The web UI already usesInteger.MAX_VALUEimplicitly through the 3-argsearcher.search(query, n, sort)overload, so switching the API collectors to match eliminates the discrepancy with zero performance impact for the common case and marginal cost for very large result sets.The
endDocumentfield in the JSON response was also wrong, computed from the grouped result map's key count (unique file paths) instead of the actual document page size. A page of 10 documents spanning 5 files would reportendDocumentasstartDocIndex + 4instead ofstartDocIndex + 9. ✅ Now derived from the same page-size arithmetic used internally bySearchEngineWrapper.search().Fixes #3239. Fixes #3170.