🐛 fix(search): align API result count with web UI #4937

Open
gaborbernat wants to merge 6 commits into oracle:master from gaborbernat:api-diff

Conversation

@gaborbernat
Contributor

The REST API search endpoint (/api/v1/search) reports different result counts than the web UI for identical queries. In project-less mode the API always caps resultCount at hitsPerPage * cachePages (defaulting to 125) because SearchEngine.search() returns the collected hits.length rather than Lucene's totalHits. 🔍 This is the root cause of both #3239 and #3170.
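The cap described above can be sketched in plain Java. This is only an illustration of the arithmetic, not the `SearchEngine` code: the class and method names are hypothetical, and the split of 125 into 25 hits per page times 5 cache pages is an assumption (the description only states the product).

```java
import java.util.stream.IntStream;

public class ResultCountSketch {
    // Assumed defaults; the description only says the product defaults to 125.
    static final int HITS_PER_PAGE = 25;
    static final int CACHE_PAGES = 5;

    /** Models the old behavior: the count is the length of the collected window. */
    static int collectedCount(int totalMatches) {
        int window = HITS_PER_PAGE * CACHE_PAGES;
        // Only the first `window` hits are ever collected into the array.
        int[] hits = IntStream.range(0, Math.min(totalMatches, window)).toArray();
        return hits.length;
    }

    /** Models the fixed behavior: the count is the number of matching documents. */
    static int totalCount(int totalMatches) {
        return totalMatches;
    }

    public static void main(String[] args) {
        System.out.println(collectedCount(200_531)); // capped at the window size
        System.out.println(totalCount(200_531));     // the full match count
    }
}
```

Any query matching more documents than the window would report the window size under the old computation, which is why the API appeared stuck at 125.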

A second source of inaccuracy comes from using Short.MAX_VALUE (32,767) as the Lucene totalHitsThreshold in all collector constructions. For queries exceeding that threshold, totalHits.value becomes an approximate lower bound. The web UI already uses Integer.MAX_VALUE implicitly through the 3-arg searcher.search(query, n, sort) overload, so switching the API collectors to match eliminates the discrepancy with zero performance impact for the common case and marginal cost for very large result sets.

The endDocument field in the JSON response was also wrong, computed from the grouped result map's key count (unique file paths) instead of the actual document page size. A page of 10 documents spanning 5 files would report endDocument as startDocIndex + 4 instead of startDocIndex + 9. ✅ Now derived from the same page-size arithmetic used internally by SearchEngineWrapper.search().
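The example in the paragraph above (10 documents spanning 5 files) can be worked through with a small sketch. The class, the method names, and the round-robin grouping are hypothetical and exist only to reproduce the arithmetic; the real controller code is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EndDocumentSketch {
    /** Old computation: based on the number of distinct file paths in the grouped map. */
    static int endFromGroupedKeys(int startDocIndex, Map<String, List<Integer>> grouped) {
        return startDocIndex + grouped.size() - 1;
    }

    /** New computation: based on the number of documents on the page. */
    static int endFromPageSize(int startDocIndex, int pageSize) {
        return startDocIndex + pageSize - 1;
    }

    /** Builds a page of docCount documents spread round-robin over fileCount paths. */
    static Map<String, List<Integer>> groupByFile(int docCount, int fileCount) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (int doc = 0; doc < docCount; doc++) {
            grouped.computeIfAbsent("/file" + (doc % fileCount), k -> new ArrayList<>()).add(doc);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = groupByFile(10, 5);
        System.out.println(endFromGroupedKeys(0, grouped)); // startDocIndex + 4
        System.out.println(endFromPageSize(0, 10));         // startDocIndex + 9
    }
}
```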

Fixes #3239. Fixes #3170.

@oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) Apr 19, 2026
@gaborbernat
Contributor Author

@vladak is now ready

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/search/SearchEngineTest.java Outdated
@vladak
Member

vladak commented Apr 22, 2026

Needs rebase.

The REST API search endpoint returned hits.length (capped at
hitsPerPage * cachePages, defaulting to 125) instead of Lucene's
totalHits. In project-less mode this meant the API always reported
at most 125 results regardless of actual matches, while the web UI
correctly showed the true count.

Additionally, the Lucene totalHitsThreshold was Short.MAX_VALUE
(32,767), causing approximate counts for large result sets. The
web UI already uses Integer.MAX_VALUE implicitly via the 3-arg
searcher.search() overload, so this aligns the API to match.

The endDocument field in the API response was derived from the
grouped map's key count (unique file paths) rather than the actual
document page size, making it incorrect when multiple results
came from the same file.

Fixes oracle#3239, fixes oracle#3170.

Cover the three bugs fixed in the previous commit:

Bug 1 had testSearchReturnsTotalHitsNotCachedCount already.
Bug 2 gets testTotalHitsIsExactForFullRetrieval which verifies
that results() can retrieve every document using the count from
search(), catching any inaccuracy from a low totalHitsThreshold.
Bug 3 gets testResultCountAndEndDocument which hits the REST
endpoint and asserts endDocument is derived from the document
page size rather than from the grouped map key count.
@gaborbernat
Contributor Author

@vladak done, thanks!

@vladak
Member

vladak commented Apr 23, 2026

The webapp still returns a different total hit count on each search performed via the UI, so it is hard to evaluate whether API search is equivalent to UI search. For example, with my favorite AOSP test (indexer run with -s /var/opengrok/src -d /var/opengrok/data -P --economical -c /opt/homebrew/bin/ctags -W /var/opengrok/etc/configuration.xml -U http://localhost:8080/opengrok_web_war), searching for "google" yields between 20k and 45k hits via the UI, whereas with your changes the resultCount returned by the API search http://localhost:8080/opengrok_web_war/api/v1/search?project=AOSP&full=google&searchall=true&start=0&maxresults=1 is always 200531 in my case.

@vladak
Member

vladak commented Apr 23, 2026

The reason for the non-stable hit count in the UI is in how search totals are produced in web search (thanks Codex): SearchHelper.executeQuery() uses searcher.search(query, start + maxItems, sort) and then prints TopFieldDocs.totalHits.value. In Lucene this value can be an estimated/lower-bound total unless total-hit tracking is explicitly forced, so the displayed “of N” can drift between requests (especially with relevance sorting).

There is a way to make the total hit count stable for the UI, similar to what is proposed for the API:

diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
index f8177ef2d..ee9a7bd18 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
@@ -61,7 +61,9 @@ import org.apache.lucene.search.Sort;
 import org.apache.lucene.search.SortField;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TopFieldCollectorManager;
 import org.apache.lucene.search.TopFieldDocs;
+import org.apache.lucene.search.TopScoreDocCollectorManager;
 import org.apache.lucene.search.spell.DirectSpellChecker;
 import org.apache.lucene.search.spell.SuggestMode;
 import org.apache.lucene.search.spell.SuggestWord;
@@ -475,9 +477,17 @@ public class SearchHelper {
             return this;
         }
         try {
-            TopFieldDocs fdocs = searcher.search(query, start + maxItems, sort);
-            totalHits = fdocs.totalHits.value;
-            hits = fdocs.scoreDocs;
+            int resultWindow = start + maxItems;
+            TopDocs topDocs;
+            if (Sort.RELEVANCE.equals(sort)) {
+                topDocs = searcher.search(query,
+                        new TopScoreDocCollectorManager(resultWindow, Integer.MAX_VALUE));
+            } else {
+                topDocs = searcher.search(query,
+                        new TopFieldCollectorManager(sort, resultWindow, Integer.MAX_VALUE));
+            }
+            totalHits = topDocs.totalHits.value;
+            hits = topDocs.scoreDocs;
 
             /*
              * Determine if possibly a single-result redirect to xref is

For my test case that leads to a stable 200483 hits in the UI, versus 200531 via the API, which is pretty close.

To fix #3239 this patch would be needed.

The cons are:

  • Higher query cost on large result sets because Lucene must track total hits exactly.
  • Potentially increased latency and CPU usage for broad queries (especially very common terms).

Personally I find the varying result count in the UI confusing.

@gaborbernat
Contributor Author

Yeah, I agree. I'd consider the current behavior a bug: the count is unreliable and communicates the wrong information to the end user.

@vladak
Member

vladak commented Apr 23, 2026

Also, benchmarking the UI search latency before and after the change, the original was actually slightly slower (about 15%) for some reason.

The search REST endpoint declared "maxresults" and "maxhitsperfile"
as inline string literals, so test code had to duplicate the same
literals to construct requests. Callers drifting out of sync with
the controller would only surface as a silent default-value fallback
at runtime.

Promote both to QueryParameters so the controller and its tests share
a single source of truth, following the convention already established
for the other search parameters.
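The silent fallback mentioned in this commit message can be illustrated with a minimal sketch. Only the literal values "maxresults" and "maxhitsperfile" come from the commit message; the constant names, the helper method, and the default of 1000 are hypothetical and are not taken from the actual `QueryParameters` class.

```java
import java.util.Map;

public class QueryParameterSketch {
    // Hypothetical constant names; only the string values appear in the commit message.
    static final String MAX_RESULTS_PARAM = "maxresults";
    static final String MAX_HITS_PER_FILE_PARAM = "maxhitsperfile";
    // Assumed default, chosen only for the illustration.
    static final int DEFAULT_MAX_RESULTS = 1000;

    /** Reads an integer parameter, falling back silently when it is absent. */
    static int intParam(Map<String, String> request, String name, int defaultValue) {
        String raw = request.get(name);
        return raw == null ? defaultValue : Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        // A caller whose duplicated literal has drifted: the value is ignored without any error.
        Map<String, String> drifted = Map.of("maxResults", "1");
        System.out.println(intParam(drifted, MAX_RESULTS_PARAM, DEFAULT_MAX_RESULTS));

        // A caller that builds the request from the shared constant stays in sync.
        Map<String, String> shared = Map.of(MAX_RESULTS_PARAM, "1");
        System.out.println(intParam(shared, MAX_RESULTS_PARAM, DEFAULT_MAX_RESULTS));
    }
}
```

With a shared constant, a rename breaks compilation in both the controller and the tests instead of changing behavior quietly at runtime.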

The web UI reported a fluctuating "of N" total when searches matched
more than about a thousand documents, so the same query could show
20k-45k hits on reload while the REST API reported the true count.
The drift came from SearchHelper.executeQuery using the
searcher.search(Query, int, Sort) overload, whose hardcoded
totalHitsThreshold of 1000 lets Lucene early-terminate via block-max
WAND and surface TotalHits as a lower-bound estimate.

Switch to explicit collector managers with Integer.MAX_VALUE as the
threshold so totalHits is always exact, mirroring the pattern already
used by SearchEngine for the REST path. Exact total counting does add
work for very common terms, but consistency between UI and API wins
over the saved scoring work on broad queries.
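The difference between an exact count and a lower-bound estimate can be shown with a deliberately simplified toy model. It is not Lucene's algorithm: it just stops counting once the threshold has been exceeded, which always produces the same lower bound, whereas the real collector skips non-competitive documents and can report different values between requests. All names are hypothetical.

```java
public class TotalHitsSketch {
    /** A hit count plus a flag saying whether it is exact or only a lower bound. */
    static final class Total {
        final int value;
        final boolean exact;

        Total(int value, boolean exact) {
            this.value = value;
            this.exact = exact;
        }
    }

    /**
     * Toy model: count matches one by one, but give up exact counting as soon
     * as more than `threshold` matches have been seen.
     */
    static Total count(int matches, int threshold) {
        int counted = 0;
        for (int i = 0; i < matches; i++) {
            if (counted > threshold) {
                return new Total(counted, false); // lower bound only
            }
            counted++;
        }
        return new Total(counted, true);
    }

    public static void main(String[] args) {
        Total estimated = count(1500, 1000);
        Total exact = count(1500, Integer.MAX_VALUE);
        System.out.println(estimated.value + " exact=" + estimated.exact);
        System.out.println(exact.value + " exact=" + exact.exact);
    }
}
```

With the threshold set to `Integer.MAX_VALUE` the early exit can never trigger, so the count is always exact, at the price of visiting every match.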

The regression test uses a synthetic 1500-doc corpus with one doc
carrying extreme term frequency, which is what block-max WAND needs
to actually skip; with a flat corpus the default threshold does not
trigger the bug even with more than a thousand hits.
@gaborbernat
Copy link
Copy Markdown
Contributor Author

@vladak pushed fixes, we should be good now.

@gaborbernat
Contributor Author

Also, when we merge this can we get a release cut? Thanks!

@vladak
Member

vladak commented Apr 28, 2026

> Also, when we merge this can we get a release cut? Thanks!

That's my plan. Will take another look at the changes.

Add SearchHelperStableCountTest to checkstyle Header suppressions
since the file was not authored within Oracle's realm.

Reviewer feedback flagged the references to totalHitsThreshold and
block-max WAND as both unclear (without the constructor in view) and
likely to age poorly as Lucene evolves.

Replace with descriptions of the observable behavior — approximate
lower-bound count, score variance needed to reproduce the bug —
without naming the underlying mechanism.
@gaborbernat gaborbernat requested a review from vladak May 5, 2026 17:47
@gaborbernat
Contributor Author

I have addressed the change requests.

Development

Successfully merging this pull request may close these issues.

Search API returns different results from Web UI
UI and API /search return vastly different result sizes in project-less mode

2 participants