🐛 fix(search): align API result count with web UI by gaborbernat · Pull Request #4937 · oracle/opengrok

gaborbernat · 2026-04-19T04:22:54Z

The REST API search endpoint (/api/v1/search) reports different result counts than the web UI for identical queries. In project-less mode the API always caps resultCount at hitsPerPage * cachePages (defaulting to 125) because SearchEngine.search() returns the collected hits.length rather than Lucene's totalHits. 🔍 This is the root cause of both #3239 and #3170.

A second source of inaccuracy comes from using Short.MAX_VALUE (32,767) as the Lucene totalHitsThreshold in all collector constructions. For queries exceeding that threshold, totalHits.value becomes an approximate lower bound. The web UI already uses Integer.MAX_VALUE implicitly through the 3-arg searcher.search(query, n, sort) overload, so switching the API collectors to match eliminates the discrepancy with zero performance impact for the common case and marginal cost for very large result sets.

The endDocument field in the JSON response was also wrong, computed from the grouped result map's key count (unique file paths) instead of the actual document page size. A page of 10 documents spanning 5 files would report endDocument as startDocIndex + 4 instead of startDocIndex + 9. ✅ Now derived from the same page-size arithmetic used internally by SearchEngineWrapper.search().

Fixes #3239. Fixes #3170.

gaborbernat · 2026-04-21T21:48:28Z

@vladak is now ready

vladak · 2026-04-22T17:56:48Z

Needs rebase.

The REST API search endpoint returned hits.length (capped at hitsPerPage * cachePages, defaulting to 125) instead of Lucene's totalHits. In project-less mode this meant the API always reported at most 125 results regardless of actual matches, while the web UI correctly showed the true count. Additionally, the Lucene totalHitsThreshold was Short.MAX_VALUE (32,767), causing approximate counts for large result sets. The web UI already uses Integer.MAX_VALUE implicitly via the 3-arg searcher.search() overload, so this aligns the API to match. The endDocument field in the API response was derived from the grouped map's key count (unique file paths) rather than the actual document page size, making it incorrect when multiple results came from the same file. Fixes oracle#3239, fixes oracle#3170.

Cover the three bugs fixed in the previous commit: Bug 1 had testSearchReturnsTotalHitsNotCachedCount already. Bug 2 gets testTotalHitsIsExactForFullRetrieval which verifies that results() can retrieve every document using the count from search(), catching any inaccuracy from a low totalHitsThreshold. Bug 3 gets testResultCountAndEndDocument which hits the REST endpoint and asserts endDocument is derived from the document page size rather than from the grouped map key count.

gaborbernat · 2026-04-22T18:09:22Z

@vladak done, thanks!

vladak · 2026-04-23T11:25:52Z

The webapp is still returning different total hit count on each search when performed via UI, so it is hard to evaluate whether API search is equivalent to UI search. For example with my favorite AOSP test (indexer run with -s /var/opengrok/src -d /var/opengrok/data -P --economical -c /opt/homebrew/bin/ctags -W /var/opengrok/etc/configuration.xml -U http://localhost:8080/opengrok_web_war), searching for "google" yields between 20k and 45k hits when done via UI and with your changes always the resultCount in the API search http://localhost:8080/opengrok_web_war/api/v1/search?project=AOSP&full=google&searchall=true&start=0&maxresults=1 results is always 200531 in my case.

vladak · 2026-04-23T14:44:22Z

The reason for the non-stable hit count in the UI is in how search totals are produced in web search (thanks Codex): SearchHelper.executeQuery() uses searcher.search(query, start + maxItems, sort) and then prints TopFieldDocs.totalHits.value. In Lucene this value can be an estimated/lower-bound total unless total-hit tracking is explicitly forced, so the displayed “of N” can drift between requests (especially with relevance sorting).

There is a way how to make the total hit count stable for the UI similarly to how this is proposed for the API:

diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
index f8177ef2d..ee9a7bd18 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java
@@ -61,7 +61,9 @@ import org.apache.lucene.search.Sort;
 import org.apache.lucene.search.SortField;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TopFieldCollectorManager;
 import org.apache.lucene.search.TopFieldDocs;
+import org.apache.lucene.search.TopScoreDocCollectorManager;
 import org.apache.lucene.search.spell.DirectSpellChecker;
 import org.apache.lucene.search.spell.SuggestMode;
 import org.apache.lucene.search.spell.SuggestWord;
@@ -475,9 +477,17 @@ public class SearchHelper {
             return this;
         }
         try {
-            TopFieldDocs fdocs = searcher.search(query, start + maxItems, sort);
-            totalHits = fdocs.totalHits.value;
-            hits = fdocs.scoreDocs;
+            int resultWindow = start + maxItems;
+            TopDocs topDocs;
+            if (Sort.RELEVANCE.equals(sort)) {
+                topDocs = searcher.search(query,
+                        new TopScoreDocCollectorManager(resultWindow, Integer.MAX_VALUE));
+            } else {
+                topDocs = searcher.search(query,
+                        new TopFieldCollectorManager(sort, resultWindow, Integer.MAX_VALUE));
+            }
+            totalHits = topDocs.totalHits.value;
+            hits = topDocs.scoreDocs;
 
             /*
              * Determine if possibly a single-result redirect to xref is

For my test case that leads to stable 200483 hits in the UI, versus 200531 via the API which is pretty close.

To fix #3239 this patch would be needed.

The cons being:

Higher query cost on large result sets because Lucene must track total hits exactly.
Potentially increased latency and CPU usage for broad queries (especially very common terms).

Personally I find the varying result count in the UI confusing.

gaborbernat · 2026-04-23T15:00:13Z

Yeah, I agree, I'd consider it a bug that now is unreliable, it communicates the wrong information to the end user.

vladak · 2026-04-23T16:08:05Z

Also, benchmarking the UI search latency before and after the change, the original was actually slighly slower (like 15%) for some reason.

The search REST endpoint declared "maxresults" and "maxhitsperfile" as inline string literals, so test code had to duplicate the same literals to construct requests. Callers drifting out of sync with the controller would only surface as a silent default-value fallback at runtime. Promote both to QueryParameters so the controller and its tests share a single source of truth, following the convention already established for the other search parameters.

The web UI reported a fluctuating "of N" total when searches matched more than about a thousand documents, so the same query could show 20k-45k hits on reload while the REST API reported the true count. The drift came from SearchHelper.executeQuery using the searcher.search(Query, int, Sort) overload, whose hardcoded totalHitsThreshold of 1000 lets Lucene early-terminate via block-max WAND and surface TotalHits as a lower-bound estimate. Switch to explicit collector managers with Integer.MAX_VALUE as the threshold so totalHits is always exact, mirroring the pattern already used by SearchEngine for the REST path. Exact total counting does add work for very common terms, but consistency between UI and API wins over the saved scoring work on broad queries. The regression test uses a synthetic 1500-doc corpus with one doc carrying extreme term frequency, which is what block-max WAND needs to actually skip; with a flat corpus the default threshold does not trigger the bug even with more than a thousand hits.

gaborbernat · 2026-04-23T23:42:28Z

@vladak pushed fixes, we should be good now.

gaborbernat · 2026-04-27T16:24:53Z

Also, when we merge this can we get a release cut? Thanks!

vladak · 2026-04-28T10:42:13Z

Also, when we merge this can we get a release cut? Thanks!

That's my plan. Will take another look at the changes.

Add SearchHelperStableCountTest to checkstyle Header suppressions since the file was not authored within Oracle's realm.

Reviewer feedback flagged the references to totalHitsThreshold and block-max WAND as both unclear (without the constructor in view) and likely to age poorly as Lucene evolves. Replace with descriptions of the observable behavior — approximate lower-bound count, score variance needed to reproduce the bug — without naming the underlying mechanism.

gaborbernat · 2026-05-05T20:17:32Z

I have addressed the change requests.

oracle-contributor-agreement Bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Apr 19, 2026

vladak reviewed Apr 22, 2026

View reviewed changes

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/search/SearchEngineTest.java Outdated

gaborbernat added 2 commits April 22, 2026 11:05

gaborbernat force-pushed the api-diff branch from f7d03cb to 8088de8 Compare April 22, 2026 18:08

gaborbernat requested a review from vladak April 22, 2026 18:08

vladak reviewed Apr 23, 2026

View reviewed changes

Comment thread opengrok-web/src/test/java/org/opengrok/web/api/v1/controller/SearchControllerTest.java Outdated

gaborbernat added 2 commits April 23, 2026 16:40

vladak reviewed Apr 29, 2026

View reviewed changes

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/web/SearchHelperStableCountTest.java Outdated

Remove Oracle copyright from non-Oracle authored test

875145a

Add SearchHelperStableCountTest to checkstyle Header suppressions since the file was not authored within Oracle's realm.

vladak reviewed May 5, 2026

View reviewed changes

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/web/SearchHelperStableCountTest.java Outdated

vladak reviewed May 5, 2026

View reviewed changes

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/web/SearchHelperStableCountTest.java Outdated

vladak reviewed May 5, 2026

View reviewed changes

Comment thread opengrok-indexer/src/test/java/org/opengrok/indexer/web/SearchHelperStableCountTest.java Outdated

gaborbernat requested a review from vladak May 5, 2026 17:47

Conversation

gaborbernat commented Apr 19, 2026

Uh oh!

gaborbernat commented Apr 21, 2026

Uh oh!

Uh oh!

vladak commented Apr 22, 2026

Uh oh!

gaborbernat commented Apr 22, 2026

Uh oh!

Uh oh!

vladak commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vladak commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gaborbernat commented Apr 23, 2026

Uh oh!

vladak commented Apr 23, 2026

Uh oh!

gaborbernat commented Apr 23, 2026

Uh oh!

gaborbernat commented Apr 27, 2026

Uh oh!

vladak commented Apr 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gaborbernat commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vladak commented Apr 23, 2026 •

edited

Loading

vladak commented Apr 23, 2026 •

edited

Loading