
test: fix flaky https tests on windows #62451

Open

jasnell wants to merge 2 commits into nodejs:main from
jasnell:jasnell/hopefully-fix-flaky-windows-tests

Conversation

@jasnell
Member

@jasnell jasnell commented Mar 27, 2026

The test-https-server-options-incoming-message and
test-https-server-options-server-response tests can
be flaky on Windows.

Opencode/Opus is helping figure out why.

If this PR makes the tests non-flaky it's just a bandaid. There appears to be a race condition on cleanup on windows where shutting down connections races with shutting down the server. We'd still need to identify specifically where that race is occurring and where to fix it specifically. Right now, however, my goal is to unblock CI flakes.

The flakiness does seem generalized to any test-http(s)-* and not just one... which makes me strongly suspect the race is quite low in the stack, possibly even in libuv but that's yet to be determined precisely.

jasnell added 2 commits March 26, 2026 22:01
The test-https-server-options-incoming-message and
test-https-server-options-server-response tests can
be flaky on Windows.

Opencode/Opus helped figure out why.
test-http-agent-scheduling can be flaky on windows.
@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Mar 27, 2026
@jasnell jasnell requested a review from mcollina March 27, 2026 05:15
@nodejs-github-bot
Collaborator

@codecov

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.70%. Comparing base (53bcd11) to head (8297908).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #62451   +/-   ##
=======================================
  Coverage   89.70%   89.70%           
=======================================
  Files         678      678           
  Lines      207195   207195           
  Branches    39730    39730           
=======================================
+ Hits       185862   185868    +6     
- Misses      13438    13441    +3     
+ Partials     7895     7886    -9     

see 28 files with indirect coverage changes


assert.strictEqual(ports[ports.length - 1], port);
makeRequest(url, agent, common.mustCall((port) => {
assert.strictEqual(ports[ports.length - 1], port);
server.closeAllConnections();
Member


Why isn't agent.destroy() below sufficient?

Member Author


That's part of what I'm investigating. I'm running a stress test of this change on windows to determine if that deals with the flakiness. If it does, a follow up will be to figure out why the destroy isn't sufficient.

What's apparent is that there is a race condition on cleanup. Where, and exactly why, still needs to be determined, along with a long-term fix.

}, common.mustCall((res) => {
assert.strictEqual(res.statusCode, 200);
res.on('end', () => {
server.closeAllConnections();
Member

@lpinca lpinca Mar 27, 2026


What is preventing the socket from being closed normally? Is the keep-alive timeout longer than the test timeout? And if so, can keep-alive be disabled? I don't think it is needed for this test.

Member Author


Thinking through this... disabling keep-alive might address the issue for the various test-http(s)-* tests but there are tcp/tls tests also failing with the same crash that follow the same pattern, suggesting that the issue is a bit more fundamental than just using keep-alive.


@jasnell
Member Author

jasnell commented Mar 27, 2026

Hmm... the CI stress test job seems to not actually be working correctly on most of the Windows configurations.

@nodejs-github-bot
Collaborator

nodejs-github-bot commented Mar 27, 2026


@jasnell
Member Author

jasnell commented Mar 27, 2026

@joyeecheung or @addaleax ... curious if either of you have thoughts on the root cause here. The flakes are just test-http(s)-*, test-tls-*, or test-net-* tests in general on Windows. It's not a single or particular set of tests failing each time; they all appear to follow the same pattern, and all fail with the same error code. It's definitely a race condition somewhere. I'm thinking it's most likely in libuv, but that's not entirely clear. The closeAllConnections approach in this PR appears to resolve the flakiness in these particular tests, but it's just a bandaid.

@lpinca
Member

lpinca commented Mar 27, 2026

I don't have a Windows machine at hand at the moment but can you reproduce locally or is it only in CI? Did you try with autoSelectFamily disabled?

@jasnell
Member Author

jasnell commented Mar 27, 2026

I haven't even booted my Windows machine in over two years. Once I get it back up and updated I'll see if I can reproduce locally and, if so, I'll try to get a usable dump from it.


Labels

needs-ci PRs that need a full CI run. test Issues and PRs related to the tests.

3 participants