test: fix flaky https tests on windows#62451
Conversation
The test-https-server-options-incoming-message and test-https-server-options-server-response tests can be flaky on Windows. Opencode/Opus helped figure out why.
test-http-agent-scheduling can be flaky on windows.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #62451 +/- ##
=======================================
Coverage 89.70% 89.70%
=======================================
Files 678 678
Lines 207195 207195
Branches 39730 39730
=======================================
+ Hits 185862 185868 +6
- Misses 13438 13441 +3
+ Partials 7895 7886 -9 🚀 New features to boost your workflow:
|
| assert.strictEqual(ports[ports.length - 1], port); | ||
| makeRequest(url, agent, common.mustCall((port) => { | ||
| assert.strictEqual(ports[ports.length - 1], port); | ||
| server.closeAllConnections(); |
There was a problem hiding this comment.
Why isn't agent.destroy() below sufficient?
There was a problem hiding this comment.
That's part of what I'm investigating. I'm running a stress test of this change on windows to determine if that deals with the flakiness. If it does, a follow up will be to figure out why the destroy isn't sufficient.
What's apparent is that there is a race condition on cleanup. Where and exactly why still needs to be determined along with a long term fix.
| }, common.mustCall((res) => { | ||
| assert.strictEqual(res.statusCode, 200); | ||
| res.on('end', () => { | ||
| server.closeAllConnections(); |
There was a problem hiding this comment.
What is preventing the socket from being closed normally? Is the keep-alive timeout longer than the test timeout? and if it is the case, can keep-alive be disable? I don't think it is needed for this test.
There was a problem hiding this comment.
Thinking through this... disabling keep-alive might address the issue for the various test-http(s)-* tests but there are tcp/tls tests also failing with the same crash that follow the same pattern, suggesting that the issue is a bit more fundamental than just using keep-alive.
This comment was marked as outdated.
This comment was marked as outdated.
|
Hmm... the CI stress test job seems to not actually be working correctly on most of the Windows configurations. |
This comment was marked as outdated.
This comment was marked as outdated.
|
@joyeecheung or @addaleax ... curious if either of you have thoughts on the root cause here. The flakes are just |
|
I don't have a Windows machine at hand at the moment but can you reproduce locally or is it only in CI? Did you try with |
|
I haven't even booted my windows machine in over two years. Once I get it back up and updated I'll see if I can reproduce locally and, if so, I'll try to get a usable dump from from it. |
The test-https-server-options-incoming-message and
test-https-server-options-server-response tests can
be flaky on Windows.
Opencode/Opus is helping figure out why.
If this PR makes the tests non-flaky it's just a bandaid. There appears to be a race condition on cleanup on windows where shutting down connections races with shutting down the server. We'd still need to identify specifically where that race is occurring and where to fix it specifically. Right now, however, my goal is to unblock CI flakes.
The flakiness does seem generalized to any
test-http(s)-*and not just one... which makes me strongly suspect the race is quite low in the stack, possibly even in libuv but that's yet to be determined precisely.