JIT: PGO value-profiling for non-constant zero-init stackalloc #127970
Draft
EgorBo wants to merge 20 commits into dotnet:main from
Conversation
When stackalloc has a non-constant size and the method has compInitMem,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size, and in Tier1 specialize:
size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)
The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.
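As a rough mental model of the Tier0 side (a managed C# sketch only; the real probe is the CORINFO_HELP_VALUEPROFILE32/64 JIT helper feeding the PGO schema, and all names below are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

// Count every dynamic size observed at the probe site.
var counts = new Dictionary<long, int>();
void Record(long size) => counts[size] = counts.GetValueOrDefault(size) + 1;

foreach (long s in new long[] { 100, 100, 100, 100, 64, 100, 100, 128 })
    Record(s);

// Tier1 picks the most popular value and only specializes when its
// likelihood clears a threshold (and the value is in the profitable range).
var top = counts.MaxBy(kv => kv.Value);
double likelihood = (double)top.Value / counts.Values.Sum();
Console.WriteLine($"popular size = {top.Key}, likelihood = {likelihood:P0}"); // 100, 75%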
Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
used by GT_LCLHEAP so it can participate in value-histogram instrumentation
alongside calls without a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
single IsValueHistogramProbeCandidate helper used by the visitor, schema
builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations.
* Make GT_LCLHEAP report its potential stack-overflow exception via
OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
if-conversion does not collapse the QMARK arms into a CMOV.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace truncating ((uint32_t)profiledValue > INT_MAX) range check with !FitsIn<int>(profiledValue).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional modeling of stack overflow.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual metrics so SPMI replay/diff can show how often value-profile specialization fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)] so we only specialize when the constant-size LCLHEAP fast path is actually unrolled by Lowering. Outside that range the cmp/jne guard adds overhead with no codegen win. Also reject negative values that FitsIn<int> would otherwise allow.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the duplicated getLikelyValues+stress-test+threshold-check block and the unguarded likelyValues[0] read that could observe uninitialized data when getLikelyValues returns 0. Now goes through the shared pickProfiledValue helper which handles the empty-result case correctly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use a zero-init must-init local for the fast path arm instead of LCLHEAP(N). On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP codegen carries non-trivial overhead (stack probes, outgoing-arg-area adjustment, separate STORE_BLK) that dwarfs the cost of the slow variable-size loop for very small allocations.
The local is zeroed at function entry by must-init regardless of which arm runs; for sizes <= 32 bytes that's a single SIMD store at most.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set weights for the QMARK branches and were leaving the default 50/50 split. Use the histogram likelihood (already computed and the gating threshold for creating the QMARK in the first place) so morph's QMARK->control-flow expansion sets accurate edge probabilities for the cond/then/else blocks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The promote-to-local fast path turned out to be more complexity than it was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the variable-size LCLHEAP loop is already fast enough that the cmp/jne guard plus any constant-size codegen does not pay off. Just bail out in that range.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
Note: AI-generated benchmark (Copilot CLI). Re-run with a larger size set after skipping the <= 32 byte specialization.
@EgorBot -linux_amd -osx_arm64
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
[Benchmark] public void Stackalloc32() => Consume(stackalloc byte[GetVar(32)]);
[Benchmark] public void Stackalloc40() => Consume(stackalloc byte[GetVar(40)]);
[Benchmark] public void Stackalloc50() => Consume(stackalloc byte[GetVar(50)]);
[Benchmark] public void Stackalloc64() => Consume(stackalloc byte[GetVar(64)]);
[Benchmark] public void Stackalloc128() => Consume(stackalloc byte[GetVar(128)]);
[Benchmark] public void Stackalloc200() => Consume(stackalloc byte[GetVar(200)]);
[Benchmark] public void Stackalloc256() => Consume(stackalloc byte[GetVar(256)]);
[Benchmark] public void Stackalloc512() => Consume(stackalloc byte[GetVar(512)]);
[Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[GetVar(1024)]);
[Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[GetVar(16384)]);
[Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);
[MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
[MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}
The non-nint length / size operands fed into CORINFO_HELP_VALUEPROFILE32/64 (Memmove len, SequenceEqual len, LCLHEAP size) are non-negative. Widening them with a signed cast would sign-extend any int value with the high bit set into a huge negative ssize_t and skew the recorded histogram (and the subsequent getLikelyValues consumption). Switch to an unsigned widening.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
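A minimal C# illustration of the skew (a sketch of the semantics only; the actual fix is a cast change on the JIT importer side):

using System;

// Bit pattern of a 32-bit length whose high bit is set (semantically the
// unsigned value 2,147,483,712, i.e. larger than int.MaxValue).
int lenBits = unchecked((int)0x8000_0040u);

long signedWidened   = lenBits;        // sign-extends to -2147483584 -> skews the histogram
long unsignedWidened = (uint)lenBits;  // zero-extends to  2147483712 -> the intended value

Console.WriteLine(signedWidened);
Console.WriteLine(unsignedWidened);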
Marking GT_LCLHEAP with GTF_ORDER_SIDEEFF (and adding it to
OperSupportsOrderingSideEffect) was overstating its semantics: the only
phase that actually misbehaves on two LCLHEAPs in QMARK arms today is
if-conversion, and that pass already has a CostEx > 7 cutoff designed
exactly for "this would be too expensive to evaluate unconditionally".
Drop the GTF_ORDER_SIDEEFF marker and the OperSupportsOrderingSideEffect
entry; instead give GT_LCLHEAP a realistic cost in gtSetEvalOrder:
* For constant size with compInitMem we use 8 + alignedSize/REGSIZE
(Lowering will unroll the zero-init via STORE_BLK).
* For non-constant size we use a fixed high cost representing the
runtime zero-init loop.
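For scale, a quick worked example (assuming x64, where REGSIZE_BYTES = 8 and STACK_ALIGN = 16; both constants are target-dependent):

  constant size 100: alignedSize = 112, costEx = 8 + 112/8 = 22
  constant size 512: alignedSize = 512, costEx = 8 + 512/8 = 72
  non-constant size: costEx = 36

(With the size operand's own cost of 1 added per arm, the 100-byte case is consistent with the "costs 23,37" dump line quoted below.)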
Both branches stay above the cutoff, so if-conversion declines on the
two-LCLHEAP QMARK shape impProfileLclHeap produces. Verified locally:
Skipping if-conversion that will evaluate RHS unconditionally at costs 23,37
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Temporary change to flip the implicit module-level [SkipLocalsInit] off for all NETCoreApp libraries, so CI exercises compInitMem=true code paths end-to-end against this PR's GT_LCLHEAP changes (cost model, value-profile specialization, QMARK arms). Must be reverted before merge.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo
Azure Pipelines successfully started running 2 pipeline(s).
…ing compInitMem
The previous attempt to flip the SkipLocalsInit default to false broke CI because the same <When SkipLocalsInit='true'> block also implicitly sets AllowUnsafeBlocks=true (which many libraries silently rely on). Without it, every project that uses unsafe code without explicitly setting the flag fails with CS0227.
Restructure: keep SkipLocalsInit=true (so AllowUnsafeBlocks stays set), but skip including the SkipLocalsInit.cs module-attribute file. That way [module: SkipLocalsInit] is not emitted -> compInitMem becomes true, without breaking any unsafe-code-using libraries.
Also apply the JIT format patch (clang-format alignment in gentree.cpp's new GT_LCLHEAP cost case).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo
Azure Pipelines successfully started running 2 pipeline(s).
Collapse the previous per-shape impDuplicateWithProfiledArg /
impProfileLclHeap into a single Compiler::impProfileValueGuardedTree
that callers parameterize declaratively:
op1 = impProfileValueGuardedTree(node, &operandRef, ilOffset,
minProfitable, maxProfitable,
&successMetric DEBUGARG(tmpName));
The helper handles both the Instrumented Tier0 marking (sets
BBF_HAS_VALUE_PROFILE; for calls also attaches
gtHandleHistogramProfileCandidateInfo) and the Tier1+PGO specialization:
* pickProfiledValue + likelihood/range filtering.
* For calls, pre-spills the non-profiled args so their side effects
stay in evaluation order.
* Spills *operandRef into a temp.
* slowTree = gtCloneExpr(node) (clone references the spilled temp).
* Mutates *operandRef to a constant -> node itself becomes fastTree.
* Builds (operand == popular) ? fastTree : slowTree, sets the QMARK
likelihood from the histogram, bumps the supplied metric, and wraps
value-typed results in a STORE_LCL_VAR/LCL_VAR pair.
No oper switch, no lambdas, no IsValueHistogramProbeCandidate dependency
on the importer side. Per-shape concerns (which operand is profiled, the
profitable range, which metric to bump) live with the caller. Adding a
new probe site is now a single declarative call.
IsValueHistogramProbeCandidate stays in fgprofile.cpp as the per-oper
registry needed by the value-histogram instrumentor visitor (which walks
all nodes in BBF_HAS_VALUE_PROFILE blocks).
Verified:
* SPMI benchmarks.run.: All replays clean.
* LCLHEAP repro: Tier1 still emits cmp/je guard + unrolled vmovdqu32
fast path + push-loop slow path (identical codegen to pre-refactor).
* Memmove repro under JitRandomGuardedDevirtualization=1 stress: builds
QMARK and bumps ValueProfiledMemmove.
Also picks up an incidental fix for the IsValueHistogramProbeCandidate
default-arg/null-guard regression introduced by the prior 'clean up'
commit: the visitor at fgprofile.cpp:1969 calls with two args, but
'clean up' had removed both the default value and the null guard,
breaking the build at HEAD.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +1927 to +1930
if (ilOffset != nullptr)
{
    assert(node->AsCall()->gtHandleHistogramProfileCandidateInfo != nullptr);
    *ilOffset = node->AsCall()->gtHandleHistogramProfileCandidateInfo->ilOffset;
Comment on lines +6937 to +6941
const ssize_t size        = op1->AsIntCon()->IconValue();
const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
costEx = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
costSz = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
Previously the sibling-spill loop in impProfileValueGuardedTree only
walked GetUserArgByIndex(i) over CountUserArgs(), which misses:
* gtControlExpr (indirect / fat / expanded-virtual call targets)
* Non-user CallArg slots (this pointer, ret buffer, well-known args
like PInvoke cookie / stack-array-local / etc.)
Replace with a generic GenTreeUseEdgeIterator walk. Documented as
"each use edge of a GenTree node in the order in which they are used"
(GTF_REVERSE_OPS-aware), so eval order is preserved. As a bonus the
helper now works uniformly across node kinds (calls, unary nodes,
HWINTRINSICs, ...) without an IsCall() special case.
Also skip side-effect-free siblings (e.g. CNS_INT, fresh LCL_VARs):
no reordering hazard, no need to allocate a temp.
Verified:
* SPMI benchmarks.run.: All replays clean.
* LCLHEAP repro: ValueProfiledLclHeap = 1, Tier1 codegen unchanged.
* Memmove repro under JitRandomGuardedDevirtualization=1 stress:
builds QMARK and bumps ValueProfiledMemmove.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IsValueHistogramProbeCandidate matched Memmove/SequenceEqual purely on NamedIntrinsic. Since impProfileValueGuardedTree skips marking during inline import (compIsForInlining), an inlined Memmove can survive into a caller's BB without gtHandleHistogramProfileCandidateInfo. If that caller's BB is marked BBF_HAS_VALUE_PROFILE because of its own probe site, the value-instrumentor visitor walks the inlined call, matches on the intrinsic, and then derefs/asserts the null candidate info in the schema-builder / inserter.
Treat presence of gtHandleHistogramProfileCandidateInfo as the explicit "importer marked this call for value profiling" signal: if it's null, return false. Inlined unmarked sites are now skipped cleanly.
(LCLHEAP doesn't have an analogous explicit-mark bit -- it always carries an IL offset on GenTreeOpWithILOffset -- so an inlined LCLHEAP in a marked BB still gets a schema slot. That's wasted instrumentation, but not a crash; needs a separate mechanism to fully suppress.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines 251 to 271
@@ -256,7 +266,7 @@
   <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
 </PropertyGroup>

-  <ItemGroup>
+  <ItemGroup Condition="'$(SkipLocalsInitForCI)' != 'true'">
   <Compile Include="$(CommonPath)SkipLocalsInit.cs" Link="Common\SkipLocalsInit.cs" />
 </ItemGroup>
Comment on lines +6935 to +6946
if (op1->IsCnsIntOrI() && info.compInitMem)
{
    const ssize_t size        = op1->AsIntCon()->IconValue();
    const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
    costEx = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
    costSz = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
else
{
    costEx = 36;
    costSz = 8;
}
Comment on lines +1903 to +1905
// Mirror the gating in impProfileLclHeap: only zero-init, non-constant
// locallocs are value-profile candidates. Anything else would have its
// probe data ignored by the consumer and just waste a schema slot.
Revive my old PR that tried to do this, now that #127959 is merged.
Marking as reduce-unsafe because the motivation is to make SkipLocalsInit less useful. Instrument stackallocs of variable size for PGO + generalize value probing so we can reuse it for other future ideas.
A zero-initialized stackalloc with a non-constant size is optimized into an additional fast path for the most common length: when the method is called mostly with the same size (100 in this example), the buffer is zeroed with just 3 AVX512 instructions instead of a slow loop.
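The original snippet and its disassembly aren't reproduced here; below is a minimal representative sketch of the shape (hypothetical method names), assuming tiered PGO is enabled:

using System;
using System.Runtime.CompilerServices;

for (int i = 0; i < 1_000_000; i++)
    Process(100);   // dominant size -> the PGO histogram records 100

Process(257);       // rare sizes still work, they just take the slow arm

[MethodImpl(MethodImplOptions.NoInlining)]
static void Process(int size)
{
    // Without [SkipLocalsInit], compInitMem forces zero-init of the buffer.
    // With this change, Tier1 guards on the popular size:
    //   size == 100 ? LCLHEAP(100)  (unrolled SIMD zero-init)
    //               : LCLHEAP(size) (generic zero-init loop)
    Span<byte> buffer = stackalloc byte[size];
    Consume(buffer);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(Span<byte> buffer) { }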
Benchmark
Results
Speedup of PR vs main (> 1.00× = PR is faster, shown in bold). Sizes <= 32 bytes are intentionally not specialized (the variable-size loop is already cheap there). Raw runs: EgorBot/Benchmarks#197.
Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40 / 50 / 64 / 200 -> 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and parity-or-better elsewhere.