JIT: PGO value-profiling for non-constant zero-init stackalloc #127970

Draft
EgorBo wants to merge 20 commits into dotnet:main from EgorBo:lclheap-pgo-value-probing

Conversation

@EgorBo
Member
EgorBo commented May 8, 2026

Revives my old PR that tried to do this, now that #127959 is merged.
Marking as reduce-unsafe because the motivation is to make SkipLocalsInit less useful.

Instrument variable-size stackallocs for PGO, and generalize value probing so we can reuse it for other future ideas.

void Test(int len)
{
    Span<byte> buffer = stackalloc byte[len];
    Consume(buffer);
}

is optimized to include an additional fast path for the most common length:

       cmp      rax, 100
       je       SHORT G_M44326_IG06
       ...
G_M44326_IG06:  ;; offset=0x004F
       test     dword ptr [rsp], esp
       sub      rsp, 112
       lea      rax, [rsp+0x20]
       vxorps   ymm0, ymm0, ymm0
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+0x30], zmm0

when it is mostly called with the same size (100 in this example), so the buffer is zeroed with just three AVX-512 instructions instead of a slow loop.
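At the source level, the specialized Tier1 code is morally equivalent to this hedged C# sketch (the JIT builds it as a guarded QMARK tree; 100 stands in for whatever size the profile recorded):

void Test(int len)
{
    // Fast arm: constant-size stackalloc whose zero-init Lowering unrolls with SIMD.
    // Slow arm: the original variable-size stackalloc with the zeroing loop.
    Span<byte> buffer = (len == 100) ? stackalloc byte[100] : stackalloc byte[len];
    Consume(buffer);
}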

Benchmark

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Benchmark] public void Stackalloc32()     => Consume(stackalloc byte[GetVar(32)]);
    [Benchmark] public void Stackalloc40()     => Consume(stackalloc byte[GetVar(40)]);
    [Benchmark] public void Stackalloc50()     => Consume(stackalloc byte[GetVar(50)]);
    [Benchmark] public void Stackalloc64()     => Consume(stackalloc byte[GetVar(64)]);
    [Benchmark] public void Stackalloc128()    => Consume(stackalloc byte[GetVar(128)]);
    [Benchmark] public void Stackalloc200()    => Consume(stackalloc byte[GetVar(200)]);
    [Benchmark] public void Stackalloc256()    => Consume(stackalloc byte[GetVar(256)]);
    [Benchmark] public void Stackalloc512()    => Consume(stackalloc byte[GetVar(512)]);
    [Benchmark] public void Stackalloc1024()   => Consume(stackalloc byte[GetVar(1024)]);
    [Benchmark] public void Stackalloc16384()  => Consume(stackalloc byte[GetVar(16384)]);
    [Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);

    [MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
    [MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}

Results

Speedup of PR vs main (>1.00× means the PR is faster). Sizes <= 32 bytes are intentionally not specialized (the variable-size loop is already cheap there).

Size      Linux x64 (EPYC 9V45)   Apple M4 arm64
32        1.00×                   0.92×
40        1.25×                   3.23×
50        1.37×                   2.32×
64        1.34×                   2.28×
128       1.77×                   0.87×
200       2.26×                   2.03×
256       2.46×                   0.89×
512       3.76×                   1.34×
1,024     2.43×                   1.27×
16,384    2.47×                   1.69×
524,288   2.54×                   1.24×

Raw runs: EgorBot/Benchmarks#197.

Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40/50/64/200: 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and the intentionally unspecialized 32-byte case near parity (1.00× / 0.92×).

When stackalloc has a non-constant size and the method has compInitMem,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size, and in Tier1 specialize:

    size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)

The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.

Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
  used by GT_LCLHEAP so it can participate in value-histogram instrumentation
  alongside calls without a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
  single IsValueHistogramProbeCandidate helper used by the visitor, schema
  builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations.
* Make GT_LCLHEAP report its potential stack-overflow exception via
  OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
  if-conversion does not collapse the QMARK arms into a CMOV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 19:54
@github-actions github-actions bot added the area-CodeGen-coreclr label May 8, 2026
@dotnet-policy-service

This comment was marked as resolved.

@EgorBo

This comment was marked as outdated.

This comment was marked as resolved.

EgorBo and others added 2 commits May 8, 2026 22:12
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result
  check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace the truncating ((uint32_t)profiledValue > INT_MAX)
  range check with !FitsIn<int>(profiledValue) (see the sketch after this commit).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional
  modeling of stack overflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
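A small self-contained C# illustration of why the truncating check in the FitsIn<int> item above was wrong (the 64-bit value is hypothetical):

long profiledValue = 0x1_0000_0005;                    // hypothetical out-of-range histogram value
bool oldRejects = (uint)profiledValue > int.MaxValue;  // cast truncates to 5 -> false: wrongly accepted
bool newRejects = profiledValue > int.MaxValue ||
                  profiledValue < int.MinValue;        // what !FitsIn<int> catches -> true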
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual
  metrics so SPMI replay/diff can show how often value-profile specialization
  fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)]
  so we only specialize when the constant-size LCLHEAP fast path is actually
  unrolled by Lowering. Outside that range the cmp/jne guard adds overhead
  with no codegen win. Also reject negative values that FitsIn<int> would
  otherwise allow. (Sketched after this commit message.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
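The clamp above reduces, conceptually, to a gate like this hedged C# sketch (names are illustrative; the real check lives in impProfileLclHeap and reads the threshold from Lowering's memset unroll limit):

// Only specialize when Lowering would actually unroll the constant-size zero-init.
static bool IsProfitableLclHeapSize(long profiledSize, int memsetUnrollThreshold) =>
    profiledSize >= 0 && profiledSize <= memsetUnrollThreshold;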
Copilot AI review requested due to automatic review settings May 8, 2026 20:39
@EgorBo

This comment was marked as resolved.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/importercalls.cpp Outdated
EgorBo and others added 2 commits May 8, 2026 23:14
Removes the duplicated getLikelyValues+stress-test+threshold-check block
and the unguarded likelyValues[0] read that could observe uninitialized
data when getLikelyValues returns 0. Now goes through the shared
pickProfiledValue helper which handles the empty-result case correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use
a zero-init must-init local for the fast path arm instead of LCLHEAP(N).
On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP
codegen carries non-trivial overhead (stack probes, outgoing-arg-area
adjustment, separate STORE_BLK) that dwarfs the cost of the slow
variable-size loop for very small allocations.

The local is zeroed at function entry by must-init regardless of which
arm runs; for sizes <= 32 bytes that's a single SIMD store at most.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 21:23
@EgorBo

This comment was marked as resolved.

Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set
weights for the QMARK branches and were leaving the default 50/50 split.
Use the histogram likelihood (already computed and the gating threshold
for creating the QMARK in the first place) so morph's QMARK->control-flow
expansion sets accurate edge probabilities for the cond/then/else blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
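Numerically (values illustrative, not taken from this PR), the fix means morph's QMARK expansion now sees the measured split instead of 50/50:

double likelihood  = 0.90;                             // histogram likelihood of the popular value
double blockWeight = 100.0;                            // weight of the block containing the QMARK
double fastWeight  = blockWeight * likelihood;         // 90: 'size == popular' arm
double slowWeight  = blockWeight * (1.0 - likelihood); // 10: fallback arm (both were 50 before)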
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment thread src/coreclr/jit/importercalls.cpp Outdated
Comment thread src/coreclr/jit/importer.cpp Outdated
Comment thread src/coreclr/jit/fgprofile.cpp
Comment thread src/coreclr/jit/gentree.h Outdated
EgorBo and others added 2 commits May 9, 2026 00:01
The promote-to-local fast path turned out to be more complexity than it
was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the
variable-size LCLHEAP loop is already fast enough that the cmp/jne
guard plus any constant-size codegen does not pay off. Just bail out
in that range.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 22:23
@EgorBo
Member Author

EgorBo commented May 8, 2026

Note

AI-generated benchmark (Copilot CLI). Re-run with a larger size set now that sizes <= 32 bytes are no longer specialized.

@EgorBot -linux_amd -osx_arm64

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Benchmark] public void Stackalloc32()     => Consume(stackalloc byte[GetVar(32)]);
    [Benchmark] public void Stackalloc40()     => Consume(stackalloc byte[GetVar(40)]);
    [Benchmark] public void Stackalloc50()     => Consume(stackalloc byte[GetVar(50)]);
    [Benchmark] public void Stackalloc64()     => Consume(stackalloc byte[GetVar(64)]);
    [Benchmark] public void Stackalloc128()    => Consume(stackalloc byte[GetVar(128)]);
    [Benchmark] public void Stackalloc200()    => Consume(stackalloc byte[GetVar(200)]);
    [Benchmark] public void Stackalloc256()    => Consume(stackalloc byte[GetVar(256)]);
    [Benchmark] public void Stackalloc512()    => Consume(stackalloc byte[GetVar(512)]);
    [Benchmark] public void Stackalloc1024()   => Consume(stackalloc byte[GetVar(1024)]);
    [Benchmark] public void Stackalloc16384()  => Consume(stackalloc byte[GetVar(16384)]);
    [Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);

    [MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
    [MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}

Copilot AI review requested due to automatic review settings May 8, 2026 23:27
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread src/coreclr/jit/fgprofile.cpp Outdated
Comment thread src/coreclr/jit/importer.cpp Outdated
The non-nint (32-bit) length / size operands fed into
CORINFO_HELP_VALUEPROFILE32/64 (Memmove len, SequenceEqual len, LCLHEAP size)
are non-negative. Widening them with a signed cast would sign-extend any int
value with the high bit set into a huge negative ssize_t and skew the
recorded histogram (and the subsequent getLikelyValues consumption).
Switch to an unsigned widening.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
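In C# terms, the difference between the two widenings (the bit pattern is hypothetical):

int len = unchecked((int)0x8000_0010);  // hypothetical length bits with the high bit set
long asSigned   = len;                  // sign-extends to -2147483632 and skews the histogram
long asUnsigned = (uint)len;            // zero-extends to 2147483664, the value the probe should record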
Marking GT_LCLHEAP with GTF_ORDER_SIDEEFF (and adding it to
OperSupportsOrderingSideEffect) was overstating its semantics: the only
phase that actually misbehaves on two LCLHEAPs in QMARK arms today is
if-conversion, and that pass already has a CostEx > 7 cutoff designed
exactly for "this would be too expensive to evaluate unconditionally".

Drop the GTF_ORDER_SIDEEFF marker and the OperSupportsOrderingSideEffect
entry; instead give GT_LCLHEAP a realistic cost in gtSetEvalOrder:
* For constant size with compInitMem we use 8 + alignedSize/REGSIZE
  (Lowering will unroll the zero-init via STORE_BLK).
* For non-constant size we use a fixed high cost representing the
  runtime zero-init loop.

Both branches stay above the cutoff, so if-conversion declines on the
two-LCLHEAP QMARK shape impProfileLclHeap produces. Verified locally:

    Skipping if-conversion that will evaluate RHS unconditionally at costs 23,37

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
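A worked instance of the constant-size cost above, assuming x64 (STACK_ALIGN = 16, REGSIZE_BYTES = 8) and the 100-byte size from the PR's running example:

int size        = 100;
int alignedSize = (size + 15) & ~15;    // round up to stack alignment: 112
int costEx      = 8 + alignedSize / 8;  // 22, comfortably above if-conversion's cutoff of 7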
Copilot AI review requested due to automatic review settings May 9, 2026 09:54
Temporary change to flip the implicit module-level [SkipLocalsInit] off
for all NETCoreApp libraries, so CI exercises compInitMem=true code paths
end-to-end against this PR's GT_LCLHEAP changes (cost model, value-profile
specialization, QMARK arms).

Must be reverted before merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@EgorBo

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@EgorBo
Member Author

EgorBo commented May 9, 2026

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/importercalls.cpp Outdated
…ing compInitMem

The previous attempt to flip the SkipLocalsInit default to false broke CI
because the same <When SkipLocalsInit='true'> block also implicitly sets
AllowUnsafeBlocks=true (which many libraries silently rely on). Without it,
every project that uses unsafe code without explicitly setting the flag
fails with CS0227.

Restructure: keep SkipLocalsInit=true (so AllowUnsafeBlocks stays set),
but skip including the SkipLocalsInit.cs module-attribute file. That way
[module: SkipLocalsInit] is not emitted -> compInitMem becomes true,
without breaking any unsafe-code-using libraries.

Also apply the JIT format patch (clang-format alignment in gentree.cpp's
new GT_LCLHEAP cost case).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@EgorBo
Member Author

EgorBo commented May 9, 2026

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

EgorBo and others added 2 commits May 9, 2026 16:02
Collapse the previous per-shape impDuplicateWithProfiledArg /
impProfileLclHeap into a single Compiler::impProfileValueGuardedTree
that callers parameterize declaratively:

  op1 = impProfileValueGuardedTree(node, &operandRef, ilOffset,
                                   minProfitable, maxProfitable,
                                   &successMetric DEBUGARG(tmpName));

The helper handles both the Instrumented Tier0 marking (sets
BBF_HAS_VALUE_PROFILE; for calls also attaches
gtHandleHistogramProfileCandidateInfo) and the Tier1+PGO specialization:

  * pickProfiledValue + likelihood/range filtering.
  * For calls, pre-spills the non-profiled args so their side effects
    stay in evaluation order.
  * Spills *operandRef into a temp.
  * slowTree = gtCloneExpr(node) (clone references the spilled temp).
  * Mutates *operandRef to a constant -> node itself becomes fastTree.
  * Builds (operand == popular) ? fastTree : slowTree, sets the QMARK
    likelihood from the histogram, bumps the supplied metric, and wraps
    value-typed results in a STORE_LCL_VAR/LCL_VAR pair.

No oper switch, no lambdas, no IsValueHistogramProbeCandidate dependency
on the importer side. Per-shape concerns (which operand is profiled, the
profitable range, which metric to bump) live with the caller. Adding a
new probe site is now a single declarative call.

IsValueHistogramProbeCandidate stays in fgprofile.cpp as the per-oper
registry needed by the value-histogram instrumentor visitor (which walks
all nodes in BBF_HAS_VALUE_PROFILE blocks).

Verified:
  * SPMI benchmarks.run.: All replays clean.
  * LCLHEAP repro: Tier1 still emits cmp/je guard + unrolled vmovdqu32
    fast path + push-loop slow path (identical codegen to pre-refactor).
  * Memmove repro under JitRandomGuardedDevirtualization=1 stress: builds
    QMARK and bumps ValueProfiledMemmove.

Also picks up an incidental fix for the IsValueHistogramProbeCandidate
default-arg/null-guard regression introduced by the prior 'clean up'
commit: the visitor at fgprofile.cpp:1969 calls with two args, but
'clean up' had removed both the default value and the null guard,
breaking the build at HEAD.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 9, 2026 15:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment thread src/libraries/Directory.Build.targets
Comment thread src/libraries/Directory.Build.targets
Comment on lines +1927 to +1930
if (ilOffset != nullptr)
{
    assert(node->AsCall()->gtHandleHistogramProfileCandidateInfo != nullptr);
    *ilOffset = node->AsCall()->gtHandleHistogramProfileCandidateInfo->ilOffset;
Comment on lines +6937 to +6941
const ssize_t size        = op1->AsIntCon()->IconValue();
const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
costEx                    = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
costSz                    = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
EgorBo and others added 2 commits May 9, 2026 17:49
Previously the sibling-spill loop in impProfileValueGuardedTree only
walked GetUserArgByIndex(i) over CountUserArgs(), which misses:
  * gtControlExpr (indirect / fat / expanded-virtual call targets)
  * Non-user CallArg slots (this pointer, ret buffer, well-known args
    like PInvoke cookie / stack-array-local / etc.)

Replace with a generic GenTreeUseEdgeIterator walk. Documented as
"each use edge of a GenTree node in the order in which they are used"
(GTF_REVERSE_OPS-aware), so eval order is preserved. As a bonus the
helper now works uniformly across node kinds (calls, unary nodes,
HWINTRINSICs, ...) without an IsCall() special case.

Also skip side-effect-free siblings (e.g. CNS_INT, fresh LCL_VARs):
no reordering hazard, no need to allocate a temp.

Verified:
  * SPMI benchmarks.run.: All replays clean.
  * LCLHEAP repro: ValueProfiledLclHeap = 1, Tier1 codegen unchanged.
  * Memmove repro under JitRandomGuardedDevirtualization=1 stress:
    builds QMARK and bumps ValueProfiledMemmove.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IsValueHistogramProbeCandidate matched Memmove/SequenceEqual purely on
NamedIntrinsic. Since impProfileValueGuardedTree skips marking during
inline import (compIsForInlining), an inlined Memmove can survive into
a caller's BB without gtHandleHistogramProfileCandidateInfo. If that
caller's BB is marked BBF_HAS_VALUE_PROFILE because of its own probe
site, the value-instrumentor visitor walks the inlined call, matches
on the intrinsic, and then derefs/asserts the null candidate info in
the schema-builder / inserter.

Treat presence of gtHandleHistogramProfileCandidateInfo as the explicit
"importer marked this call for value profiling" signal: if it's null,
return false. Inlined unmarked sites are now skipped cleanly.

(LCLHEAP doesn't have an analogous explicit-mark bit -- it always
carries an IL offset on GenTreeOpWithILOffset -- so an inlined LCLHEAP
in a marked BB still gets a schema slot. That's wasted instrumentation,
but not a crash; needs a separate mechanism to fully suppress.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 9, 2026 15:53
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment on lines 251 to 271
@@ -256,7 +266,7 @@
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
   </PropertyGroup>

-  <ItemGroup>
+  <ItemGroup Condition="'$(SkipLocalsInitForCI)' != 'true'">
     <Compile Include="$(CommonPath)SkipLocalsInit.cs" Link="Common\SkipLocalsInit.cs" />
   </ItemGroup>
Comment on lines +6935 to +6946
if (op1->IsCnsIntOrI() && info.compInitMem)
{
    const ssize_t size        = op1->AsIntCon()->IconValue();
    const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
    costEx                    = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
    costSz                    = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
else
{
    costEx = 36;
    costSz = 8;
}
Comment on lines +1903 to +1905
// Mirror the gating in impProfileLclHeap: only zero-init, non-constant
// locallocs are value-profile candidates. Anything else would have its
// probe data ignored by the consumer and just waste a schema slot.