JIT: PGO value-profiling for non-constant zero-init stackalloc #127970
Draft
EgorBo wants to merge 20 commits into dotnet:main from
Conversation
When stackalloc has a non-constant size and the method has compInitMem,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size, and in Tier1 specialize:
size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)
The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.
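As a rough mental model of the Tier0 side (a managed C# sketch only; the real probe is the CORINFO_HELP_VALUEPROFILE32/64 JIT helper feeding the PGO schema, and all names below are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

// Count every dynamic size observed at the probe site.
var counts = new Dictionary<long, int>();
void Record(long size) => counts[size] = counts.GetValueOrDefault(size) + 1;

foreach (long s in new long[] { 100, 100, 100, 100, 64, 100, 100, 128 })
    Record(s);

// Tier1 picks the most popular value and only specializes when its
// likelihood clears a threshold (and the value is in the profitable range).
var top = counts.MaxBy(kv => kv.Value);
double likelihood = (double)top.Value / counts.Values.Sum();
Console.WriteLine($"popular size = {top.Key}, likelihood = {likelihood:P0}"); // 100, 75%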
Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
used by GT_LCLHEAP so it can participate in value-histogram instrumentation
alongside calls without a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
single IsValueHistogramProbeCandidate helper used by the visitor, schema
builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations.
* Make GT_LCLHEAP report its potential stack-overflow exception via
OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
if-conversion does not collapse the QMARK arms into a CMOV.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace truncating ((uint32_t)profiledValue > INT_MAX) range check with !FitsIn<int>(profiledValue).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional modeling of stack overflow.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual metrics so SPMI replay/diff can show how often value-profile specialization fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)] so we only specialize when the constant-size LCLHEAP fast path is actually unrolled by Lowering. Outside that range the cmp/jne guard adds overhead with no codegen win. Also reject negative values that FitsIn<int> would otherwise allow.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the duplicated getLikelyValues+stress-test+threshold-check block and the unguarded likelyValues[0] read that could observe uninitialized data when getLikelyValues returns 0. Now goes through the shared pickProfiledValue helper which handles the empty-result case correctly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use a zero-init must-init local for the fast path arm instead of LCLHEAP(N). On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP codegen carries non-trivial overhead (stack probes, outgoing-arg-area adjustment, separate STORE_BLK) that dwarfs the cost of the slow variable-size loop for very small allocations.
The local is zeroed at function entry by must-init regardless of which arm runs; for sizes <= 32 bytes that's a single SIMD store at most.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set weights for the QMARK branches and were leaving the default 50/50 split. Use the histogram likelihood (already computed and the gating threshold for creating the QMARK in the first place) so morph's QMARK->control-flow expansion sets accurate edge probabilities for the cond/then/else blocks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The promote-to-local fast path turned out to be more complexity than it was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the variable-size LCLHEAP loop is already fast enough that the cmp/jne guard plus any constant-size codegen does not pay off. Just bail out in that range.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
Note: AI-generated benchmark (Copilot CLI). Re-run with a larger size set after skipping the <= 32 byte specialization.
@EgorBot -linux_amd -osx_arm64
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
[Benchmark] public void Stackalloc32() => Consume(stackalloc byte[GetVar(32)]);
[Benchmark] public void Stackalloc40() => Consume(stackalloc byte[GetVar(40)]);
[Benchmark] public void Stackalloc50() => Consume(stackalloc byte[GetVar(50)]);
[Benchmark] public void Stackalloc64() => Consume(stackalloc byte[GetVar(64)]);
[Benchmark] public void Stackalloc128() => Consume(stackalloc byte[GetVar(128)]);
[Benchmark] public void Stackalloc200() => Consume(stackalloc byte[GetVar(200)]);
[Benchmark] public void Stackalloc256() => Consume(stackalloc byte[GetVar(256)]);
[Benchmark] public void Stackalloc512() => Consume(stackalloc byte[GetVar(512)]);
[Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[GetVar(1024)]);
[Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[GetVar(16384)]);
[Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);
[MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
[MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}
The non-nint length / size operands fed into CORINFO_HELP_VALUEPROFILE32/64 (Memmove len, SequenceEqual len, LCLHEAP size) are non-negative. Widening them with a signed cast would sign-extend any int value with the high bit set into a huge negative ssize_t and skew the recorded histogram (and the subsequent getLikelyValues consumption). Switch to an unsigned widening.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
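A minimal C# illustration of the skew (a sketch of the semantics only; the actual fix is a cast change on the JIT importer side):

using System;

// Bit pattern of a 32-bit length whose high bit is set (semantically the
// unsigned value 2,147,483,712, i.e. larger than int.MaxValue).
int lenBits = unchecked((int)0x8000_0040u);

long signedWidened   = lenBits;        // sign-extends to -2147483584 -> skews the histogram
long unsignedWidened = (uint)lenBits;  // zero-extends to  2147483712 -> the intended value

Console.WriteLine(signedWidened);
Console.WriteLine(unsignedWidened);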
Marking GT_LCLHEAP with GTF_ORDER_SIDEEFF (and adding it to
OperSupportsOrderingSideEffect) was overstating its semantics: the only
phase that actually misbehaves on two LCLHEAPs in QMARK arms today is
if-conversion, and that pass already has a CostEx > 7 cutoff designed
exactly for "this would be too expensive to evaluate unconditionally".
Drop the GTF_ORDER_SIDEEFF marker and the OperSupportsOrderingSideEffect
entry; instead give GT_LCLHEAP a realistic cost in gtSetEvalOrder:
* For constant size with compInitMem we use 8 + alignedSize/REGSIZE
(Lowering will unroll the zero-init via STORE_BLK).
* For non-constant size we use a fixed high cost representing the
runtime zero-init loop.
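For scale, a quick worked example (assuming x64, where REGSIZE_BYTES = 8 and STACK_ALIGN = 16; both constants are target-dependent):

  constant size 100: alignedSize = 112, costEx = 8 + 112/8 = 22
  constant size 512: alignedSize = 512, costEx = 8 + 512/8 = 72
  non-constant size: costEx = 36

(With the size operand's own cost of 1 added per arm, the 100-byte case is consistent with the "costs 23,37" dump line quoted below.)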
Both branches stay above the cutoff, so if-conversion declines on the
two-LCLHEAP QMARK shape impProfileLclHeap produces. Verified locally:
Skipping if-conversion that will evaluate RHS unconditionally at costs 23,37
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Temporary change to flip the implicit module-level [SkipLocalsInit] off for all NETCoreApp libraries, so CI exercises compInitMem=true code paths end-to-end against this PR's GT_LCLHEAP changes (cost model, value-profile specialization, QMARK arms). Must be reverted before merge.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo
Azure Pipelines successfully started running 2 pipeline(s).
…ing compInitMem
The previous attempt to flip the SkipLocalsInit default to false broke CI because the same <When SkipLocalsInit='true'> block also implicitly sets AllowUnsafeBlocks=true (which many libraries silently rely on). Without it, every project that uses unsafe code without explicitly setting the flag fails with CS0227.
Restructure: keep SkipLocalsInit=true (so AllowUnsafeBlocks stays set), but skip including the SkipLocalsInit.cs module-attribute file. That way [module: SkipLocalsInit] is not emitted -> compInitMem becomes true, without breaking any unsafe-code-using libraries.
Also apply the JIT format patch (clang-format alignment in gentree.cpp's new GT_LCLHEAP cost case).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo
Azure Pipelines successfully started running 2 pipeline(s).
Collapse the previous per-shape impDuplicateWithProfiledArg /
impProfileLclHeap into a single Compiler::impProfileValueGuardedTree
that callers parameterize declaratively:
op1 = impProfileValueGuardedTree(node, &operandRef, ilOffset,
minProfitable, maxProfitable,
&successMetric DEBUGARG(tmpName));
The helper handles both the Instrumented Tier0 marking (sets
BBF_HAS_VALUE_PROFILE; for calls also attaches
gtHandleHistogramProfileCandidateInfo) and the Tier1+PGO specialization:
* pickProfiledValue + likelihood/range filtering.
* For calls, pre-spills the non-profiled args so their side effects
stay in evaluation order.
* Spills *operandRef into a temp.
* slowTree = gtCloneExpr(node) (clone references the spilled temp).
* Mutates *operandRef to a constant -> node itself becomes fastTree.
* Builds (operand == popular) ? fastTree : slowTree, sets the QMARK
likelihood from the histogram, bumps the supplied metric, and wraps
value-typed results in a STORE_LCL_VAR/LCL_VAR pair.
No oper switch, no lambdas, no IsValueHistogramProbeCandidate dependency
on the importer side. Per-shape concerns (which operand is profiled, the
profitable range, which metric to bump) live with the caller. Adding a
new probe site is now a single declarative call.
IsValueHistogramProbeCandidate stays in fgprofile.cpp as the per-oper
registry needed by the value-histogram instrumentor visitor (which walks
all nodes in BBF_HAS_VALUE_PROFILE blocks).
Verified:
* SPMI benchmarks.run.: All replays clean.
* LCLHEAP repro: Tier1 still emits cmp/je guard + unrolled vmovdqu32
fast path + push-loop slow path (identical codegen to pre-refactor).
* Memmove repro under JitRandomGuardedDevirtualization=1 stress: builds
QMARK and bumps ValueProfiledMemmove.
Also picks up an incidental fix for the IsValueHistogramProbeCandidate
default-arg/null-guard regression introduced by the prior 'clean up'
commit: the visitor at fgprofile.cpp:1969 calls with two args, but
'clean up' had removed both the default value and the null guard,
breaking the build at HEAD.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +1927 to +1930
if (ilOffset != nullptr)
{
    assert(node->AsCall()->gtHandleHistogramProfileCandidateInfo != nullptr);
    *ilOffset = node->AsCall()->gtHandleHistogramProfileCandidateInfo->ilOffset;
Comment on lines +6937 to +6941
const ssize_t size        = op1->AsIntCon()->IconValue();
const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
costEx = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
costSz = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
Previously the sibling-spill loop in impProfileValueGuardedTree only
walked GetUserArgByIndex(i) over CountUserArgs(), which misses:
* gtControlExpr (indirect / fat / expanded-virtual call targets)
* Non-user CallArg slots (this pointer, ret buffer, well-known args
like PInvoke cookie / stack-array-local / etc.)
Replace with a generic GenTreeUseEdgeIterator walk. Documented as
"each use edge of a GenTree node in the order in which they are used"
(GTF_REVERSE_OPS-aware), so eval order is preserved. As a bonus the
helper now works uniformly across node kinds (calls, unary nodes,
HWINTRINSICs, ...) without an IsCall() special case.
Also skip side-effect-free siblings (e.g. CNS_INT, fresh LCL_VARs):
no reordering hazard, no need to allocate a temp.
Verified:
* SPMI benchmarks.run.: All replays clean.
* LCLHEAP repro: ValueProfiledLclHeap = 1, Tier1 codegen unchanged.
* Memmove repro under JitRandomGuardedDevirtualization=1 stress:
builds QMARK and bumps ValueProfiledMemmove.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IsValueHistogramProbeCandidate matched Memmove/SequenceEqual purely on NamedIntrinsic. Since impProfileValueGuardedTree skips marking during inline import (compIsForInlining), an inlined Memmove can survive into a caller's BB without gtHandleHistogramProfileCandidateInfo. If that caller's BB is marked BBF_HAS_VALUE_PROFILE because of its own probe site, the value-instrumentor visitor walks the inlined call, matches on the intrinsic, and then derefs/asserts the null candidate info in the schema-builder / inserter.
Treat presence of gtHandleHistogramProfileCandidateInfo as the explicit "importer marked this call for value profiling" signal: if it's null, return false. Inlined unmarked sites are now skipped cleanly.
(LCLHEAP doesn't have an analogous explicit-mark bit -- it always carries an IL offset on GenTreeOpWithILOffset -- so an inlined LCLHEAP in a marked BB still gets a schema slot. That's wasted instrumentation, but not a crash; needs a separate mechanism to fully suppress.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines 251 to 271
@@ -256,7 +266,7 @@
   <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
 </PropertyGroup>

-  <ItemGroup>
+  <ItemGroup Condition="'$(SkipLocalsInitForCI)' != 'true'">
   <Compile Include="$(CommonPath)SkipLocalsInit.cs" Link="Common\SkipLocalsInit.cs" />
 </ItemGroup>
Comment on lines +6935 to +6946
if (op1->IsCnsIntOrI() && info.compInitMem)
{
    const ssize_t size        = op1->AsIntCon()->IconValue();
    const ssize_t alignedSize = (size + (STACK_ALIGN - 1)) & ~(ssize_t)(STACK_ALIGN - 1);
    costEx = 8 + (int)(alignedSize / REGSIZE_BYTES); // > 7 to block if-conversion
    costSz = 4 + (int)(alignedSize / REGSIZE_BYTES);
}
else
{
    costEx = 36;
    costSz = 8;
}
Comment on lines +1903 to +1905
// Mirror the gating in impProfileLclHeap: only zero-init, non-constant
// locallocs are value-profile candidates. Anything else would have its
// probe data ignored by the consumer and just waste a schema slot.
Revive my old PR that tried to do this, now that #127959 is merged.
Marking as reduce-unsafe because the motivation is to make SkipLocalsInit less useful. Instrument stackallocs of variable size for PGO + generalize value probing so we can reuse it for other future ideas.
A zero-initialized stackalloc with a non-constant size is optimized into an additional fast path for the most common length: when the method is called mostly with the same size (100 in this example), the buffer is zeroed with just 3 AVX512 instructions instead of a slow loop.
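The original snippet and its disassembly aren't reproduced here; below is a minimal representative sketch of the shape (hypothetical method names), assuming tiered PGO is enabled:

using System;
using System.Runtime.CompilerServices;

for (int i = 0; i < 1_000_000; i++)
    Process(100);   // dominant size -> the PGO histogram records 100

Process(257);       // rare sizes still work, they just take the slow arm

[MethodImpl(MethodImplOptions.NoInlining)]
static void Process(int size)
{
    // Without [SkipLocalsInit], compInitMem forces zero-init of the buffer.
    // With this change, Tier1 guards on the popular size:
    //   size == 100 ? LCLHEAP(100)  (unrolled SIMD zero-init)
    //               : LCLHEAP(size) (generic zero-init loop)
    Span<byte> buffer = stackalloc byte[size];
    Consume(buffer);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(Span<byte> buffer) { }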
Benchmark
Results
Speedup of PR vs main (> 1.00× = PR is faster, shown in bold). Sizes <= 32 bytes are intentionally not specialized (the variable-size loop is already cheap there). Raw runs: EgorBot/Benchmarks#197.
Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40 / 50 / 64 / 200 -> 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and parity-or-better elsewhere.