Implement UnknownSizeFrame for locals with unknown size#125491
jakobbotsch merged 13 commits into dotnet:main
Conversation
Implements a simple bump allocator for TYP_SIMD and TYP_MASK. Locals are allocated to this space when lvaIsUnknownSizeLocal is true for the variable. The frame is implemented on ARM64 as two homogeneous blocks containing either TYP_SIMD or TYP_MASK locals. The x19 register is reserved for addressing locals in the block. Updates codegen for SVE memory transfer instructions to accept indices in multiples of the vector length (or VL / 8 for masks) instead of deriving them from the size of the local.
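The bump-allocation scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual JIT code: the names `UnknownSizeFrame`, `Allocate`, `SimdLocal`, and `MaskLocal` are invented for this sketch. The key idea is that each local receives a slot index within its homogeneous block (a multiple of VL for vectors, VL/8 for masks), so no byte offset needs to be known at compile time.

```cpp
#include <cassert>

// Hypothetical sketch of the bump allocation: each TYP_SIMD local takes the
// next slot in the vector block and each TYP_MASK local the next slot in the
// mask block. Offsets are recorded as slot indices, i.e. multiples of VL
// (or VL/8 for masks), so byte offsets need not be known until runtime.
enum LocalKind { SimdLocal, MaskLocal };

struct UnknownSizeFrame {
    unsigned numSimdSlots = 0;
    unsigned numMaskSlots = 0;

    // Returns the slot index of the local within its homogeneous block.
    unsigned Allocate(LocalKind kind) {
        return (kind == SimdLocal) ? numSimdSlots++ : numMaskSlots++;
    }
};
```

At codegen time a slot index combines with the runtime vector length to form the real address, which is what lets the frame layout stay size-agnostic.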
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Looking at the throughput differences, the performance of
* Add function header
* Create UnknownSizeFrame::GetAddressingOffset and revert changes to lvaFrameAddress
* Use rsSetRegsModified and remove kill ref position
Can you post the detailed throughput analysis with per-function information?
Just checking which analysis you mean here? As the only detailed output I can find from
Sorry, I got my contributors confused. We have some tooling that can break throughput regressions/improvements down by JIT function, contributed by @SingleAccretion. However, it is x64-host only (based on Intel PIN). I collected the data on benchmarks.run_pgo and it looks like this:

```
Base: 86438410243, Diff: 86528734455, +0.1045%

33418435 : +20.56% : 36.23% : +0.0387% : public: void __cdecl Compiler::lvaAssignFrameOffsets(enum Compiler::FrameLayoutState)
15698295 : +15.34% : 17.02% : +0.0182% : protected: void __cdecl CodeGen::genFnProlog(void)
13341778 : +8.18% : 14.46% : +0.0154% : public: void __cdecl emitter::emitIns_R_S(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
11303530 : +4.61% : 12.25% : +0.0131% : public: void __cdecl Compiler::lvaAssignVirtualFrameOffsetsToLocals(void)
9485383 : +11.28% : 10.28% : +0.0110% : public: void __cdecl emitter::emitIns_S_R(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
5717661 : +7.00% : 6.20% : +0.0066% : protected: void __cdecl CodeGen::genCheckUseBlockInit(void)
471445 : +4.61% : 0.51% : +0.0005% : protected: void __cdecl CodeGen::genFinalizeFrame(void)
441974 : +0.28% : 0.48% : +0.0005% : public: __cdecl Compiler::Compiler(class ArenaAllocatorT<struct JitMemKindTraits> *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
377156 : +2.79% : 0.41% : +0.0004% : protected: void __cdecl CodeGen::genPushCalleeSavedRegisters(enum _regNumber_enum, bool *)
282867 : +4.91% : 0.31% : +0.0003% : private: void __cdecl LinearScan::setFrameType(void)
246151 : +4.35% : 0.27% : +0.0003% : protected: void __cdecl CodeGen::genZeroInitFrame(int, int, enum _regNumber_enum, bool *)
92508 : +0.04% : 0.10% : +0.0001% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessD(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
-96660 : -0.09% : 0.10% : -0.0001% : protected: void __cdecl JitExpandArray<unsigned char>::InitializeRange(unsigned int, unsigned int)
-660023 : -100.00% : 0.72% : -0.0008% : public: void __cdecl Compiler::funSetCurrentFunc(unsigned int)
```

These regressions are correspondingly larger in tier0 code where it matters more, but I think we can live with it, and if we really care we can address it in a follow-up. I pushed a merge to resolve the merge conflict.
No problem, thanks for this. I suppose we see fewer locals on frame at higher optimization level, so the impact isn't as strong?
UnknownSizeFrame: adds lvaIsAllocatedOnUnknownSizeFrame with a stronger criterion for which locals should or shouldn't be allocated in the unknown-size frame. Namely, promoted struct fields that are address exposed should not be allocated there, because the layout of the structure in memory needs to be preserved.
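The shape of that stronger check can be sketched as below. This is an illustrative stand-in, not the real lvaIsAllocatedOnUnknownSizeFrame (which would consult LclVarDsc state); the function and parameter names are invented for the sketch.

```cpp
#include <cassert>

// Hypothetical sketch of the check described above: a local may go on the
// unknown-size frame only if it has an unknown-size type (TYP_SIMD/TYP_MASK
// with scalable VL) and is not a field of an address-exposed promoted
// struct, since those must preserve the in-memory struct layout.
bool IsAllocatedOnUnknownSizeFrame(bool hasUnknownSizeType,
                                   bool isPromotedField,
                                   bool parentIsAddressExposed)
{
    if (!hasUnknownSizeType)
        return false;
    // Keep address-exposed promoted fields on the ordinary fixed-size frame
    // so the struct's layout in memory stays intact.
    if (isPromotedField && parentIsAddressExposed)
        return false;
    return true;
}
```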
…on the" This reverts commit 0a448e16be8a929311243e3172a8b0c8f7793969.
These locals need to be treated specially as they are allocated to a different part of the frame. Adds some assertions to the original accessors to prevent use with variable-sized locals.
```cpp
// Variable-sized locals reside in a different part of the stack frame.
unsigned varNum = lclNum;
```
This brings up OSR support for this kind of stack frame, which I hadn't yet run into. I suppose it's not possible to just skip over these kinds of variables. I would have to either disable OSR for the method, or add some support to allow for copying over the extra frame space as well?
OSR with unknown frame size runs into problems. Particularly, how do you address locals from the tier0 frame? You will approximately have a frame that looks like:

```
Tier0 locals
Tier0 vectors/masks
OSR locals        <- FP points here
OSR vectors/masks <- SP points here
```
It is not possible to address the tier0 locals via FP without some non-fixed offset encoding. You will need another frame pointer to do that.
I think it is reasonable to disable OSR for now in these functions (meaning that they will be tier1 compiled immediately). You will need to predict whether we are going to end up with unknown size locals, which may not be trivial.
We may need to support this eventually, OSR is important for our PGO and tiering strategy. cc @AndyAyersMS, he has thought about this in relation to localloc a lot.
I have reserved x19 for addressing the vectors/masks, will this make it easier to support OSR in future? So long as the compiler knows to copy the data and update x19 accordingly? The entire space is some N*VL in size from x19 --> sp.
> You will need to predict whether we are going to end up with unknown size locals
This sounds like it needs another pass over the IL, as the earliest we would know TYP_SIMD/TYP_MASK is used is on import of code. I am assuming this is too late to decide on whether OSR is possible?
Since OSR is only supported for JIT cases we can probably just compute the size of the "Tier0 vectors/masks" part based on the actual vector/mask size during JIT time. It seems like the most straightforward approach.
To answer your questions:
> I have reserved x19 for addressing the vectors/masks, will this make it easier to support OSR in future? So long as the compiler knows to copy the data and update x19 accordingly? The entire space is some N*VL in size from x19 --> sp.
Note that you cannot move this data around after its initial allocation since there can be pointers pointing to it.
I do not think having reserved x19 makes it much easier. The codegen side of two frame pointers is probably not that hard, but the rest of the VM is not set up to handle the possibility of having to address locals via separate frame pointers.
> This sounds like it needs another pass over the IL, as the earliest we would know TYP_SIMD/TYP_MASK is used is on import of code. I am assuming this is too late to decide on whether OSR is possible?
Yes, we currently only support switching very early:
(runtime/src/coreclr/jit/compiler.cpp, lines 6936 to 6978 at 0c9b431)
We haven't even imported the IR at this point, we have only done basic setup of the basic blocks. As part of that we do look at the IL though, but I am not sure how feasible it would be to predict whether we are going to end up with unknown size locals at this point. Perhaps a strategy where we reimported once we saw one and then switched to optimize code would work.
> Since OSR is only supported for JIT cases we can probably just compute the size of the "Tier0 vectors/masks" part based on the actual vector/mask size during JIT time. It seems like the most straightforward approach.
I agree, the only reason not to access the actual value of VL is when we are trying to compile portable size-agnostic code (for AOT). If an optimization only runs in JIT mode we should be able to take advantage of the situation and read VL.
> The codegen side of two frame pointers is probably not that hard, but the rest of the VM is not set up to handle the possibility of having to address locals via separate frame pointers.
I see, well the main reason for reserving the register is to make addressing modes play nicely. SVE addressing modes typically only accept offsets in multiples of VL, so caching a convenient base address that can reach vectors/masks with offsets in clean multiples of VL and VL/8 is useful.
It's not a necessity though, as the base of the unknown size frame should always be computable from SP/FP (it should always just be SP before any localloc). This just results in an extra instruction or two whenever a vector/mask is loaded from the stack, which should be a slow path anyway, because ideally well vectorized code shouldn't be spilling.
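The scaled-offset computation that the cached base enables can be sketched as below. This is a hedged illustration: the [-256, 255] index range is my assumption about SVE's `LDR Zt, [Xn, #imm, MUL VL]`-style immediate forms, and the function name is invented; the real emitter logic differs.

```cpp
#include <cassert>

// Sketch: convert a byte offset from the cached base register (x19) into
// the VL-scaled index that SVE load/store immediates expect. Vector
// accesses scale by VL, predicate (mask) accesses by VL/8. The assumed
// encodable index range of [-256, 255] is illustrative.
bool TryGetScaledIndex(int byteOffset, int unitSize, int* index)
{
    if (byteOffset % unitSize != 0)
        return false; // not a clean multiple; needs a materialized address
    int idx = byteOffset / unitSize;
    if (idx < -256 || idx > 255)
        return false; // outside the assumed immediate range
    *index = idx;
    return true;
}
```

Because each block holds only same-sized locals, offsets from the block base are clean multiples by construction, so the fast path always applies.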
So to recap: when jitting we'll never have unknown sized frames from SVE, and OSR is only needed when jitting, so there is no problem to solve?
Yes, we can determine the size of the 'unknown' size frame when jitting which should allow us to solve OSR in future. It should be as simple as reading the size of Vector<T> from the EE and multiplying it by the number of vectors in the frame, as the EE executes the instruction rdvl already and patches the method table with the result.
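That JIT-time computation reduces to simple arithmetic, sketched below. The function and parameter names are illustrative: `vectorByteSize` stands in for whatever EE query reports the concrete sizeof(Vector<T>) after the runtime has executed rdvl.

```cpp
#include <cassert>

// Hedged sketch: when jitting (as opposed to AOT), the concrete vector byte
// size is available from the EE, so the "unknown" size frame has a knowable
// size. A mask (predicate) occupies VL/8 bytes.
unsigned Tier0UnknownFrameSize(unsigned numVectors, unsigned numMasks,
                               unsigned vectorByteSize)
{
    return numVectors * vectorByteSize + numMasks * (vectorByteSize / 8);
}
```

For example, with a 256-bit (32-byte) VL, a frame holding 3 vectors and 2 masks occupies 3*32 + 2*4 = 104 bytes.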
> So to recap: when jitting we'll never have unknown sized frames from SVE, and OSR is only needed when jitting, so there is no problem to solve?
I think so.
Eventually we will need to solve the "GC pointers at VL-offset dependent locations in the stack frame" problem.
I wonder if we can solve this problem and the fact that escape analysis wants something similar at the same time. IIRC you discussed the possibility of a separate dynamic stack with @davidwrighton before, maybe it would be the way to go here since we could allocate the unknown size frame there.
That would potentially also make the OSR case a bit more natural since now all fixed size locals are next to each other in the OSR method.
I'm not seeing any test failures, other than cancellations, at the moment. Are these likely to be caused by the patch, or are they unrelated?
```cpp
if (m_compiler->compUsesUnknownSizeFrame)
{
    genUnknownSizeFrame();
}
```
I think this is too early in the prolog. At this point we are still emitting prolog unwind info. I would not expect that we will want to emit any unwind for this adjustment.
This adjustment should happen after unwindEndProlog in genFnProlog.
This PR does not seem to handle zeroing of these locals. Do you expect to do that in a follow-up?
> This PR does not seem to handle zeroing of these locals. Do you expect to do that in a follow-up?
Yes, I think I should take a similar approach to genPoisonFrame and iterate over this frame in genCodeForBlock. Possibly best to cover both in one patch.
Can you open a tracking issue (or modify an existing one, if there is one) to make sure we don't lose track of these TODOs?
I've added bullets here for spill temps and initialization: #120599
There are some x64 throughput regressions, which is unexpected to me.
* Move genUnknownSizeFrame call after generating unwind info
* Remove lvaIsUnknownSizeLocal body from builds for other architectures