Implement UnknownSizeFrame for locals with unknown size#125491
jakobbotsch merged 13 commits into dotnet:main
Conversation
Implements a simple bump allocator for TYP_SIMD and TYP_MASK. Locals are allocated to this space when lvaIsUnknownSizeLocal is true for the variable. The frame is implemented on ARM64 as two homogeneous blocks containing either TYP_SIMD or TYP_MASK locals. The x19 register is reserved for addressing locals in the block. Updates codegen for SVE memory transfer instructions to accept indices in multiples of the vector length (or VL / 8 for masks) instead of deriving them from the size of the local.
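The bump-allocation scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual JIT code: the names `UnknownSizeFrame`, `Allocate`, `SimdLocal`, and `MaskLocal` are invented for this sketch. The key idea is that each local receives a slot index within its homogeneous block (a multiple of VL for vectors, VL/8 for masks), so no byte offset needs to be known at compile time.

```cpp
#include <cassert>

// Hypothetical sketch of the bump allocation: each TYP_SIMD local takes the
// next slot in the vector block and each TYP_MASK local the next slot in the
// mask block. Offsets are recorded as slot indices, i.e. multiples of VL
// (or VL/8 for masks), so byte offsets need not be known until runtime.
enum LocalKind { SimdLocal, MaskLocal };

struct UnknownSizeFrame {
    unsigned numSimdSlots = 0;
    unsigned numMaskSlots = 0;

    // Returns the slot index of the local within its homogeneous block.
    unsigned Allocate(LocalKind kind) {
        return (kind == SimdLocal) ? numSimdSlots++ : numMaskSlots++;
    }
};
```

At codegen time a slot index combines with the runtime vector length to form the real address, which is what lets the frame layout stay size-agnostic.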
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Looking at the throughput differences, the performance of
* Add function header
* Create UnknownSizeFrame::GetAddressingOffset and revert changes to lvaFrameAddress
* Use rsSetRegsModified and remove kill ref position
Can you post the detailed throughput analysis with per-function information?
Just checking which analysis you mean here? As the only detailed output I can find from
Sorry, I got my contributors confused. We have some tooling that can break throughput regressions/improvements down by JIT function, contributed by @SingleAccretion. However, it is x64-host only (based on Intel PIN). I collected the data on benchmarks.run_pgo and it looks like this:

```
Base: 86438410243, Diff: 86528734455, +0.1045%

33418435 : +20.56% : 36.23% : +0.0387% : public: void __cdecl Compiler::lvaAssignFrameOffsets(enum Compiler::FrameLayoutState)
15698295 : +15.34% : 17.02% : +0.0182% : protected: void __cdecl CodeGen::genFnProlog(void)
13341778 : +8.18% : 14.46% : +0.0154% : public: void __cdecl emitter::emitIns_R_S(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
11303530 : +4.61% : 12.25% : +0.0131% : public: void __cdecl Compiler::lvaAssignVirtualFrameOffsetsToLocals(void)
9485383 : +11.28% : 10.28% : +0.0110% : public: void __cdecl emitter::emitIns_S_R(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
5717661 : +7.00% : 6.20% : +0.0066% : protected: void __cdecl CodeGen::genCheckUseBlockInit(void)
471445 : +4.61% : 0.51% : +0.0005% : protected: void __cdecl CodeGen::genFinalizeFrame(void)
441974 : +0.28% : 0.48% : +0.0005% : public: __cdecl Compiler::Compiler(class ArenaAllocatorT<struct JitMemKindTraits> *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
377156 : +2.79% : 0.41% : +0.0004% : protected: void __cdecl CodeGen::genPushCalleeSavedRegisters(enum _regNumber_enum, bool *)
282867 : +4.91% : 0.31% : +0.0003% : private: void __cdecl LinearScan::setFrameType(void)
246151 : +4.35% : 0.27% : +0.0003% : protected: void __cdecl CodeGen::genZeroInitFrame(int, int, enum _regNumber_enum, bool *)
92508 : +0.04% : 0.10% : +0.0001% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessD(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
-96660 : -0.09% : 0.10% : -0.0001% : protected: void __cdecl JitExpandArray<unsigned char>::InitializeRange(unsigned int, unsigned int)
-660023 : -100.00% : 0.72% : -0.0008% : public: void __cdecl Compiler::funSetCurrentFunc(unsigned int)
```

These regressions are correspondingly larger in tier0 code where it matters more, but I think we can live with it, and if we really care we can address it in a follow-up. I pushed a merge to resolve the merge conflict.
No problem, thanks for this. I suppose we see fewer locals on frame at higher optimization level, so the impact isn't as strong?
UnknownSizeFrame: adds lvaIsAllocatedOnUnknownSizeFrame with a stronger criterion for which locals should or shouldn't be allocated in the unknown-size frame. Namely, promoted struct fields that are address exposed should not be allocated there, because the layout of the structure in memory needs to be preserved.
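The shape of that stronger check can be sketched as below. This is an illustrative stand-in, not the real lvaIsAllocatedOnUnknownSizeFrame (which would consult LclVarDsc state); the function and parameter names are invented for the sketch.

```cpp
#include <cassert>

// Hypothetical sketch of the check described above: a local may go on the
// unknown-size frame only if it has an unknown-size type (TYP_SIMD/TYP_MASK
// with scalable VL) and is not a field of an address-exposed promoted
// struct, since those must preserve the in-memory struct layout.
bool IsAllocatedOnUnknownSizeFrame(bool hasUnknownSizeType,
                                   bool isPromotedField,
                                   bool parentIsAddressExposed)
{
    if (!hasUnknownSizeType)
        return false;
    // Keep address-exposed promoted fields on the ordinary fixed-size frame
    // so the struct's layout in memory stays intact.
    if (isPromotedField && parentIsAddressExposed)
        return false;
    return true;
}
```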
…on the" This reverts commit 0a448e16be8a929311243e3172a8b0c8f7793969.
These locals need to be treated specially as they are allocated to a different part of the frame. Adds some assertions to the original accessors to prevent use with variable-sized locals.
```cpp
// Variable-sized locals reside in a different part of the stack frame.
unsigned varNum = lclNum;
```
This brings up OSR support for this kind of stack frame, which I hadn't yet run into. I suppose it's not possible to just skip over these kinds of variables. I would have to either disable OSR for the method, or add some support to allow for copying over the extra frame space as well?
OSR with unknown frame size runs into problems. Particularly, how do you address locals from the tier0 frame? You will approximately have a frame that looks like:

```
Tier0 locals
Tier0 vectors/masks
OSR locals        <- FP points here
OSR vectors/masks <- SP points here
```
It is not possible to address the tier0 locals via FP without some non-fixed offset encoding. You will need another frame pointer to do that.
I think it is reasonable to disable OSR for now in these functions (meaning that they will be tier1 compiled immediately). You will need to predict whether we are going to end up with unknown size locals, which may not be trivial.
We may need to support this eventually, OSR is important for our PGO and tiering strategy. cc @AndyAyersMS, he has thought about this in relation to localloc a lot.
I have reserved x19 for addressing the vectors/masks, will this make it easier to support OSR in future? So long as the compiler knows to copy the data and update x19 accordingly? The entire space is some N*VL in size from x19 --> sp.
> You will need to predict whether we are going to end up with unknown size locals
This sounds like it needs another pass over the IL, as the earliest we would know TYP_SIMD/TYP_MASK is used is on import of code. I am assuming this is too late to decide on whether OSR is possible?
Since OSR is only supported for JIT cases we can probably just compute the size of the "Tier0 vectors/masks" part based on the actual vector/mask size during JIT time. It seems like the most straightforward approach.
To answer your questions:
> I have reserved x19 for addressing the vectors/masks, will this make it easier to support OSR in future? So long as the compiler knows to copy the data and update x19 accordingly? The entire space is some N*VL in size from x19 --> sp.
Note that you cannot move this data around after its initial allocation since there can be pointers pointing to it.
I do not think having reserved x19 makes it much easier. The codegen side of two frame pointers is probably not that hard, but the rest of the VM is not set up to handle the possibility of having to address locals via separate frame pointers.
> This sounds like it needs another pass over the IL, as the earliest we would know TYP_SIMD/TYP_MASK is used is on import of code. I am assuming this is too late to decide on whether OSR is possible?
Yes, we currently only support switching very early:
(runtime/src/coreclr/jit/compiler.cpp, lines 6936 to 6978 at 0c9b431)
We haven't even imported the IR at this point, we have only done basic setup of the basic blocks. As part of that we do look at the IL though, but I am not sure how feasible it would be to predict whether we are going to end up with unknown size locals at this point. Perhaps a strategy where we reimported once we saw one and then switched to optimize code would work.
> Since OSR is only supported for JIT cases we can probably just compute the size of the "Tier0 vectors/masks" part based on the actual vector/mask size during JIT time. It seems like the most straightforward approach.
I agree, the only reason not to access the actual value of VL is when we are trying to compile portable size-agnostic code (for AOT). If an optimization only runs in JIT mode we should be able to take advantage of the situation and read VL.
> The codegen side of two frame pointers is probably not that hard, but the rest of the VM is not set up to handle the possibility of having to address locals via separate frame pointers.
I see, well the main reason for reserving the register is to make addressing modes play nicely. SVE addressing modes typically only accept offsets in multiples of VL, so caching a convenient base address that can reach vectors/masks with offsets in clean multiples of VL and VL/8 is useful.
It's not a necessity though, as the base of the unknown size frame should always be computable from SP/FP (it should always just be SP before any localloc). This just results in an extra instruction or two whenever a vector/mask is loaded from the stack, which should be a slow path anyway, because ideally well vectorized code shouldn't be spilling.
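The scaled-offset computation that the cached base enables can be sketched as below. This is a hedged illustration: the [-256, 255] index range is my assumption about SVE's `LDR Zt, [Xn, #imm, MUL VL]`-style immediate forms, and the function name is invented; the real emitter logic differs.

```cpp
#include <cassert>

// Sketch: convert a byte offset from the cached base register (x19) into
// the VL-scaled index that SVE load/store immediates expect. Vector
// accesses scale by VL, predicate (mask) accesses by VL/8. The assumed
// encodable index range of [-256, 255] is illustrative.
bool TryGetScaledIndex(int byteOffset, int unitSize, int* index)
{
    if (byteOffset % unitSize != 0)
        return false; // not a clean multiple; needs a materialized address
    int idx = byteOffset / unitSize;
    if (idx < -256 || idx > 255)
        return false; // outside the assumed immediate range
    *index = idx;
    return true;
}
```

Because each block holds only same-sized locals, offsets from the block base are clean multiples by construction, so the fast path always applies.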
So to recap: when jitting we'll never have unknown sized frames from SVE, and OSR is only needed when jitting, so there is no problem to solve?
Yes, we can determine the size of the 'unknown' size frame when jitting which should allow us to solve OSR in future. It should be as simple as reading the size of Vector<T> from the EE and multiplying it by the number of vectors in the frame, as the EE executes the instruction rdvl already and patches the method table with the result.
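That JIT-time computation reduces to simple arithmetic, sketched below. The function and parameter names are illustrative: `vectorByteSize` stands in for whatever EE query reports the concrete sizeof(Vector<T>) after the runtime has executed rdvl.

```cpp
#include <cassert>

// Hedged sketch: when jitting (as opposed to AOT), the concrete vector byte
// size is available from the EE, so the "unknown" size frame has a knowable
// size. A mask (predicate) occupies VL/8 bytes.
unsigned Tier0UnknownFrameSize(unsigned numVectors, unsigned numMasks,
                               unsigned vectorByteSize)
{
    return numVectors * vectorByteSize + numMasks * (vectorByteSize / 8);
}
```

For example, with a 256-bit (32-byte) VL, a frame holding 3 vectors and 2 masks occupies 3*32 + 2*4 = 104 bytes.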
> So to recap: when jitting we'll never have unknown sized frames from SVE, and OSR is only needed when jitting, so there is no problem to solve?
I think so.
Eventually we will need to solve the "GC pointers at VL-offset dependent locations in the stack frame" problem.
I wonder if we can solve this problem and the fact that escape analysis wants something similar at the same time. IIRC you discussed the possibility of a separate dynamic stack with @davidwrighton before, maybe it would be the way to go here since we could allocate the unknown size frame there.
That would potentially also make the OSR case a bit more natural since now all fixed size locals are next to each other in the OSR method.
I'm not seeing any test failures, other than cancellations, at the moment. Are these likely to be caused by the patch, or are they unrelated?
```cpp
if (m_compiler->compUsesUnknownSizeFrame)
{
    genUnknownSizeFrame();
}
```
I think this is too early in the prolog. At this point we are still emitting prolog unwind info. I would not expect that we will want to emit any unwind for this adjustment.
This adjustment should happen after unwindEndProlog in genFnProlog.
This PR does not seem to handle zeroing of these locals. Do you expect to do that in a follow-up?
> This PR does not seem to handle zeroing of these locals. Do you expect to do that in a follow-up?
Yes, I think I should take a similar approach to genPoisonFrame and iterate over this frame in genCodeForBlock. Possibly best to cover both in one patch.
Can you open a tracking issue (or modify an existing one, if there is one) to make sure we don't lose track of these TODOs?
I've added bullets here for spill temps and initialization: #120599
There are some x64 throughput regressions, which is unexpected to me.
* Move genUnknownSizeFrame call after generating unwind info
* Remove lvaIsUnknownSizeLocal body from builds for other architectures