Conversation
Add IL-generated kernels for np.where using runtime code generation:

- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:

- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:

- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)

Falls back to iterator path for:

- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:

- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
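For reference, the elementwise selection these kernels implement matches NumPy's np.where semantics; a minimal pure-Python sketch (the function name `where_scalar` is illustrative, not NumSharp API):

```python
def where_scalar(cond, x, y):
    """Reference semantics of np.where's elementwise select: x[i] if cond[i] else y[i]."""
    return [xi if c else yi for c, xi, yi in zip(cond, x, y)]

cond = [True, False, True, False]
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
print(where_scalar(cond, x, y))  # [1, 20, 3, 40]
```

The SIMD kernels compute exactly this, but one vector of lanes at a time via ConditionalSelect instead of one element at a time.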
Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)

Scaling
How It Works

Replaced scalar conditional mask creation with single-instruction SIMD expansion:

```csharp
// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
    bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);

// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64(); // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
```
All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.
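Putting the "before" and "after" together, such a mask-expansion helper with a scalar fallback might look like the following sketch (the method name `CreateMask64` is illustrative, not the NumSharp implementation; assumes .NET 7+ intrinsics):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector256<ulong> CreateMask64(byte* bools)
{
    if (Avx2.IsSupported)
    {
        // Zero-extend 4 condition bytes to 4 ulong lanes (vpmovzxbq),
        // then compare against zero to get all-ones/all-zeros lanes.
        var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
        var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64();
        return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
    }
    // Scalar fallback: one conditional per lane
    return Vector256.Create(
        bools[0] != 0 ? ulong.MaxValue : 0ul,
        bools[1] != 0 ? ulong.MaxValue : 0ul,
        bools[2] != 0 ? ulong.MaxValue : 0ul,
        bools[3] != 0 ? ulong.MaxValue : 0ul);
}
```

The resulting mask feeds straight into Vector256.ConditionalSelect, which picks lanes from x where the mask is all-ones and from y where it is all-zeros.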
Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:
What Changed

Instead of emitting Call opcodes to mask helper methods, the AVX2/SSE4.1 instruction sequences are now emitted directly inline in the IL stream. This eliminates:

- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in the hot path
- JIT optimization barriers at call boundaries
The kernel now processes 2,083 million elements per second - significantly faster than NumPy's ~540 M elements/s.
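The throughput figures follow directly from the benchmark times quoted below (1M elements divided by elapsed time):

```python
elements = 1_000_000
inlined_ms = 0.48   # inlined-IL kernel time for 1M doubles
numpy_ms = 1.86     # NumPy baseline for the same workload

# Throughput in millions of elements per second: N / (t in seconds) / 1e6
kernel_meps = elements / (inlined_ms * 1e-3) / 1e6
numpy_meps = elements / (numpy_ms * 1e-3) / 1e6

print(f"{kernel_meps:.0f} M elem/s")        # ~2083
print(f"{numpy_meps:.0f} M elem/s")         # ~538
print(f"{kernel_meps / numpy_meps:.2f}x")   # ~3.9x
```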
Replace scalar conditional mask creation with SIMD intrinsics:

V256 mask creation (for AVX2):

- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):

- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store, and ConditionalSelect methods that were failing because these are generic method definitions requiring special handling.

Performance (1M double elements):

- Kernel: 2.6 ms @ ~381 M elements/s
- NumPy baseline: ~1.86 ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.
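The reflection fix mentioned above stems from these methods being generic method definitions: a plain GetMethod lookup by parameter types fails, and the open definition must be closed over the concrete element type first. A sketch of the general pattern (assuming .NET 7+ APIs; not the exact NumSharp code):

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Runtime.Intrinsics;

// Vector256.ConditionalSelect<T> is a generic method definition, so it cannot
// be looked up directly by closed parameter types. Find the open definition,
// then instantiate it for the kernel's element type before emitting a Call.
static MethodInfo GetConditionalSelect(Type elementType)
{
    MethodInfo open = typeof(Vector256)
        .GetMethods(BindingFlags.Public | BindingFlags.Static)
        .First(m => m.Name == "ConditionalSelect" && m.IsGenericMethodDefinition);
    return open.MakeGenericMethod(elementType); // e.g. ConditionalSelect<double>
}
```

The same pattern applies to the Load and Store lookups, which are likewise open generic methods on the Vector128/Vector256 static classes.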
Instead of emitting Call opcodes to mask helper methods, now emit the AVX2/SSE4.1 instructions directly inline in the IL stream. This eliminates:

- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:

- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):

- Previous (method call): 2.6 ms
- Inlined IL: 0.48 ms (5.4x faster)
- NumPy baseline: 1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt* which are extension methods on Vector128/Vector256 static classes, not instance methods.
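Emitting the 8-byte sequence above inline might look roughly like this sketch (hypothetical helper, not the NumSharp generator; the condition pointer is assumed to already be on the evaluation stack):

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Emit: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64
//       → AsUInt64 → GreaterThan, leaving a Vector256<ulong> mask on the stack.
static void EmitMask64(ILGenerator il)
{
    il.Emit(OpCodes.Ldind_U4); // load 4 condition bytes as a uint
    il.Emit(OpCodes.Call, typeof(Vector128)
        .GetMethod("CreateScalar", new[] { typeof(uint) }));
    il.Emit(OpCodes.Call, typeof(Vector128)
        .GetMethod("AsByte").MakeGenericMethod(typeof(uint)));
    il.Emit(OpCodes.Call, typeof(Avx2)
        .GetMethod("ConvertToVector256Int64", new[] { typeof(Vector128<byte>) }));
    il.Emit(OpCodes.Call, typeof(Vector256)
        .GetMethod("AsUInt64").MakeGenericMethod(typeof(long)));
    il.Emit(OpCodes.Call, typeof(Vector256<ulong>)
        .GetProperty("Zero").GetGetMethod());
    il.Emit(OpCodes.Call, typeof(Vector256)
        .GetMethod("GreaterThan").MakeGenericMethod(typeof(ulong)));
}
```

Because the intrinsic calls are emitted as direct Call targets inside the DynamicMethod body, the JIT recognizes them and lowers each to its single machine instruction, with no call frames or IsSupported branches left in the hot loop.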
Summary
Adds IL-generated SIMD kernels for np.where(condition, x, y), using DynamicMethod to generate type-specific kernels at runtime.

Implementation

- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): returns the cached IL-generated kernel
- WhereExecute<T>(): main entry point with automatic fallback

Eligibility for SIMD Path

Falls back to iterator path for:

- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Test Plan

- 26 tests in test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs
Closes #604