[API] Support np.where via ILKernelGenerator #606

Open
Nucs wants to merge 3 commits into master from np_where

Conversation

@Nucs
Member

@Nucs Nucs commented Apr 12, 2026

Summary

  • Add IL-generated SIMD optimization for np.where(condition, x, y)
  • Uses DynamicMethod to generate type-specific kernels at runtime
  • Vector256/Vector128.ConditionalSelect for SIMD element selection
  • 4x loop unrolling for instruction-level parallelism
  • Native long indexing for large arrays
  • Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
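As a reference for the semantics the kernel must reproduce, here is the NumPy behaviour being matched (Python, for illustration only — the actual kernels are IL-generated C#):

```python
import numpy as np

# Reference semantics the kernel must reproduce:
#   result[i] = condition[i] ? x[i] : y[i]
cond = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])
result = np.where(cond, x, y)
print(result.tolist())  # [1, 20, 3, 40]
```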

Implementation

| Component | Description |
| --- | --- |
| `WhereKernel<T>` | Delegate type for IL-generated kernels |
| `GetWhereKernel<T>()` | Gets or generates the cached kernel |
| `WhereExecute<T>()` | Main entry with automatic fallback |
| Mask creation | Grouped by element size (1/2/4/8 bytes) |

Eligibility for SIMD Path

```csharp
bool canUseKernel = ILKernelGenerator.Enabled &&
                    cond.typecode == NPTypeCode.Boolean &&
                    cond.Shape.IsContiguous &&
                    xArr.Shape.IsContiguous &&
                    yArr.Shape.IsContiguous;
```

Falls back to iterator path for:

  • Non-contiguous/broadcasted arrays
  • Non-bool conditions (need truthiness conversion)
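The dispatch check can be sketched as follows (a Python/NumPy illustration of the same conditions; `can_use_kernel` and `kernel_enabled` are hypothetical names, not part of the NumSharp API):

```python
import numpy as np

def can_use_kernel(cond, x, y, kernel_enabled=True):
    # Mirrors the canUseKernel check: enabled flag, bool condition,
    # and contiguous memory for all three operands.
    return bool(kernel_enabled
                and cond.dtype == np.bool_
                and cond.flags["C_CONTIGUOUS"]
                and x.flags["C_CONTIGUOUS"]
                and y.flags["C_CONTIGUOUS"])

cond = np.array([True, False, True])
x = np.arange(3)
print(can_use_kernel(cond, x, x))                  # True
print(can_use_kernel(cond.astype(np.int8), x, x))  # False: non-bool condition
print(can_use_kernel(cond, x[::2], x))             # False: non-contiguous x
```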

Test Plan

  • 26 new WhereSimdTests for SIMD correctness
  • 36 existing np_where_Test pass
  • 21 battle tests pass
  • All 12 dtypes covered

Closes #604

…n, x, y)

Add IL-generated kernels for np.where using runtime code generation:
- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:
- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:
- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)
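The three-stage loop structure described above can be sketched in Python (illustration only — `VEC` stands in for the hardware vector width, and the slice-wise `np.where` for `ConditionalSelect`):

```python
import numpy as np

VEC = 4      # lanes per vector (stand-in for Vector256<double>)
UNROLL = 4   # the kernel processes 4 vectors per main-loop iteration

def where_kernel(cond, x, y):
    """Sketch of the generated kernel's loop structure:
    4x-unrolled vector body, single-vector remainder, scalar tail."""
    n = len(x)
    out = np.empty_like(x)
    i = 0
    # 4x unrolled SIMD loop: 4 vectors (16 lanes) per iteration
    while i + VEC * UNROLL <= n:
        for k in range(UNROLL):
            s = slice(i + k * VEC, i + (k + 1) * VEC)
            out[s] = np.where(cond[s], x[s], y[s])  # stands in for ConditionalSelect
        i += VEC * UNROLL
    # remainder SIMD loop: one vector at a time
    while i + VEC <= n:
        out[i:i + VEC] = np.where(cond[i:i + VEC], x[i:i + VEC], y[i:i + VEC])
        i += VEC
    # scalar tail for the last few elements
    while i < n:
        out[i] = x[i] if cond[i] else y[i]
        i += 1
    return out

n = 37  # exercises all three stages: 2 unrolled iterations, 1 remainder vector, 1 tail element
cond = np.arange(n) % 3 == 0
x = np.arange(n, dtype=float)
y = -x
print(np.array_equal(where_kernel(cond, x, y), np.where(cond, x, y)))  # True
```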

Falls back to iterator path for:
- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:
- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
@Nucs
Member Author

Nucs commented Apr 12, 2026

Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)

| Metric | Value |
| --- | --- |
| Kernel time | 2.62 ms |
| Throughput | 381 M elements/s |
| NumPy baseline | ~1.86 ms |
| Ratio vs NumPy | ~1.4x slower |

Scaling

| Size | Kernel (ms) | Throughput |
| --- | --- | --- |
| 1K | 0.0024 | 416 M/s |
| 10K | 0.027 | 368 M/s |
| 100K | 0.28 | 356 M/s |
| 1M | 2.62 | 381 M/s |
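The throughput column follows directly from element count divided by kernel time, e.g. for the 1M-element row:

```python
# Throughput = elements / kernel time; 1M-element row of the scaling table
size = 1_000_000
time_ms = 2.62
throughput_m_per_s = size / (time_ms / 1000) / 1e6
print(round(throughput_m_per_s))  # 382 (matches the ~381 M/s in the table)
```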

How It Works

Replaced scalar conditional mask creation with single-instruction SIMD expansion:

```csharp
// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
    bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);

// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64();  // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
```
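The same expansion can be emulated in NumPy to show what the intrinsic computes (illustration only — the real code does the zero-extension in a single vpmovzxbq instruction):

```python
import numpy as np

# Emulates the AVX2 vpmovzxbq + compare sequence on 4 bool bytes:
bools = np.array([1, 0, 1, 0], dtype=np.uint8)
expanded = bools.astype(np.uint64)  # zero-extend bytes -> qwords (vpmovzxbq)
mask = np.where(expanded > 0,
                np.uint64(0xFFFFFFFFFFFFFFFF),  # all-ones lane selects x
                np.uint64(0))                    # all-zeros lane selects y
print([hex(int(m)) for m in mask])
# ['0xffffffffffffffff', '0x0', '0xffffffffffffffff', '0x0']
```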
| Element size | Intrinsic | Effect |
| --- | --- | --- |
| 8 bytes | vpmovzxbq | 4 bytes → 4 qwords |
| 4 bytes | vpmovzxbd | 8 bytes → 8 dwords |
| 2 bytes | vpmovzxbw | 16 bytes → 16 words |

All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.

@Nucs
Member Author

Nucs commented Apr 12, 2026

Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:

| Version | Kernel time | vs NumPy |
| --- | --- | --- |
| With method call | 2.6 ms | 1.4x slower |
| Inlined IL | 0.48 ms | 3.9x faster |
| NumPy | 1.86 ms | baseline |
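The ratios in the table follow from the measured times:

```python
# Speedup figures derived from the measured kernel times (ms):
numpy_ms, call_ms, inline_ms = 1.86, 2.6, 0.48
print(round(call_ms / numpy_ms, 1))    # 1.4  (method-call version vs NumPy)
print(round(numpy_ms / inline_ms, 1))  # 3.9  (inlined IL vs NumPy)
print(round(call_ms / inline_ms, 1))   # 5.4  (speedup from inlining alone)
```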

What Changed

Instead of emitting Call opcodes to mask helper methods, the IL now emits the full AVX2 instruction sequence inline:

```
ldind.u4           ; Load 4 bool bytes
call CreateScalar  ; Put in Vector128
call AsByte        ; Reinterpret
call vpmovzxbq     ; AVX2 zero-extend bytes to qwords
call AsUInt64      ; Reinterpret
call get_Zero      ; Vector256<ulong>.Zero
call GreaterThan   ; Create mask
```

This eliminates:

  • Method call overhead (~12%)
  • Runtime Avx2.IsSupported checks in hot path
  • JIT optimization barriers at call boundaries

The kernel now processes roughly 2,083 million elements per second (1M elements in 0.48 ms) - significantly faster than NumPy's ~540 M elements/s.

Nucs added 2 commits April 12, 2026 15:03
Replace scalar conditional mask creation with SIMD intrinsics:

V256 mask creation (for AVX2):
- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):
- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single
zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store,
and ConditionalSelect methods that were failing because these are
generic method definitions requiring special handling.

Performance (1M double elements):
- Kernel: 2.6ms @ 381 M elements/ms
- NumPy baseline: ~1.86ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.

Instead of emitting Call opcodes to mask helper methods, now emit
the AVX2/SSE4.1 instructions directly inline in the IL stream.

This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:
- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):
- Previous (method call): 2.6 ms
- Inlined IL:             0.48 ms (5.4x faster)
- NumPy baseline:         1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt* which are extension
methods on Vector128/Vector256 static classes, not instance methods.