[API] Support np.where via ILKernelGenerator #606

Open
Nucs wants to merge 3 commits into master from np_where

Conversation

@Nucs
Member

@Nucs Nucs commented Apr 12, 2026

Summary

  • Add IL-generated SIMD optimization for np.where(condition, x, y)
  • Uses DynamicMethod to generate type-specific kernels at runtime
  • Vector256/Vector128.ConditionalSelect for SIMD element selection
  • 4x loop unrolling for instruction-level parallelism
  • Native long indexing for large arrays
  • Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
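As a reference for the semantics the kernel must reproduce, here is the NumPy behaviour being matched (Python, for illustration only — the actual kernels are IL-generated C#):

```python
import numpy as np

# Reference semantics the kernel must reproduce:
#   result[i] = condition[i] ? x[i] : y[i]
cond = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])
result = np.where(cond, x, y)
print(result.tolist())  # [1, 20, 3, 40]
```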

Implementation

| Component | Description |
| --- | --- |
| `WhereKernel<T>` | Delegate type for IL-generated kernels |
| `GetWhereKernel<T>()` | Gets or generates the cached kernel |
| `WhereExecute<T>()` | Main entry with automatic fallback |
| Mask creation | Grouped by element size (1/2/4/8 bytes) |

Eligibility for SIMD Path

```csharp
bool canUseKernel = ILKernelGenerator.Enabled &&
                    cond.typecode == NPTypeCode.Boolean &&
                    cond.Shape.IsContiguous &&
                    xArr.Shape.IsContiguous &&
                    yArr.Shape.IsContiguous;
```

Falls back to iterator path for:

  • Non-contiguous/broadcasted arrays
  • Non-bool conditions (need truthiness conversion)
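The dispatch check can be sketched as follows (a Python/NumPy illustration of the same conditions; `can_use_kernel` and `kernel_enabled` are hypothetical names, not part of the NumSharp API):

```python
import numpy as np

def can_use_kernel(cond, x, y, kernel_enabled=True):
    # Mirrors the canUseKernel check: enabled flag, bool condition,
    # and contiguous memory for all three operands.
    return bool(kernel_enabled
                and cond.dtype == np.bool_
                and cond.flags["C_CONTIGUOUS"]
                and x.flags["C_CONTIGUOUS"]
                and y.flags["C_CONTIGUOUS"])

cond = np.array([True, False, True])
x = np.arange(3)
print(can_use_kernel(cond, x, x))                  # True
print(can_use_kernel(cond.astype(np.int8), x, x))  # False: non-bool condition
print(can_use_kernel(cond, x[::2], x))             # False: non-contiguous x
```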

Test Plan

  • 26 new WhereSimdTests for SIMD correctness
  • 36 existing np_where_Test pass
  • 21 battle tests pass
  • All 12 dtypes covered

Closes #604

…n, x, y)

Add IL-generated kernels for np.where using runtime code generation:
- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:
- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:
- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)
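The three-stage loop structure described above can be sketched in Python (illustration only — `VEC` stands in for the hardware vector width, and the slice-wise `np.where` for `ConditionalSelect`):

```python
import numpy as np

VEC = 4      # lanes per vector (stand-in for Vector256<double>)
UNROLL = 4   # the kernel processes 4 vectors per main-loop iteration

def where_kernel(cond, x, y):
    """Sketch of the generated kernel's loop structure:
    4x-unrolled vector body, single-vector remainder, scalar tail."""
    n = len(x)
    out = np.empty_like(x)
    i = 0
    # 4x unrolled SIMD loop: 4 vectors (16 lanes) per iteration
    while i + VEC * UNROLL <= n:
        for k in range(UNROLL):
            s = slice(i + k * VEC, i + (k + 1) * VEC)
            out[s] = np.where(cond[s], x[s], y[s])  # stands in for ConditionalSelect
        i += VEC * UNROLL
    # remainder SIMD loop: one vector at a time
    while i + VEC <= n:
        out[i:i + VEC] = np.where(cond[i:i + VEC], x[i:i + VEC], y[i:i + VEC])
        i += VEC
    # scalar tail for the last few elements
    while i < n:
        out[i] = x[i] if cond[i] else y[i]
        i += 1
    return out

n = 37  # exercises all three stages: 2 unrolled iterations, 1 remainder vector, 1 tail element
cond = np.arange(n) % 3 == 0
x = np.arange(n, dtype=float)
y = -x
print(np.array_equal(where_kernel(cond, x, y), np.where(cond, x, y)))  # True
```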

Falls back to iterator path for:
- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:
- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
@Nucs
Member Author

Nucs commented Apr 12, 2026

Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)

| Metric | Value |
| --- | --- |
| Kernel time | 2.62 ms |
| Throughput | 381 M elements/s |
| NumPy baseline | ~1.86 ms |
| Ratio vs NumPy | ~1.4x slower |

Scaling

| Size | Kernel (ms) | Throughput |
| --- | --- | --- |
| 1K | 0.0024 | 416 M/s |
| 10K | 0.027 | 368 M/s |
| 100K | 0.28 | 356 M/s |
| 1M | 2.62 | 381 M/s |
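The throughput column follows directly from element count divided by kernel time, e.g. for the 1M-element row:

```python
# Throughput = elements / kernel time; 1M-element row of the scaling table
size = 1_000_000
time_ms = 2.62
throughput_m_per_s = size / (time_ms / 1000) / 1e6
print(round(throughput_m_per_s))  # 382 (matches the ~381 M/s in the table)
```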

How It Works

Replaced scalar conditional mask creation with single-instruction SIMD expansion:

```csharp
// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
    bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);

// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64();  // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
```
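The same expansion can be emulated in NumPy to show what the intrinsic computes (illustration only — the real code does the zero-extension in a single vpmovzxbq instruction):

```python
import numpy as np

# Emulates the AVX2 vpmovzxbq + compare sequence on 4 bool bytes:
bools = np.array([1, 0, 1, 0], dtype=np.uint8)
expanded = bools.astype(np.uint64)  # zero-extend bytes -> qwords (vpmovzxbq)
mask = np.where(expanded > 0,
                np.uint64(0xFFFFFFFFFFFFFFFF),  # all-ones lane selects x
                np.uint64(0))                    # all-zeros lane selects y
print([hex(int(m)) for m in mask])
# ['0xffffffffffffffff', '0x0', '0xffffffffffffffff', '0x0']
```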
| Element size | Intrinsic | Effect |
| --- | --- | --- |
| 8 bytes | vpmovzxbq | 4 bytes → 4 qwords |
| 4 bytes | vpmovzxbd | 8 bytes → 8 dwords |
| 2 bytes | vpmovzxbw | 16 bytes → 16 words |

All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.

@Nucs
Member Author

Nucs commented Apr 12, 2026

Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:

| Version | Kernel time | vs NumPy |
| --- | --- | --- |
| With method call | 2.6 ms | 1.4x slower |
| Inlined IL | 0.48 ms | 3.9x faster |
| NumPy | 1.86 ms | baseline |
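The ratios in the table follow from the measured times:

```python
# Speedup figures derived from the measured kernel times (ms):
numpy_ms, call_ms, inline_ms = 1.86, 2.6, 0.48
print(round(call_ms / numpy_ms, 1))    # 1.4  (method-call version vs NumPy)
print(round(numpy_ms / inline_ms, 1))  # 3.9  (inlined IL vs NumPy)
print(round(call_ms / inline_ms, 1))   # 5.4  (speedup from inlining alone)
```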

What Changed

Instead of emitting Call opcodes to mask helper methods, the IL now emits the full AVX2 instruction sequence inline:

```
ldind.u4           ; Load 4 bool bytes
call CreateScalar  ; Put in Vector128
call AsByte        ; Reinterpret
call vpmovzxbq     ; AVX2 zero-extend bytes to qwords
call AsUInt64      ; Reinterpret
call get_Zero      ; Vector256<ulong>.Zero
call GreaterThan   ; Create mask
```

This eliminates:

  • Method call overhead (~12%)
  • Runtime Avx2.IsSupported checks in hot path
  • JIT optimization barriers at call boundaries

The kernel now processes roughly 2,083 million elements per second (1M elements in 0.48 ms) - significantly faster than NumPy's ~540 M elements/s.

Nucs added 2 commits April 12, 2026 15:03
Replace scalar conditional mask creation with SIMD intrinsics:

V256 mask creation (for AVX2):
- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):
- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single
zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store,
and ConditionalSelect methods that were failing because these are
generic method definitions requiring special handling.

Performance (1M double elements):
- Kernel: 2.6ms @ 381 M elements/ms
- NumPy baseline: ~1.86ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.

Instead of emitting Call opcodes to mask helper methods, now emit
the AVX2/SSE4.1 instructions directly inline in the IL stream.

This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:
- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):
- Previous (method call): 2.6 ms
- Inlined IL:             0.48 ms (5.4x faster)
- NumPy baseline:         1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt* which are extension
methods on Vector128/Vector256 static classes, not instance methods.