Commit 3639e99
[rocprofiler-systems] Decouple GPU device from AMD SMI types (#5334)
## Motivation
GPU `device.hpp` directly included `<amd_smi/amdsmi.h>` and used AMD SMI
types throughout (`amdsmi_gpu_metrics_t`, `amdsmi_processor_handle`,
`amdsmi_proc_info_t`, status codes). Even though the driver was
template-injected for testability, the AMD SMI types leaked into the
device class, making it impossible to compile and run GPU device unit
tests without the AMD SMI library installed.
This PR introduces a proper abstraction boundary so that `device.hpp`
and its tests have zero AMD SMI dependency, mirroring exactly the
pattern already established for the NIC collector in #5196.
## Technical Details
**Bridge pattern — `gpu_driver` owns the handle and abstracts all AMD
SMI types**
- Add GPU POD type (`asic_info`) to `gpu/types.hpp` (others fold into
the existing `metrics` POD).
- Create `gpu_driver.hpp` — production driver that owns the
`amdsmi_processor_handle`, calls AMD SMI internally, returns POD
`metrics` / `asic_info`, throws `std::runtime_error` on failure. AMD-SMI
status text is included in the message via
`amdsmi_status_code_to_string`.
- Create `tests/mock_gpu_driver.hpp` — test mock with zero AMD SMI
dependency.
- Simplify `device.hpp`:
- Drop `#include <amd_smi/amdsmi.h>`.
- Constructor takes `(driver, logical_index)` instead of `(driver,
handle, processor_type, index)`.
- All driver calls wrapped in try/catch with `LOG_DEBUG` on failure.
- Provider owns enumeration: replace `enumerate_gpu_devices<Device>`
(which used the old 4-arg constructor) with
`enumerate_gpu_devices<Device, GpuDriver>` and add
`get_gpu_devices<Device, GpuDriver>()`. This mirrors
`get_nic_devices<Device, NicDriver>()` introduced for NIC. The driver no
longer self-enumerates via a static method.
- Wire `gpu_traits::enumerate_devices` to call `provider->template
get_gpu_devices<device_t, driver_t>()` (was `gpu_driver_t::enumerate()`
bypassing the provider).
**Sentinel detection fixes**
The POD widens some fields from AMD-SMI uint16 to uint32/int64. With the
default `is_metric_supported` checking `value !=
std::numeric_limits<T>::max()`, the 16-bit `0xFFFF` sentinel was no
longer detected after widening (e.g., a sentineled
`current_socket_power` came through as 65535 and registered as
supported).
- Add `METRIC_VALUE_NOT_SUPPORTED_16` constant in `gpu/types.hpp`.
- Pass it explicitly in `device::initialize_supported_metrics` for
`current_socket_power`, `average_socket_power`, `hotspot_temperature`,
`edge_temperature`, `gfx_activity`, `umc_activity`, `mm_activity`.
- Use `populate_if_supported` per-field for
`xgmi.{link,data_acc.read[],data_acc.write[]}` and
`pcie.{link,bandwidth.{acc,inst}}` so unsupported scalars surface as 0
instead of leaking sentinels — matches develop behavior.
- Memory usage now also sentinel-checks the returned `uint64_t` (using
`METRIC_VALUE_NOT_SUPPORTED_64`) instead of trusting the absence of an
exception.
- Per-XCP `vcn_busy[]` / `jpeg_busy[]` arrays still pass through
verbatim so the processor layer can filter them — matches the contract
expected by the four sentinel-preservation tests from #5145.
**Files added:** `gpu_driver.hpp`, `tests/mock_gpu_driver.hpp`
**Files modified:** `device.hpp`, `gpu_traits.hpp`, `cache_policy.hpp`,
`perfetto_policy.hpp`, `types.hpp`, `tests/test_device.cpp`,
`device_providers/amd_smi/provider.hpp`
## JIRA ID
N/A
## Test Plan
- Build with `ROCPROFSYS_BUILD_AINIC=ON` — verify production code
compiles and links.
- Run full unit-test suite: `rocprof-sys-unit-tests` — verify all GPU
device tests pass plus the four restored sentinel-preservation tests
(`vcn_busy_collection_preserves_sentinels`,
`jpeg_busy_collection_preserves_sentinels`,
`vcn_activity_device_level_preserves_sentinels`,
`memory_usage_unsupported_sentinel_value`).
- Mock-based PMC validation across the 6-cell matrix (MI300X + Navi 48 ×
Perfetto Standard + Perfetto Legacy + RocPD) using the mock
configurations under `~/work/amd-smi-mock-configurations/`.
- Sentinel scan on every output (Perfetto and RocPD) — must report zero
leaks for scalar metrics.
- Range validation against the documented per-mock expected ranges.
## Test Result
- **Unit tests:** 512 / 512 pass on debug build (28 suites). Includes
the four restored sentinel-preservation tests.
- **PMC validation (6 runs):**
- Perfetto Standard / Legacy on MI300X and Navi 48: **0 sentinel
values** in PMC counters.
- RocPD on MI300X and Navi 48: **0 sentinel values** in scalar metrics.
VCN/JPEG array sentinels in RocPD are a pre-existing
`rocpd_processor.cpp` gap — unchanged by this PR.
- All configured metrics present with values in the expected ranges
(Power, Temperature, GFX/UMC/MM Busy, VCN, JPEG, PCIe, NIC RDMA stats,
etc.).
- **Architectural parity:** `git grep get_nic_devices get_gpu_devices`
shows symmetric provider API. `gpu_driver` and `nic_driver` follow the
same constructor/method/error shape.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---
**Note on base branch:** This PR is targeted at
`users/adjordje-amd/decouple-nic-device-from-amdsmi` (PR #5196's head)
because it depends on the NIC decouple work. Once #5196 merges to
develop, this PR will rebase onto develop with no conflicts.
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>1 parent faedb08 commit 3639e99
10 files changed
Lines changed: 1168 additions & 1815 deletions
File tree
- projects/rocprofiler-systems/source/lib/rocprof-sys/library/pmc
- collectors/gpu
- tests
- device_providers/amd_smi
Lines changed: 10 additions & 10 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
70 | | - | |
| 70 | + | |
71 | 71 | | |
72 | 72 | | |
73 | 73 | | |
| |||
78 | 78 | | |
79 | 79 | | |
80 | 80 | | |
81 | | - | |
| 81 | + | |
82 | 82 | | |
83 | 83 | | |
84 | 84 | | |
| |||
88 | 88 | | |
89 | 89 | | |
90 | 90 | | |
91 | | - | |
| 91 | + | |
92 | 92 | | |
93 | 93 | | |
94 | 94 | | |
| |||
101 | 101 | | |
102 | 102 | | |
103 | 103 | | |
104 | | - | |
| 104 | + | |
105 | 105 | | |
106 | 106 | | |
107 | 107 | | |
| |||
110 | 110 | | |
111 | 111 | | |
112 | 112 | | |
113 | | - | |
| 113 | + | |
114 | 114 | | |
115 | 115 | | |
116 | 116 | | |
| |||
207 | 207 | | |
208 | 208 | | |
209 | 209 | | |
210 | | - | |
| 210 | + | |
211 | 211 | | |
212 | 212 | | |
213 | 213 | | |
| |||
220 | 220 | | |
221 | 221 | | |
222 | 222 | | |
223 | | - | |
| 223 | + | |
224 | 224 | | |
225 | | - | |
| 225 | + | |
226 | 226 | | |
227 | 227 | | |
228 | 228 | | |
| |||
237 | 237 | | |
238 | 238 | | |
239 | 239 | | |
240 | | - | |
| 240 | + | |
241 | 241 | | |
242 | | - | |
| 242 | + | |
243 | 243 | | |
244 | 244 | | |
245 | 245 | | |
| |||
0 commit comments