Skip to content

Commit 3639e99

Browse files
[rocprofiler-systems] Decouple GPU device from AMD SMI types (#5334)
## Motivation GPU `device.hpp` directly included `<amd_smi/amdsmi.h>` and used AMD SMI types throughout (`amdsmi_gpu_metrics_t`, `amdsmi_processor_handle`, `amdsmi_proc_info_t`, status codes). Even though the driver was template-injected for testability, the AMD SMI types leaked into the device class, making it impossible to compile and run GPU device unit tests without the AMD SMI library installed. This PR introduces a proper abstraction boundary so that `device.hpp` and its tests have zero AMD SMI dependency, mirroring exactly the pattern already established for the NIC collector in #5196. ## Technical Details **Bridge pattern — `gpu_driver` owns the handle and abstracts all AMD SMI types** - Add GPU POD type (`asic_info`) to `gpu/types.hpp` (others fold into the existing `metrics` POD). - Create `gpu_driver.hpp` — production driver that owns the `amdsmi_processor_handle`, calls AMD SMI internally, returns POD `metrics` / `asic_info`, throws `std::runtime_error` on failure. AMD-SMI status text is included in the message via `amdsmi_status_code_to_string`. - Create `tests/mock_gpu_driver.hpp` — test mock with zero AMD SMI dependency. - Simplify `device.hpp`: - Drop `#include <amd_smi/amdsmi.h>`. - Constructor takes `(driver, logical_index)` instead of `(driver, handle, processor_type, index)`. - All driver calls wrapped in try/catch with `LOG_DEBUG` on failure. - Provider owns enumeration: replace `enumerate_gpu_devices<Device>` (which used the old 4-arg constructor) with `enumerate_gpu_devices<Device, GpuDriver>` and add `get_gpu_devices<Device, GpuDriver>()`. This mirrors `get_nic_devices<Device, NicDriver>()` introduced for NIC. The driver no longer self-enumerates via a static method. - Wire `gpu_traits::enumerate_devices` to call `provider->template get_gpu_devices<device_t, driver_t>()` (was `gpu_driver_t::enumerate()` bypassing the provider). **Sentinel detection fixes** The POD widens some fields from AMD-SMI uint16 to uint32/int64. With the default `is_metric_supported` checking `value != std::numeric_limits<T>::max()`, the 16-bit `0xFFFF` sentinel was no longer detected after widening (e.g., a sentineled `current_socket_power` came through as 65535 and registered as supported). - Add `METRIC_VALUE_NOT_SUPPORTED_16` constant in `gpu/types.hpp`. - Pass it explicitly in `device::initialize_supported_metrics` for `current_socket_power`, `average_socket_power`, `hotspot_temperature`, `edge_temperature`, `gfx_activity`, `umc_activity`, `mm_activity`. - Use `populate_if_supported` per-field for `xgmi.{link,data_acc.read[],data_acc.write[]}` and `pcie.{link,bandwidth.{acc,inst}}` so unsupported scalars surface as 0 instead of leaking sentinels — matches develop behavior. - Memory usage now also sentinel-checks the returned `uint64_t` (using `METRIC_VALUE_NOT_SUPPORTED_64`) instead of trusting the absence of an exception. - Per-XCP `vcn_busy[]` / `jpeg_busy[]` arrays still pass through verbatim so the processor layer can filter them — matches the contract expected by the four sentinel-preservation tests from #5145. **Files added:** `gpu_driver.hpp`, `tests/mock_gpu_driver.hpp` **Files modified:** `device.hpp`, `gpu_traits.hpp`, `cache_policy.hpp`, `perfetto_policy.hpp`, `types.hpp`, `tests/test_device.cpp`, `device_providers/amd_smi/provider.hpp` ## JIRA ID N/A ## Test Plan - Build with `ROCPROFSYS_BUILD_AINIC=ON` — verify production code compiles and links. - Run full unit-test suite: `rocprof-sys-unit-tests` — verify all GPU device tests pass plus the four restored sentinel-preservation tests (`vcn_busy_collection_preserves_sentinels`, `jpeg_busy_collection_preserves_sentinels`, `vcn_activity_device_level_preserves_sentinels`, `memory_usage_unsupported_sentinel_value`). - Mock-based PMC validation across the 6-cell matrix (MI300X + Navi 48 × Perfetto Standard + Perfetto Legacy + RocPD) using the mock configurations under `~/work/amd-smi-mock-configurations/`. - Sentinel scan on every output (Perfetto and RocPD) — must report zero leaks for scalar metrics. - Range validation against the documented per-mock expected ranges. ## Test Result - **Unit tests:** 512 / 512 pass on debug build (28 suites). Includes the four restored sentinel-preservation tests. - **PMC validation (6 runs):** - Perfetto Standard / Legacy on MI300X and Navi 48: **0 sentinel values** in PMC counters. - RocPD on MI300X and Navi 48: **0 sentinel values** in scalar metrics. VCN/JPEG array sentinels in RocPD are a pre-existing `rocpd_processor.cpp` gap — unchanged by this PR. - All configured metrics present with values in the expected ranges (Power, Temperature, GFX/UMC/MM Busy, VCN, JPEG, PCIe, NIC RDMA stats, etc.). - **Architectural parity:** `git grep get_nic_devices get_gpu_devices` shows symmetric provider API. `gpu_driver` and `nic_driver` follow the same constructor/method/error shape. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --- **Note on base branch:** This PR is targeted at `users/adjordje-amd/decouple-nic-device-from-amdsmi` (PR #5196's head) because it depends on the NIC decouple work. Once #5196 merges to develop, this PR will rebase onto develop with no conflicts. --------- Co-authored-by: David Galiffi <David.Galiffi@amd.com>
1 parent faedb08 commit 3639e99

10 files changed

Lines changed: 1168 additions & 1815 deletions

File tree

projects/rocprofiler-systems/source/lib/rocprof-sys/library/pmc/collectors/gpu/cache_policy.hpp

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,7 @@ struct cache_policy
6767
thread_id, "{}" });
6868

6969
auto add_vcn_track = [&](std::optional<int> xcp_idx) {
70-
for(int clk = 0; clk < AMDSMI_MAX_NUM_VCN; ++clk)
70+
for(size_t clk = 0; clk < MAX_NUM_VCN; ++clk)
7171
{
7272
auto name =
7373
trace_cache::info::format_track_name<category::amd_smi_vcn_activity>(
@@ -78,7 +78,7 @@ struct cache_policy
7878
};
7979

8080
auto add_jpeg_track = [&](std::optional<int> xcp_idx) {
81-
for(size_t clk = 0; clk < ROCPROFSYS_AMDSMI_JPEG_ENGINE_COUNT; ++clk)
81+
for(size_t clk = 0; clk < MAX_NUM_JPEG_V1; ++clk)
8282
{
8383
auto name =
8484
trace_cache::info::format_track_name<category::amd_smi_jpeg_activity>(
@@ -88,7 +88,7 @@ struct cache_policy
8888
}
8989
};
9090

91-
for(int xcp = 0; xcp < AMDSMI_MAX_NUM_XCP; ++xcp)
91+
for(size_t xcp = 0; xcp < MAX_NUM_XCP; ++xcp)
9292
{
9393
add_vcn_track(xcp);
9494
add_jpeg_track(xcp);
@@ -101,7 +101,7 @@ struct cache_policy
101101
{ trace_cache::info::format_track_name<category::amd_smi_xgmi_link_speed>(),
102102
thread_id, "{}" });
103103

104-
for(int vcn = 0; vcn < AMDSMI_MAX_NUM_VCN; ++vcn)
104+
for(size_t vcn = 0; vcn < MAX_NUM_VCN; ++vcn)
105105
{
106106
auto vcn_name =
107107
trace_cache::info::format_track_name<category::amd_smi_vcn_activity>(
@@ -110,7 +110,7 @@ struct cache_policy
110110
{ vcn_name.c_str(), thread_id, "{}" });
111111
}
112112

113-
for(int link = 0; link < AMDSMI_MAX_NUM_XGMI_LINKS; ++link)
113+
for(size_t link = 0; link < MAX_NUM_XGMI_LINKS; ++link)
114114
{
115115
auto read_name =
116116
trace_cache::info::format_track_name<category::amd_smi_xgmi_read_data>(
@@ -207,7 +207,7 @@ struct cache_policy
207207
COMPONENT, tim::units::mem_repr(tim::units::megabyte),
208208
rocprofsys::trace_cache::ABSOLUTE, BLOCK, EXPRESSION, 0, 0, "{}" });
209209

210-
for(int vcn = 0; vcn < AMDSMI_MAX_NUM_VCN; ++vcn)
210+
for(size_t vcn = 0; vcn < MAX_NUM_VCN; ++vcn)
211211
{
212212
auto vcn_name =
213213
trace_cache::info::format_track_name<category::amd_smi_vcn_activity>(vcn);
@@ -220,9 +220,9 @@ struct cache_policy
220220
EXPRESSION, 0, 0, "{}" });
221221
}
222222

223-
for(int xcp = 0; xcp < AMDSMI_MAX_NUM_XCP; ++xcp)
223+
for(size_t xcp = 0; xcp < MAX_NUM_XCP; ++xcp)
224224
{
225-
for(int vcn = 0; vcn < AMDSMI_MAX_NUM_VCN; ++vcn)
225+
for(size_t vcn = 0; vcn < MAX_NUM_VCN; ++vcn)
226226
{
227227
auto vcn_name =
228228
trace_cache::info::format_track_name<category::amd_smi_vcn_activity>(
@@ -237,9 +237,9 @@ struct cache_policy
237237
}
238238
}
239239

240-
for(size_t xcp = 0; xcp < AMDSMI_MAX_NUM_XCP; ++xcp)
240+
for(size_t xcp = 0; xcp < MAX_NUM_XCP; ++xcp)
241241
{
242-
for(size_t jpeg = 0; jpeg < ROCPROFSYS_AMDSMI_JPEG_ENGINE_COUNT; ++jpeg)
242+
for(size_t jpeg = 0; jpeg < MAX_NUM_JPEG_V1; ++jpeg)
243243
{
244244
auto jpeg_name =
245245
trace_cache::info::format_track_name<category::amd_smi_jpeg_activity>(

0 commit comments

Comments
 (0)