Add survivors label to GC timeline samples [PROF-14573] #3861

Draft
realFlowControl wants to merge 2 commits into master from florian/prof-14573-gc-survivors

@realFlowControl realFlowControl commented May 7, 2026

Description

Surfaces a lightweight per-GC heap snapshot that lets users spot object leaks in the existing GC timeline events, without building a full heap-live profiler.

After every GC tick recorded by the profiler, walks EG(objects_store) and attaches a single survivors label to the existing [gc] timeline sample listing the top 10 live classes by instance count:

survivors = \ImmutableDateTime 59, \DateTime 24, Closure 12, ...

Watching the label across successive GC events in the timeline view surfaces the leak fingerprint: a class whose count climbs run-over-run.
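As a rough illustration of the "leak fingerprint", the sketch below parses a series of survivors label values and reports the classes whose count rises at every step. Both `parse_survivors` and `climbing_classes` are hypothetical helpers, not part of this PR, and the label shape (`Name Count` entries separated by `, `) is assumed from the example above.

```rust
use std::collections::HashMap;

/// Parse a survivors label such as `\Foo 3, \Bar 1` into a class -> count map.
/// Entries without a trailing integer (for example a literal `...`) are skipped.
fn parse_survivors(label: &str) -> HashMap<String, u64> {
    label
        .split(", ")
        .filter_map(|entry| {
            let (class, count) = entry.trim().rsplit_once(' ')?;
            Some((class.to_string(), count.parse::<u64>().ok()?))
        })
        .collect()
}

/// Return, sorted by name, the classes whose count strictly increases
/// between every pair of consecutive labels.
fn climbing_classes(labels: &[&str]) -> Vec<String> {
    let snapshots: Vec<HashMap<String, u64>> =
        labels.iter().map(|label| parse_survivors(label)).collect();
    if snapshots.len() < 2 {
        return Vec::new();
    }
    let mut climbing: Vec<String> = snapshots[0]
        .keys()
        .filter(|class| {
            snapshots.windows(2).all(|pair| {
                match (pair[0].get(*class), pair[1].get(*class)) {
                    (Some(before), Some(after)) => after > before,
                    // A class that drops out of the top 10 is not reported.
                    _ => false,
                }
            })
        })
        .cloned()
        .collect();
    climbing.sort();
    climbing
}
```

A class that temporarily falls out of the top 10 is ignored by this sketch, so it only catches leaks that stay visible in every label.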

How it works

  • One-line C accessor exposing &EG(objects_store). Everything else (iteration, aggregation, top-N selection, formatting) is Rust.
  • Aggregation keys on *const zend_class_entry (cheap pointer key, no string work during the walk); class name is resolved once per kept entry.
  • Sort by (count desc, class_name asc) for deterministic output.
  • Internal classes (Closure, generators, engine types) are included — they can leak too.
  • Skips emission entirely when EG(objects_store).top < 32 (heap too small to be interesting).
  • No new sample type, no new INI flag — always on when the timeline is enabled (which is the existing gate for GC samples).
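The aggregation and top-N selection described above could look roughly like the sketch below. It is not the code from this PR: `ClassKey` is a plain integer standing in for `*const zend_class_entry`, `name_of` is a hypothetical resolver, and the sketch assumes every name is emitted with a leading `\`. Names are resolved once per distinct class before sorting, because the tie-break needs them.

```rust
use std::collections::HashMap;

const TOP_N: usize = 10;

/// Stand-in for `*const zend_class_entry`; the real key would be the raw pointer.
type ClassKey = usize;

/// Count instances per class key. No string work happens during this pass.
fn aggregate(objects: &[ClassKey]) -> HashMap<ClassKey, u64> {
    let mut counts = HashMap::new();
    for &key in objects {
        *counts.entry(key).or_insert(0u64) += 1;
    }
    counts
}

/// Resolve names, sort by (count desc, name asc), keep the top N and format.
fn format_top_n(
    counts: &HashMap<ClassKey, u64>,
    name_of: impl Fn(ClassKey) -> String,
) -> Option<String> {
    if counts.is_empty() {
        return None;
    }
    let mut entries: Vec<(String, u64)> = counts
        .iter()
        .map(|(&key, &count)| (name_of(key), count))
        .collect();
    // Higher counts first; equal counts fall back to the name for stable output.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries.truncate(TOP_N);
    let parts: Vec<String> = entries
        .iter()
        .map(|(name, count)| format!("\\{} {}", name, count))
        .collect();
    Some(parts.join(", "))
}
```

Sorting on the name as a secondary key is what makes the output independent of `HashMap` iteration order.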

Test coverage

  • 5 new Rust unit tests covering sort tie-break, top-N truncation, format shape, single entry, and empty input.
  • New prof-correctness test gc_survivors.{php,json} that instantiates known counts of two namespaced classes (Bench\AlphaThing × 200, Bench\BetaThing × 100), forces a GC, and asserts both appear in the survivors label in the expected order.
  • CI wiring added to .github/workflows/prof_correctness.yml.

Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

Plan

PROF-14573 — Show survivors after GC

Goal

After every GC tick recorded by the profiler, attach a single `survivors` label
to the existing GC timeline sample listing the top 10 classes (by live-object
instance count) at that moment. Lets users spot object leaks by watching the
counts climb across successive GC events in the timeline.

Design

  • No new sample type, no new INI flag. One extra label on the existing
    `[gc]` timeline sample.
    • key: `survivors`
    • value: `\ImmutableDateTime 59, \DateTime 24, Closure 12, ...`
  • Source of truth: walk `EG(objects_store).object_buckets[1..top]`, skip
    buckets where `IS_OBJ_VALID` is false. Aggregate by `zend_class_entry*`.
  • Top 10: sort by `(count desc, class_name asc)` for deterministic output.
  • Threshold: skip emission entirely when `EG(objects_store).top < 32`.
  • Internal classes included (Closure, Generator, engine types). They can
    also leak.

Implementation

Tiny C accessor — `profiling/src/php_ffi.{h,c}`

Bindgen already exposes `zend_objects_store` (struct with `object_buckets`,
`top`, `size`, `free_list_head`) so all the iteration stays in Rust. We only
need one trivial helper to expand the `EG(objects_store)` macro:

```c
zend_objects_store *ddog_php_prof_objects_store(void) {
    return &EG(objects_store);
}
```

Rust — `profiling/src/timeline.rs` (`mod gc_survivors`)

  • `pub(super) fn collect_top_n() -> Option<String>`:

    1. If `top < 32`, return `None`.
    2. Walk `object_buckets[1..top]` (handle 0 is never assigned to an object).
      `IS_OBJ_VALID(o)` is just a bitwise check (`(ptr as usize) & 1 == 0`);
      inlined Rust-side, no FFI.
    3. Aggregate `(*const zend_class_entry, u64)` in a `HashMap`.
    4. Sort by `(count desc, name asc)`, take 10.
    5. Format `\Class N, Class N, ...` (leading `\` per the ticket).
  • Class name extraction reuses `zai_str_from_zstr` (the canonical
    `zend_string` -> bytes helper used elsewhere in `bindings/mod.rs`).
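The bucket walk in steps 1 and 2 can be sketched without any PHP headers. Everything below is a mock: `MockObject` stands in for `zend_object`, its `ce` field is an integer rather than a `*const zend_class_entry`, and `live_class_keys` is a hypothetical name. The only parts taken from the plan are the `top < 32` cutoff, the `[1..top]` range, and the low-bit validity check.

```rust
/// Minimal stand-in for `zend_object`; only the class key matters here.
struct MockObject {
    ce: usize,
}

/// Threshold below which nothing is emitted, as described in the plan.
const MIN_TOP: usize = 32;

/// Same idea as PHP's `IS_OBJ_VALID`: a set low bit marks a freed slot.
fn is_obj_valid(bucket: *const MockObject) -> bool {
    ((bucket as usize) & 1) == 0
}

/// Collect the class key of every live object in `buckets[1..top]`.
///
/// # Safety
/// Every non-null bucket with a clear low bit must point at a live `MockObject`.
unsafe fn live_class_keys(buckets: &[*const MockObject], top: usize) -> Option<Vec<usize>> {
    if top < MIN_TOP {
        return None;
    }
    let end = top.min(buckets.len());
    let mut keys = Vec::new();
    // Slot 0 is skipped; `get` avoids a panic if the slice is shorter than expected.
    for &bucket in buckets.get(1..end).unwrap_or(&[]) {
        if bucket.is_null() || !is_obj_valid(bucket) {
            continue;
        }
        // SAFETY: guaranteed by the caller, see the function-level contract.
        keys.push(unsafe { (*bucket).ce });
    }
    Some(keys)
}
```

The explicit null check is a defensive addition in this sketch: a null pointer has a clear low bit and would otherwise pass the validity test.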

GC sample wiring — `profiling/src/profiling/mod.rs`

  • Extends `Profiler::collect_garbage_collection` to take an optional
    survivors string and attach it as a `survivors` label when `Some`.
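A minimal sketch of the "attach only when `Some`" behaviour follows. The `Label` struct, the `gc_labels` function, and the `reason` label key are placeholders invented for illustration; the profiler's real label types and keys are not shown in this PR description.

```rust
/// Hypothetical, simplified label; the profiler's real label type differs.
#[derive(Debug, PartialEq)]
struct Label {
    key: &'static str,
    value: String,
}

/// Build the labels for one GC sample. The `survivors` label is added only
/// when a value was collected, so small heaps produce no extra label at all.
fn gc_labels(reason: &str, survivors: Option<String>) -> Vec<Label> {
    let mut labels = vec![Label {
        key: "reason", // placeholder key, not necessarily the real one
        value: reason.to_string(),
    }];
    if let Some(value) = survivors {
        labels.push(Label {
            key: "survivors",
            value,
        });
    }
    labels
}
```

Passing the `Option` through unchanged keeps the threshold decision in one place, inside `collect_top_n`.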

Caller — `ddog_php_prof_gc_collect_cycles`

After `prev()`, before passing to `collect_garbage_collection`:

```rust
let survivors = gc_survivors::collect_top_n();
profiler.collect_garbage_collection(now, duration, reason, collected, runs, survivors);
```

Tests

Rust unit tests (`gc_survivors` module)

  • Top-10 cutoff (15 inputs → 10 emitted, lowest count dropped).
  • Tie-break by name when counts are equal.
  • Format string shape (leading `\`, comma-space separator).
  • Single entry, empty input.

Prof-correctness test — `profiling/tests/correctness/gc_survivors.{php,json}`

PHP script:

  • Two namespaced classes `Bench\AlphaThing`, `Bench\BetaThing`.
  • Instantiate well-known counts (200 of Alpha, 100 of Beta), keep them
    in a global so they survive GC.
  • Final `gc_collect_cycles()` to force a user-induced run.

JSON expectations:

  • A `[gc]` stack with a `survivors` label whose value matches a regex
    asserting both class names appear with their expected counts in the
    expected order.
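The ordering assertion can also be expressed without a regex. The sketch below is not the prof-correctness harness; `appears_in_order` is a hypothetical helper, and the exact label text (leading `\` on namespaced names, `, ` separators) is an assumption based on the format described earlier.

```rust
/// Returns true when every expected entry occurs in `label` as a whole
/// `, `-separated item, in the given order. Other entries may sit in between.
fn appears_in_order(label: &str, expected: &[&str]) -> bool {
    let mut entries = label.split(", ");
    // `any` advances the shared iterator, so each search resumes where
    // the previous match ended.
    expected
        .iter()
        .all(|want| entries.any(|got| got == *want))
}
```

Matching whole entries rather than substrings avoids accepting `\Bench\BetaThing 1000` when `\Bench\BetaThing 100` is expected.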

CI wiring — `.github/workflows/prof_correctness.yml`

  • Adds `gc_survivors` to the `for test_case` loops.
  • Adds a `Check profiler correctness for GC survivors` step.

Out of scope

  • Arrays, strings, refs.
  • Per-class allocation site / stacktrace.
  • Dominator/retention analysis.
  • Tunable `top_n` or threshold.

🤖 Generated with Claude Code

After every GC tick, walk EG(objects_store) and attach a `survivors`
label to the existing [gc] sample listing the top 10 live classes by
instance count (e.g. `\ImmutableDateTime 59, \DateTime 24, ...`).
Watching that label across consecutive GC events surfaces object leaks
without needing a full heap profiler.

C side adds a one-line accessor for &EG(objects_store); all walking,
aggregation, sorting and formatting stay in Rust. Skips emission when
EG(objects_store).top is below 32. Includes a prof-correctness test
that asserts known counts of two namespaced classes appear in the
survivors label.

PROF-14573

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realFlowControl realFlowControl force-pushed the florian/prof-14573-gc-survivors branch from 6744d2b to 2f4908d Compare May 7, 2026 12:37

datadog-prod-us1-6 Bot commented May 7, 2026

Tests

⚠️ Warnings

🧪 31 Tests failed

tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking.test_login_event_blocking_auto_id[apache-mod-8.0] from system_tests_suite
AssertionError: assert 500 == 200
 +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:10 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
 +    where HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:10 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:) = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f8291349fd0>.r_login

self = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f8291349fd0>

    def test_login_event_blocking_auto_id(self):
>       assert self.r_login.status_code == 200
E       AssertionError: assert 500 == 200
E        +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:10 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
...
tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking.test_login_event_blocking_auto_id[php-fpm-8.5] from system_tests_suite
AssertionError: assert 500 == 200
 +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:01:56 GMT', 'Server': 'Apache/2.4.67 (Ubuntu)', 'X...red-By': 'PHP/8.5.5', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
 +    where HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:01:56 GMT', 'Server': 'Apache/2.4.67 (Ubuntu)', 'X...red-By': 'PHP/8.5.5', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:) = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f428c2f41a0>.r_login

self = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f428c2f41a0>

    def test_login_event_blocking_auto_id(self):
>       assert self.r_login.status_code == 200
E       AssertionError: assert 500 == 200
E        +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:01:56 GMT', 'Server': 'Apache/2.4.67 (Ubuntu)', 'X...red-By': 'PHP/8.5.5', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
...
tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking.test_login_event_blocking_auto_login[apache-mod-8.0] from system_tests_suite
AssertionError: assert 500 == 200
 +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:38 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
 +    where HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:38 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:) = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f829134ac90>.r_login

self = <tests.appsec.test_automated_login_events.Test_V3_Login_Events_Blocking object at 0x7f829134ac90>

    def test_login_event_blocking_auto_login(self):
>       assert self.r_login.status_code == 200
E       AssertionError: assert 500 == 200
E        +  where 500 = HttpResponse(status_code:500, headers:{'Date': 'Fri, 08 May 2026 10:03:38 GMT', 'Server': 'Apache/2.4.66 (Debian)', 'X...ed-By': 'PHP/8.0.18', 'Content-Length': '0', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}, text:).status_code
...
ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.67% (-0.05%)

🔗 Commit SHA: 249b582


pr-commenter Bot commented May 7, 2026

Benchmarks [ profiler ]

Benchmark execution time: 2026-05-08 09:48:51

Comparing candidate commit 249b582 in PR branch florian/prof-14573-gc-survivors with baseline commit 9fe0330 in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 27 metrics, 8 unstable metrics.

scenario:php-profiler-timeline-memory-with-profiler

  • 🟩 cpu_system_time [-52.029ms; -18.576ms] or [-12.319%; -4.398%]