
Persistent MonitorEvents#4491

Open
valentinewallace wants to merge 19 commits into lightningdevkit:main from valentinewallace:2026-03-persistent-mon-evs

Conversation

@valentinewallace
Contributor

@valentinewallace valentinewallace commented Mar 17, 2026

As part of #4482, we're looking into changing our architecture -- currently, the Channel{Manager} is responsible for managing the resolution of off-chain HTLCs, and the ChannelMonitor is responsible for them once they're on-chain. See the issue description for the complexity that results from this design.

Quoting the issue, "Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them - adding a new MonitorEvent resolution path through a new method (rather than via ChannelMonitorUpdates)."

Here we don't add any new MonitorEvents but add support for making the ones we do have persistent across restarts until they are explicitly ack'd, via a new chain::Watch API.

  • update commit messages

@ldk-reviews-bot

ldk-reviews-bot commented Mar 17, 2026

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

(17, in_flight_monitor_updates, option),
(19, peer_storage_dir, optional_vec),
(21, async_receive_offer_cache, (default_value, async_receive_offer_cache)),
(23, persistent_monitor_events, (default_value, false)),
Collaborator

Given the logic isn't complete, should we write this as an even tlv on the write side and deliberately not implement read for it yet?

Contributor Author

Do you mean use an even tlv and conditionally write it only if it's set to true? That makes sense to me

Contributor Author

Is this the plan for upgrades/downgrades?

  • Until ~everything's in place, use even TLVs for things but don't write them in prod, and error in prod if they are set on read
  • Then have a version or two that allows the even TLVs to be set on read, but doesn't write them, to allow downgrades
  • Then have a version that allows the even TLVs to be missing, runs the legacy reconstruction logic once, then sets persistent_monitor_events in the manager/monitors at the end of manager read so even TLVs will be henceforth written
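The upgrade/downgrade plan above leans on BOLT 1's "it's okay to be odd" TLV rule: a reader must reject an unknown *even* type (the writer required it to be understood) but may skip an unknown *odd* one. A minimal standalone sketch of that rule (illustrative only, not LDK's actual TLV reader):

```rust
// Sketch of the "it's okay to be odd" rule the even-TLV plan relies on.
// An unknown even type is a hard error; an unknown odd type is skippable.
fn handle_unknown_tlv(tlv_type: u64) -> Result<(), &'static str> {
    if tlv_type % 2 == 0 {
        // Unknown even TLV: the writer required us to understand it,
        // so an older reader must refuse to deserialize (no silent downgrade).
        Err("unknown required (even) TLV type")
    } else {
        // Unknown odd TLV: safe to ignore; older readers keep working.
        Ok(())
    }
}

fn main() {
    assert!(handle_unknown_tlv(23).is_ok()); // odd: ignorable by old readers
    assert!(handle_unknown_tlv(24).is_err()); // even: old readers must error
}
```

This is why writing the flag as an even TLV (and only when set) makes downgrades fail loudly rather than silently losing the persistent-event state.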

Collaborator

Yea, that's basically what I was thinking.

@valentinewallace valentinewallace force-pushed the 2026-03-persistent-mon-evs branch from 2a22b3e to 6930ccd Compare March 19, 2026 17:11
@valentinewallace valentinewallace self-assigned this Mar 19, 2026
@valentinewallace valentinewallace added the weekly goal Someone wants to land this this week label Mar 19, 2026
@valentinewallace valentinewallace force-pushed the 2026-03-persistent-mon-evs branch from 6930ccd to 2eba975 Compare March 19, 2026 20:24
@valentinewallace valentinewallace marked this pull request as ready for review March 19, 2026 20:27
@valentinewallace valentinewallace requested review from TheBlueMatt and removed request for wpaulino March 19, 2026 20:28
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 86.11111% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.20%. Comparing base (d830f10) to head (bbf1786).

Files with missing lines Patch % Lines
lightning/src/ln/channelmanager.rs 74.67% 29 Missing and 10 partials ⚠️
lightning/src/chain/channelmonitor.rs 92.20% 6 Missing and 6 partials ⚠️
lightning/src/ln/channel.rs 94.73% 1 Missing and 2 partials ⚠️
lightning/src/chain/chainmonitor.rs 94.73% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4491      +/-   ##
==========================================
+ Coverage   86.18%   86.20%   +0.01%     
==========================================
  Files         160      160              
  Lines      107536   107789     +253     
  Branches   107536   107789     +253     
==========================================
+ Hits        92680    92916     +236     
- Misses      12231    12235       +4     
- Partials     2625     2638      +13     
Flag Coverage Δ
tests 86.20% <86.11%> (+0.01%) ⬆️


Collaborator

@TheBlueMatt TheBlueMatt left a comment

Basically LGTM!

}

impl_writeable_tlv_based!(MonitorEventSource, {
(0, event_id, required),
Collaborator

nit: odds for new fields so we can drop them later in future versions if we dont want them.

/// When set, monitor events are retained until explicitly acked rather than cleared on read.
///
/// Allows the ChannelManager to reconstruct pending HTLC state by replaying monitor events on
/// startup, rather than from complexly polling and reconciling Channel{Monitor} APIs, as well as
Collaborator

complexly lol.

};

// Only write `persistent_events_enabled` if it's set to true, as it's an even TLV.
let persistent_events_enabled = channel_monitor.persistent_events_enabled.then_some(true);
Collaborator

nit should this write () instead of bool?
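The nit is that presence alone carries the information, so the TLV value can be empty. A tiny sketch of the difference (names hypothetical; in LDK's TLV serialization a present `Option<()>` presumably encodes a zero-length value, while `Option<bool>` carries a payload byte):

```rust
// `bool::then_some` yields Some only when the flag is set, so the TLV is
// written only in the enabled case either way; `()` just drops the payload.
fn main() {
    let enabled = true;
    let as_bool: Option<bool> = enabled.then_some(true); // payload carries `true`
    let as_unit: Option<()> = enabled.then_some(());     // presence is the signal
    assert_eq!(as_bool, Some(true));
    assert_eq!(as_unit, Some(()));
    assert_eq!(false.then_some(()), None); // absent entirely when disabled
}
```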

}

/// Removes a [`MonitorEvent`] by its event ID, acknowledging that it has been processed.
pub(super) fn ack_monitor_event(&self, event_id: u64) {
Collaborator

Should be pub, no?

fn fail_htlc<L: Logger, E: FailHTLCContents + Clone>(
&mut self, htlc_id_arg: u64, err_contents: E, mut force_holding_cell: bool,
logger: &L
monitor_event_source: Option<MonitorEventSource>, logger: &L
Collaborator

Do we want to go ahead and update the logic here to push the event source if there isn't one and the fail is still pending?

HTLCFailureMsg::Relay(fail_htlc) => HTLCForwardInfo::FailHTLC {
htlc_id: fail_htlc.htlc_id,
err_packet: fail_htlc.into(),
monitor_event_source: None,
Collaborator

Hmm we're failing to forward, presumably there's a monitor event from the inbound edge we have to complete when we fail the HTLC? Bit funny to fail an HTLC back with a monitor event completion for the same channel but makes sense I think?

// inbound channel durably has the preimage, the outbound
// monitor event can be acked. The preimage remains in the
// outbound monitor's storage and will be replayed to all
// remaining inbound hops on restart via pending_claims_to_replay.
Collaborator

Do we want to note that we want to move away from this eventually? We might eventually move to a world where we skip the replay and move to pushing "temporary" monitor events such that we just have to replay monitor events to do the multi-claim.

@ldk-reviews-bot

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

Helps when debugging to know which variants failed.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart, instead of examining the set of HTLCs in
monitors, we can simply replay all the pending MonitorEvents.

As a first step towards this, here we persist a flag in the ChannelManager and
ChannelMonitors indicating whether this new feature is enabled. It will be used
in upcoming commits to maintain compatibility and create an upgrade/downgrade
path between LDK versions.
@valentinewallace valentinewallace force-pushed the 2026-03-persistent-mon-evs branch from 2eba975 to b8644fb Compare March 20, 2026 21:34
@valentinewallace
Contributor Author

Rebased with no changes due to conflicts

}
let duplicate_event = self.pending_monitor_events.iter().any(
|update| if let &MonitorEvent::HTLCEvent(ref upd) = update {
|(_, update)| if let &MonitorEvent::HTLCEvent(ref upd) = update {
Collaborator

Bug (latent, currently test-only): When persistent_events_enabled is true, get_and_clear_pending_monitor_events moves events from pending_monitor_events to provided_monitor_events. This duplicate check only scans pending_monitor_events, so it will miss events that have been provided but not yet acked. A subsequent block connection could then push a duplicate HTLCEvent.

The same issue exists in the check_connected_output duplicate checks (around lines 6398 and 6420).

Since persistent_events_enabled is gated behind _test_utils/test today, this isn't a production issue yet — but it will be once the feature is enabled. These checks should also scan provided_monitor_events.
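A minimal model of the two-queue design this comment describes (all names illustrative, not LDK's actual types): `get_and_clear` moves events from a pending queue to a provided queue until they're acked, so a correct duplicate check has to scan both queues.

```rust
// Hypothetical sketch: events move pending -> provided on get_and_clear,
// are dropped on ack, and the duplicate check must look at both queues.
#[derive(Clone, PartialEq, Debug)]
struct MonitorEvent { id: u64, htlc_source: u32 } // htlc_source is a stand-in

#[derive(Default)]
struct EventQueues {
    pending: Vec<MonitorEvent>,
    provided: Vec<MonitorEvent>,
}

impl EventQueues {
    fn get_and_clear(&mut self) -> Vec<MonitorEvent> {
        let evs: Vec<MonitorEvent> = self.pending.drain(..).collect();
        // Retain provided-but-unacked events so they survive a restart.
        self.provided.extend(evs.iter().cloned());
        evs
    }
    fn ack(&mut self, id: u64) {
        self.provided.retain(|e| e.id != id);
    }
    // The fixed duplicate check: scan pending AND provided.
    fn has_event_for(&self, source: u32) -> bool {
        self.pending.iter().chain(self.provided.iter())
            .any(|e| e.htlc_source == source)
    }
}

fn main() {
    let mut q = EventQueues::default();
    q.pending.push(MonitorEvent { id: 1, htlc_source: 42 });
    q.get_and_clear(); // moved to `provided`, not yet acked
    assert!(q.has_event_for(42)); // scanning only `pending` would miss this
    q.ack(1);
    assert!(!q.has_event_for(42));
}
```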

if accepted_preimage_claim {
if !self.pending_monitor_events.iter().any(
|update| if let &MonitorEvent::HTLCEvent(ref upd) = update { upd.source == source } else { false }) {
|(_, update)| if let &MonitorEvent::HTLCEvent(ref upd) = update { upd.source == source } else { false }) {
Collaborator

Same latent issue as the check_spend_height_closed duplicate check: when persistent_events_enabled is true, this only scans pending_monitor_events but events may have already been moved to provided_monitor_events, allowing duplicate event generation.

@ldk-claude-review-bot
Collaborator

Cleans up the next commit
This field will be deprecated in upcoming commits when we start persisting
MonitorEvent ids alongside the MonitorEvents.

Here we add an as-yet-unused API to chain::Watch to allow the ChannelManager to
tell a ChannelMonitor that a MonitorEvent has been irrevocably processed
and can be deleted.

To allow the ChannelManager to ack specific monitor events once they are
resolved in upcoming commits, here we give each MonitorEvent a corresponding
unique id. It's implemented in such a way that we can delete legacy monitor
event code in the future when the new persistent monitor events flag is enabled
by default.

Here we add a MonitorUpdateCompletionAction that will allow the ChannelManager
to tell a monitor that an event is complete, upon completion of a particular
ChannelMonitorUpdate. For example, this will be used when an HTLC is
irrevocably failed backwards post-channel-close, to delete the corresponding
MonitorEvent::HTLCEvent.

Here we take care of the monitor events that are trivial to ACK, since all
except HTLCEvents can be ACK'd immediately after the event is handled and don't
need to be re-processed on startup.

If an outbound payment resolved on-chain and we get a monitor event for it, we
want to mark that monitor event as resolved once the user has processed a
PaymentSent or a PaymentFailed event.

If the outbound edge of a forwarded payment is resolved on-chain via preimage,
and we get a monitor event for it, we want to mark that monitor event resolved
once the preimage is durably persisted in the inbound edge. Hence, we add the
monitor event resolution to the completion of the inbound edge's preimage
monitor update.

To ensure we can resolve HTLC monitor events for forward failures, we need to
pipe the event id through the HTLC failure pipeline, starting from when we
first handle the monitor event and call fail_htlc_backwards. To get it from
there to the next stage, here we store the monitor event id in the manager's
HTLCForwardInfo::Fail*. In upcoming commits we will eventually get to the point
of acking the monitor event when the HTLC is irrevocably removed from the
inbound edge via monitor update.

To ensure we can resolve HTLC monitor events for forward failures, we need to
pipe the event id through the HTLC failure pipeline. We started this process
in the previous commit, and here we continue by storing the monitor event id
in the Channel's holding cell HTLC failures. In upcoming commits we will
eventually get to the point of acking the monitor event when the HTLC is
irrevocably removed from the inbound edge via monitor update.

To ensure we can resolve HTLC monitor events for forward failures, we need to
pipe the event id through the HTLC failure pipeline. Here we pipe the monitor
event source from the manager to the channel, so it can be put in the channel's
holding cell.

In upcoming commits we will eventually get to the point of acking the monitor
event when the HTLC is irrevocably removed from the inbound edge via monitor
update.

Here we build on recent commits by ACK'ing monitor events for forward failures
once the monitor update that marks them as failed on the inbound edge is
complete.

Here we build on recent commits by ACK'ing monitor events for forward failures
once the monitor update that marks them as failed on the inbound edge is
complete, when the HTLC was irrevocably failed via revoke_and_ack.

Here we build on recent commits by ACK'ing monitor events for forward failures
if we try to fail backwards and discover the inbound edge is closed. If that's the
case, there's nothing for us to do as the HTLC will be resolved via timeout
by the inbound edge counterparty.

Here we build on recent commits by ACK'ing monitor events for forward failures
if we go to fail on the inbound edge and get an error when queueing the
backwards failure. If we get an error in this case it means the failure is
duplicate or the channel is closed, and in either case it's fine to mark the
event as resolved.

Here we complete work that was built on recent prior commits and actually start
re-providing monitor events on startup if they went un-acked during runtime.
This isn't actually supported in prod yet, so this new code will run randomly
in tests, to ensure we still support the old paths.
@valentinewallace valentinewallace force-pushed the 2026-03-persistent-mon-evs branch from b8644fb to bbf1786 Compare March 20, 2026 22:11
Comment on lines 8898 to 8905
@@ -8845,15 +8904,21 @@ where
.push(PendingChannelMonitorUpdate { update: monitor_update });
return Ok(($htlcs_to_fail, static_invoices, None));
Collaborator

Bug: When the monitor update is blocked (!release_monitor), the monitor_events_to_ack populated from the holding cell drain at line 8921 is silently dropped — PendingChannelMonitorUpdate only stores the ChannelMonitorUpdate, not the associated events to ack. When the blocked update is eventually released via unblock_next_blocked_monitor_update, the ChannelManager never learns about these events and cannot add an AckMonitorEvents completion action.

With persistent_events_enabled, this means the outbound monitor events that triggered the HTLC failures will never be acked and will replay on every restart.

The fix would likely involve either:

  1. Adding a monitor_events_to_ack: Vec<MonitorEventSource> field to PendingChannelMonitorUpdate and returning it alongside the update when unblocking.
  2. Or always returning the events to ack from revoke_and_ack (even when the monitor update itself is blocked), so the ChannelManager can store them independently.
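Fix option 1 can be sketched as follows (all type and field names are illustrative stand-ins, not LDK's actual definitions): the blocked-update holder carries the events-to-ack alongside the update, and unblocking returns both so the manager can attach an ack completion action.

```rust
// Hypothetical sketch of fix option 1: a blocked monitor update carries
// the monitor events that must be acked once it completes.
#[derive(Clone, Debug, PartialEq)]
struct MonitorEventSource { event_id: u64 }

struct ChannelMonitorUpdate { update_id: u64 }

struct PendingChannelMonitorUpdate {
    update: ChannelMonitorUpdate,
    // New field: events to ack when this update is released and completes.
    monitor_events_to_ack: Vec<MonitorEventSource>,
}

// On unblocking, return the acks with the update so the ChannelManager can
// register an AckMonitorEvents-style completion action instead of losing them.
fn unblock(pending: PendingChannelMonitorUpdate)
    -> (ChannelMonitorUpdate, Vec<MonitorEventSource>)
{
    (pending.update, pending.monitor_events_to_ack)
}

fn main() {
    let blocked = PendingChannelMonitorUpdate {
        update: ChannelMonitorUpdate { update_id: 7 },
        monitor_events_to_ack: vec![MonitorEventSource { event_id: 3 }],
    };
    let (update, acks) = unblock(blocked);
    assert_eq!(update.update_id, 7);
    assert_eq!(acks, vec![MonitorEventSource { event_id: 3 }]);
}
```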

Comment on lines 9331 to +9363
@@ -9245,6 +9360,7 @@ impl<
&incoming_trampoline_shared_secret,
&None,
*htlc_id,
if i == 0 { trampoline_monitor_event_source } else { None },
Collaborator

Nit/concern: Unlike the claim path (line 10217-10221), there's no comment explaining why only the first HTLC gets the monitor_event_source for the failure case. The reasoning is different here — for claims, the preimage is stored in the monitor and will be replayed to remaining hops via pending_claims_to_replay. For failures, there's no equivalent replay mechanism for the remaining inbound HTLCs.

If the outbound monitor event is acked (because the first inbound failure was committed and the outbound monitor happened to persist), but the node crashes before the remaining inbound HTLCs are failed, those remaining failures would not be re-triggered on restart.

This is acknowledged by the TODO at line 9327, but it might be worth a comment here explaining why the risk is acceptable today (e.g., the ack is not durable since the outbound monitor doesn't persist on ack, so the event will replay in practice).
