Our BTCPay Server installation went into a degraded state on Friday, March 6th, around 10:30 AM UTC. The symptoms: customer payments were detected on the Bitcoin network and BTCPay’s database showed the invoices as paid, but the webhook notifications to our internal systems were not being delivered. From our internal systems’ perspective, the invoices remained unpaid.

The visible customer-facing impact was bounded but real. Customers paying for our services were not receiving immediate service activation. They paid, the payment confirmed on the Bitcoin network, and they waited for activation that never came. Customer support tickets started arriving by Friday afternoon.

We identified and remediated the issue over the next 14 hours, with full restoration by Saturday morning. This post-mortem documents the incident, the cause analysis, the remediation, and what we changed about our BTCPay monitoring and operational practices afterward.

The BTCPay Server context

BTCPay Server is our self-hosted Bitcoin payment processor. We have been running it in production since 2022 for accepting customer cryptocurrency payments alongside other payment methods.

The setup:

Bitcoin Core full node running in-house
BTCPay Server connected to the Bitcoin node via NBXplorer
LND (Lightning Network Daemon) for Lightning Network payments
Custom webhook integration with our internal customer management system
Standard BTCPay store configuration for our different services

The typical operational flow: customer requests an invoice, BTCPay generates a payment address, customer pays, BTCPay detects the payment, BTCPay fires a webhook to our internal system, internal system activates customer service.
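The last two steps of this flow depend on the internal system accepting and authenticating each webhook. A minimal sketch of that receiving side follows. It assumes BTCPay signs the raw request body with HMAC-SHA256 and sends the result in a `BTCPay-Sig` header as `sha256=<hex digest>`, and that the JSON payload carries `type` and `invoiceId` fields. The `activate_service` callback and the choice of `InvoiceSettled` as the activation trigger are illustrative assumptions, not our production code.

```python
import hashlib
import hmac
import json
from typing import Callable


def signature_is_valid(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check a 'sha256=<hex digest>' header against the raw request body."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    supplied = header_value[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), supplied.encode())


def handle_webhook(
    secret: bytes,
    body: bytes,
    header_value: str,
    activate_service: Callable[[str], None],  # hypothetical internal hook
) -> int:
    """Return the HTTP status code the receiver should answer with."""
    if not signature_is_valid(secret, body, header_value):
        return 401
    event = json.loads(body)
    # Assumption: activation is tied to the settled event; other event
    # types are acknowledged but otherwise ignored.
    if event.get("type") == "InvoiceSettled":
        activate_service(event["invoiceId"])
    return 200
```

The signature must be computed over the exact bytes received, before any JSON parsing or re-serialization, otherwise valid deliveries will be rejected.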

The flow has been operating reliably for over three years. The Friday incident was the first significant operational failure.

Initial detection

Friday afternoon at 2:15 PM UTC, customer support received the first ticket: “I paid for ASH service two hours ago and nothing has happened.”

Our first investigation pointed to customer error: it looked as though the customer had paid an old invoice, sending the payment to an expired address. We were prepared to reissue.

Then a second ticket arrived. Then a third. Each customer reported the same thing: paid, no activation. The pattern made customer error unlikely.

Our internal payment dashboard showed no recent payments for the past 4+ hours, despite expected payment activity. The dashboard pulls from our internal system rather than BTCPay directly. The discrepancy was the first signal that the BTCPay-to-internal integration was the issue.

Investigation phase 1: BTCPay state

We logged into BTCPay’s admin interface to check status. The interface showed:

Several “Paid” invoices that should have triggered webhooks. The invoice records showed payment received, blockchain confirmation count, and “Paid” status. Yet our internal system showed nothing.

The webhook delivery logs showed attempts but most attempts were marked as failed. The error messages varied: “Connection timeout”, “Read timeout”, “DNS resolution failure”.

The diagnosis was clear at a high level: BTCPay was processing payments correctly internally but failing to deliver notifications to our internal system.

Investigation phase 2: webhook target

We checked our internal system’s webhook endpoint. The endpoint was responding normally to manual test requests. Health checks passed. Other systems were communicating with it without issues.

The mystery deepened: BTCPay could not reach an endpoint that was otherwise responsive.

Network connectivity testing revealed the issue. BTCPay’s server could resolve our internal hostname through DNS but the resulting IP was unreachable from BTCPay’s server. The endpoint was responding to other systems but not to BTCPay specifically.
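The distinction that mattered here, "the name resolves" versus "the resolved address accepts connections", can be checked from the affected host in a few lines. The sketch below is not the tooling we used during the incident. It uses only the Python standard library, and the hostname and port passed to it are whatever the webhook URL points at.

```python
import socket


def resolve_ipv4(hostname: str, port: int) -> list[str]:
    """Return the IPv4 addresses this host's resolver currently gives."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def tcp_reachable(ip: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname: str, port: int, timeout: float = 3.0) -> dict[str, bool]:
    """Map every locally resolved address to whether it accepts connections."""
    return {
        ip: tcp_reachable(ip, port, timeout)
        for ip in resolve_ipv4(hostname, port)
    }
```

Running `diagnose` on the webhook sender and on a healthy consumer, then comparing the address sets, separates a stale local answer from a routing or firewall problem: different addresses point at DNS, identical addresses with different reachability point at the network.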

The network issue suggested either routing problems or a firewall configuration change.

Investigation phase 3: infrastructure changes

We reviewed recent infrastructure changes. The relevant finding: our internal system’s hosting had been migrated two days earlier to new infrastructure with different IP allocation. The migration was supposed to be transparent through DNS.

The internal system’s DNS records had been updated. Most consumers of the internal system updated their DNS cache within hours. BTCPay’s server, for reasons we needed to investigate, had not updated its DNS cache.

The DNS TTL on the internal system’s records was 3600 seconds (1 hour). Standard TTL. BTCPay should have refreshed DNS within an hour of the migration.

Investigation revealed BTCPay’s server was running an older systemd-resolved configuration that was caching DNS more aggressively than the TTL specified. The 2-day-old DNS cache was still in effect.

The immediate fix

The diagnosis was clear by 9 PM Friday. The fix was straightforward:

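# On the BTCPay server: restart the resolver, which discards its cached DNS answers
sudo systemctl restart systemd-resolved
# Restart BTCPay Server so it re-resolves the webhook hostname
sudo systemctl restart btcpayserver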

The DNS cache cleared. BTCPay reconnected to the internal system through the correct IP. Webhook delivery resumed.

We manually triggered the webhook retry for invoices that had been in failed state. Within an hour, all backlogged webhooks delivered successfully. Customer activations triggered. Customer support tickets resolved.
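For a larger backlog, the retry step could be scripted rather than done by hand. The sketch below is an assumption-heavy illustration, not what we ran. It assumes the Greenfield API exposes a redeliver endpoint under `/api/v1/stores/{storeId}/webhooks/{webhookId}/deliveries/{deliveryId}/redeliver`, that API keys are sent as `Authorization: token <key>`, and that delivery records carry `id` and `status` fields with `HttpSuccess` marking success. Verify each of these against the API reference for your BTCPay version before relying on them.

```python
import urllib.request


def failed_delivery_ids(deliveries: list[dict]) -> list[str]:
    """Pick the deliveries that did not end in a successful HTTP response."""
    # Assumption: "HttpSuccess" is the status value for a delivered webhook.
    return [d["id"] for d in deliveries if d.get("status") != "HttpSuccess"]


def redeliver_request(
    base_url: str, store_id: str, webhook_id: str, delivery_id: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request asking for one redelivery."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/stores/{store_id}"
        f"/webhooks/{webhook_id}/deliveries/{delivery_id}/redeliver"
    )
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": f"token {api_key}"},
    )
```

Each request would then be sent with `urllib.request.urlopen(request, timeout=10)`. Building the requests separately from sending them makes it easy to review the list of affected deliveries before anything is retried.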

The technical fix took 5 minutes. The investigation to identify the cause took 6+ hours.

Communication during the incident

We sent customer communication at 4 PM, 7 PM, and 10 PM on Friday with status updates. The communications included:

The acknowledgment that payments were arriving but activations were delayed.

The estimated resolution timeline (which we revised as the investigation progressed).

The promise that all paid customers would be activated retroactively without additional payment.

The apology for the inconvenience.

The customer response was overwhelmingly understanding. Several customers explicitly noted appreciation for the proactive communication. A few customers were frustrated (understandably) and received individual replies addressing their specific situations.

By Saturday morning, the customer impact was fully resolved. All paid customers had their service activated. All customer tickets had been closed.

The cumulative customer impact

The incident affected approximately 18 customers over the 14-hour window. The specific impacts:

12 customers received their service activation within 14 hours of payment, which was longer than usual but acceptable.

4 customers required manual intervention to ensure correct activation after the webhook delivery resumed.

2 customers cancelled their service request and requested refunds due to the delay. We processed the refunds (cryptocurrency refunds are feasible but require manual handling).

The estimated business impact:

Lost revenue from the 2 cancellations: approximately €280.

Customer support time on the incident: approximately 8 hours.

Engineering investigation and remediation time: approximately 16 hours.

Trust impact: bounded but real. The 18 affected customers had an experience that did not match our typical reliability.

Total estimated cost: €1,500-2,000 including direct and indirect impacts.

Root cause analysis

The incident had multiple contributing factors that combined to produce the failure.

The DNS caching issue

The aggressive DNS caching by systemd-resolved was the proximate cause. The DNS TTL was honored by most systems but ignored by the specific configuration on BTCPay’s server.

This was an underlying configuration issue that had existed for months without producing visible problems. Without the infrastructure migration triggering a DNS change, the caching behavior would have remained invisible.

The infrastructure migration

The internal system migration changed the IP for the webhook target. The change was operationally smooth for most consumers but produced the BTCPay-specific failure.

The migration plan included DNS verification for major consumer systems. BTCPay was not on the list of systems explicitly verified post-migration. The oversight was procedural.

The monitoring gap

Our BTCPay monitoring tracked uptime and basic functionality but did not specifically monitor webhook delivery success. The webhook failures accumulated for 4+ hours before being detected through customer reports.

The monitoring gap was specific to BTCPay; other systems with webhook flows had better monitoring. The inconsistency was an oversight.

The customer communication delay

The first customer ticket arrived at 2:15 PM. Our initial investigation suggested customer error rather than a systemic issue. The systemic understanding emerged around 4 PM with the second and third tickets.

The delay of more than 90 minutes from first signal to systemic understanding was longer than ideal. Better internal monitoring would have surfaced the issue independently of customer reports.

Procedural changes after the incident

The post-incident review produced specific procedural updates.

BTCPay-specific monitoring expansion

We added explicit monitoring for:

Webhook delivery success rate (target 99%+). Alerts if success rate drops below 95% over a 15-minute window.

Time between invoice paid event and webhook successful delivery (target under 30 seconds). Alerts if median delivery time exceeds 5 minutes.

Webhook destination connectivity (proactive testing rather than reactive). Hourly synthetic webhook test to verify the integration path.

Comparison of BTCPay’s payment record with internal system’s payment record (target 100% match within reconciliation window). Alerts on any persistent mismatch.

These monitors run continuously now. The monitoring catches issues within minutes rather than hours.
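The first two alert rules reduce to a small calculation over recent deliveries. The sketch below uses the thresholds stated above (95% success over a 15-minute window, 5-minute median delivery time). The `Delivery` record and the function name are illustrative, not the schema of our monitoring system.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Delivery:
    paid_at: float                 # Unix seconds when the invoice was marked paid
    delivered_at: Optional[float]  # Unix seconds of successful delivery, None if not delivered


def evaluate_window(
    deliveries: list[Delivery],
    now: float,
    window_s: float = 900.0,             # 15-minute window
    min_success_rate: float = 0.95,      # alert below 95%
    max_median_latency_s: float = 300.0, # alert above 5 minutes
) -> list[str]:
    """Return alert messages for invoices paid within the last window."""
    recent = [d for d in deliveries if now - window_s <= d.paid_at <= now]
    if not recent:
        return []  # no traffic in the window, nothing to judge
    alerts = []
    succeeded = [d for d in recent if d.delivered_at is not None]
    rate = len(succeeded) / len(recent)
    if rate < min_success_rate:
        alerts.append(f"success rate {rate:.0%} below {min_success_rate:.0%}")
    if succeeded:
        med = median(d.delivered_at - d.paid_at for d in succeeded)
        if med > max_median_latency_s:
            alerts.append(f"median latency {med:.0f}s above {max_median_latency_s:.0f}s")
    return alerts
```

An empty window produces no alert, which is why a rate-based rule alone is not enough at low payment volume: the hourly synthetic test covers the periods when no real invoices are paid.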

Infrastructure migration verification expansion

The infrastructure migration runbook now includes:

Explicit verification of every webhook consumer post-migration. BTCPay is on the list along with all other consumers.

DNS cache flush verification on every webhook consumer. Confirm the consumer has updated DNS cache to point to new endpoint.

Test webhook delivery to every consumer post-migration. Confirm end-to-end flow is operational.

Sign-off requirement before migration is considered complete. The migration is not finalized until all verification steps pass.
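The sign-off rule can be expressed as a simple gate: every consumer must resolve only post-migration addresses and must pass a test delivery. The sketch below is one way to encode that, with hypothetical names throughout. How `resolved_ips` and `test_delivery_ok` are collected from each consumer is left to the runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerCheck:
    name: str
    resolved_ips: frozenset[str]  # what the consumer's resolver returns right now
    expected_ips: frozenset[str]  # addresses of the post-migration endpoint
    test_delivery_ok: bool        # result of an end-to-end test webhook


def blocking_issues(check: ConsumerCheck) -> list[str]:
    """List the reasons this consumer blocks migration sign-off."""
    issues = []
    if not check.resolved_ips:
        issues.append(f"{check.name}: no DNS answer")
    stale = check.resolved_ips - check.expected_ips
    if stale:
        issues.append(f"{check.name}: stale DNS answers {sorted(stale)}")
    if not check.test_delivery_ok:
        issues.append(f"{check.name}: test webhook delivery failed")
    return issues


def migration_signed_off(checks: list[ConsumerCheck]) -> bool:
    """Sign off only if there is at least one check and none reports an issue."""
    return bool(checks) and all(not blocking_issues(c) for c in checks)
```

Treating an empty check list as a failure is deliberate: a consumer missing from the list was exactly the gap in this incident, so "nothing was checked" should never read as "everything passed".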

BTCPay configuration review

We reviewed the BTCPay server’s configuration thoroughly and identified several improvements beyond the immediate fix:

DNS configuration tightened. The systemd-resolved configuration now respects DNS TTLs strictly. No aggressive caching.

Health check expansion. The BTCPay health check endpoint now reports more detailed status including webhook target connectivity.

Logging expansion. BTCPay’s webhook delivery logs are now more verbose for debugging purposes.

Backup webhook configuration. Critical webhooks now have backup destinations. If the primary fails, the backup receives the notification with appropriate routing.

Customer communication standardization

The communication during the incident was relatively ad-hoc. We now have standardized communication templates for different incident types and severity levels.

The standardization includes:

Initial customer notification within 30 minutes of confirmed incident (versus more than 90 minutes after the first ticket during this incident).

Status updates every 2 hours during ongoing incidents.

Resolution communication immediately after restoration.

Post-incident summary within 48 hours of resolution.

The improved communication discipline reduces customer anxiety and demonstrates operational quality.

Internal system redundancy

The single-system webhook target was a single point of failure. The post-incident architecture includes redundancy:

The webhook target now has multiple receivers behind a load balancer. Failure of one receiver does not affect availability.

The receivers cross-replicate state. Webhooks received by one are visible to all.

The system can handle webhook delivery to any one of several IPs based on DNS round-robin.

The redundancy makes single-point-of-failure scenarios less consequential.
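With several receivers and manual redelivery in play, the same invoice notification can arrive more than once, so activation should be safe to repeat. The sketch below shows the idea with an in-memory set standing in for the replicated state. The class and method names are hypothetical, and a real deployment would need an atomic insert in the shared store rather than a local set.

```python
from typing import Callable


class ActivationLedger:
    """Remember which invoices have already triggered an activation."""

    def __init__(self) -> None:
        # Stand-in for state that the receivers replicate between them.
        self._activated: set[str] = set()

    def activate_once(self, invoice_id: str, activate: Callable[[str], None]) -> bool:
        """Run `activate` for the first notification only; return True if it ran."""
        if invoice_id in self._activated:
            return False  # duplicate delivery, already handled
        activate(invoice_id)
        # Record only after success so a failed activation can be retried.
        self._activated.add(invoice_id)
        return True
```

Recording the invoice only after `activate` succeeds means a crash mid-activation results in a retry on the next delivery rather than a silently skipped customer.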

What we learned about BTCPay specifically

The incident produced specific BTCPay operational insights.

Webhook reliability requires monitoring

BTCPay’s webhook delivery is generally reliable but not guaranteed. Production systems depending on webhook delivery need their own monitoring for the integration.

The BTCPay-side webhook logs help diagnose issues but do not prevent them. External monitoring of the actual integration flow is operationally important.

Configuration drift can be silent

The systemd-resolved DNS configuration drift was invisible until the migration triggered the symptoms. Configuration that produces no immediate problems can have failure modes that emerge under specific conditions.

Regular configuration audits catch drift before incidents occur. The auditing cadence should match the operational criticality of the system.

Self-hosted requires operational discipline

The benefits of self-hosted BTCPay (no fees, no custodial dependency, full control) come with operational responsibility. Our team is the only line of defense for issues like this one.

The discipline required is bounded but real. Operators considering self-hosted BTCPay should plan for the operational responsibility.

Bitcoin network reliability is high

Throughout the incident, the Bitcoin network itself operated normally. Payments arrived as expected. Confirmations occurred on schedule. The issue was in our integration with BTCPay, not in the underlying Bitcoin infrastructure.

The reliability of the underlying network is one of the operational benefits of Bitcoin payment processing.

Lightning Network was unaffected

Lightning Network payments worked throughout the incident. The webhook delivery issue affected on-chain payment notifications but not Lightning. The architectural separation between Lightning and on-chain payment handling produced this isolation.

For operations heavily dependent on Lightning, the incident would have been less impactful.

What we tell other BTCPay operators

For other operators running BTCPay Server in production, several patterns matter.

Monitor the webhook integration

In our experience, the webhook integration to downstream systems is the most likely place for a production BTCPay deployment to fail. Active monitoring of webhook delivery catches issues much faster than reactive support tickets.

Have backup notification paths

For business-critical integrations, having a backup notification path beyond just webhooks provides resilience. Polling for invoice state changes, scheduled reconciliation processes, manual review checkpoints all serve as backup.
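A scheduled reconciliation can be as simple as a set difference with a grace period. The sketch below assumes the job can obtain, by whatever polling mechanism is available, the paid invoices known to BTCPay with their paid timestamps and the invoice IDs the internal system has already processed. The function name and the 5-minute default are illustrative.

```python
def unreconciled(
    btcpay_paid: dict[str, float],  # invoice ID -> Unix seconds when marked paid
    internal_ids: set[str],         # invoice IDs the internal system has processed
    now: float,
    grace_s: float = 300.0,         # allow normal webhook delivery time first
) -> list[str]:
    """Return paid invoices the internal system is still missing after the grace period."""
    return sorted(
        invoice_id
        for invoice_id, paid_at in btcpay_paid.items()
        if invoice_id not in internal_ids and now - paid_at > grace_s
    )
```

Anything this returns is either an alert or a candidate for activation through the backup path. Because it does not depend on the webhook reaching its destination, it would have flagged this incident even while every delivery attempt was failing.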

Document the deployment configuration

The specific configuration of the BTCPay deployment (DNS settings, network configuration, integration paths) should be documented. The documentation enables faster diagnosis when issues occur and supports operational continuity.

Plan for operational continuity

The self-hosted BTCPay deployment depends on the operator’s team. Plan for operational continuity: documentation, on-call procedures, runbook for common issues, escalation paths for complex issues.

Invest in proper logging

BTCPay’s logging is comprehensive but only useful if it is accessible during incidents. Ensure logs are accessible, retained appropriately, and queryable when needed.

Periodic operational reviews

The BTCPay infrastructure benefits from periodic operational reviews. Configuration drift, dependency updates, security patches all need attention. The review cadence should match operational priorities.

Test recovery procedures

The procedures for recovering from various failure modes should be tested periodically. The first time you execute a recovery procedure should not be during a real incident.

The broader operational pattern

This incident reflects a broader operational pattern with self-hosted infrastructure.

Self-hosted infrastructure has more failure modes than expected

Self-hosted gives full control. Full control means full responsibility for operational quality. Failure modes that managed services hide become visible.

The trade-off: lower direct costs and better strategic positioning versus higher operational responsibility. For privacy-aligned operators, the trade-off favors self-hosted. For operations without operational capability, managed services may be the better fit.

Configuration drift is invisible until it is not

Configurations that work today may fail under future conditions. The conditions are not always predictable. Regular auditing catches drift before failures occur.

Integration monitoring matters

The seams between systems are often where failures occur. Monitoring the integration points is operationally important and often more valuable than monitoring the individual systems.

Customer communication is an operational deliverable

During incidents, the technical resolution matters but the customer communication is what customers experience directly. Professional communication during incidents preserves customer relationships even when the technical outcome is imperfect.

Post-mortems produce sustainable improvement

The post-incident review is where the cost of the incident produces value. The procedural improvements emerging from this incident will prevent similar incidents or detect them faster in the future.

The cost of the incident was real (€1,500-2,000 estimated). The cost of the post-incident improvements was bounded (approximately 40 hours of engineering time). The investment in improvements is worthwhile relative to expected future incidents.

What this changed about our broader operations

Beyond the specific BTCPay improvements, the incident affected broader operational practices.

Webhook integration patterns

Other webhook integrations in our infrastructure received review. Several smaller improvements were identified and implemented. The cumulative reliability of webhook-based integrations improved across our infrastructure.

DNS configuration review

The DNS caching issue on BTCPay’s server prompted review across other systems. Several other systems had similar caching configurations that were tightened.

Monitoring investment

The monitoring gap that allowed the incident to develop for 4+ hours before detection prompted broader monitoring investment. Specific monitoring for integration points across our infrastructure now exists.

Customer communication discipline

The improved customer communication templates created for this incident are now used for all incident types. The discipline is consistent across our operations.

The customer relationship outcome

The 18 affected customers’ relationships with us are largely intact. The incident did not produce systemic customer churn. Several factors contributed:

Proactive communication during the incident.

Substantive remediation including refunds for cancellations.

Honest post-incident summary shared with affected customers.

Continued service quality after the incident.

Some customers explicitly noted that the way we handled the incident strengthened their confidence in us. The honest handling of imperfect situations builds trust.

A few customers indicated they would think harder about cryptocurrency payment methods given the experience. We do not consider this irrational; the incident did produce a poor experience even if the cause was technical.

What we expect for future operations

Looking forward at BTCPay operations:

Continued production use. The incident did not change our commitment to BTCPay as our preferred Bitcoin payment processor.

Improved monitoring catching issues earlier. The new monitoring infrastructure detects issues within minutes rather than hours.

Continued operational discipline. The procedural improvements from this incident remain in effect going forward.

Periodic operational reviews. The schedule for regular operational review of BTCPay configuration is established.

Sustained customer trust. The customer relationships affected by this incident continue. Future incidents may occur but the response capability is stronger now.

The honest summary

The incident was operational. The cause was bounded (DNS caching configuration). The diagnosis was challenging (6+ hours of investigation). The remediation was straightforward (5 minutes of fix).

The customer impact was bounded but real. The 18 affected customers experienced service delays that did not match our typical reliability. Two customers ultimately cancelled. Most accepted the explanation and continued.

The post-incident work produced sustainable improvements. The monitoring, procedural, and architectural changes will prevent similar incidents or detect them faster in the future. The cost of the improvements is bounded; the benefit accumulates across future operations.

For other operators reading this: the patterns are generalizable. Self-hosted infrastructure brings full responsibility. Monitoring matters. Configuration drift is silent. Customer communication during incidents is an operational deliverable. Post-incident discipline produces sustainable improvement.

We continue running self-hosted BTCPay. This 14-hour degradation does not change the operational case for self-hosted Bitcoin payment processing. The case for self-hosted involves operational responsibility that we accept and continue to invest in. The incident validated rather than challenged the operational philosophy.

For customers reading this who paid via Bitcoin during the incident: the service is operating normally. The reliability is restored. The improvements from this experience benefit all customers going forward. The work to make this better continues.