A major mailbox provider experienced a partial outage starting shortly before 9 AM UTC on Thursday, July 17. The provider’s MX servers were accepting connections but returning 4xx temporary failures on most submissions. The outage lasted approximately 6 hours before normal acceptance resumed.
Our customer queues grew accordingly. The backlog peaked at approximately 2.4 million messages across the affected customer base, about an hour after normal acceptance resumed. Draining the queue took roughly 4 hours after the provider’s recovery.
The incident forced several operational decisions that surprised us in their complexity. When to throttle our retry behavior, when to communicate with customers, and whether to migrate volume to other providers all required judgment under uncertainty about the outage duration.
This post is the timeline, the decisions, the surprises, and what we changed about our outage response procedures afterward.
The provider outage timeline (external)
The mailbox provider’s outage as observed from our infrastructure:
8:47 AM UTC: First 4xx responses observed in PowerMTA accounting logs. The pattern was distinct from baseline rejections: a high rate of 421 deferral responses across many message types.
9:12 AM UTC: 4xx rate on this provider’s mail at 67%. Our monitoring threshold for “external incident” triggered. Investigation began.
9:38 AM UTC: 4xx rate at 89%. Queue growth visible in our dashboard. We confirmed the issue was at the provider, not on our side.
10:15 AM UTC: Provider’s status page acknowledged delivery issues. Confirmed external incident.
11:30 AM UTC: 4xx rate plateaued at 92%. The provider was accepting some mail but rejecting most.
1:45 PM UTC: 4xx rate began declining. Provider apparently recovering.
3:08 PM UTC: 4xx rate back to baseline (under 2%). Provider’s status page marked incident resolved.
Total outage duration from our observation point: 6 hours 21 minutes (8:47 AM to 3:08 PM UTC).
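The detection step in this timeline can be sketched as a simple rate check. This is a minimal illustration, not our actual monitoring code: the `WindowCounts` type, the 50% threshold, the minimum-traffic floor, and the sample counts are all assumptions made for the example. The post only establishes that the alert had fired by the time the 4xx rate reached 67%.

```python
from dataclasses import dataclass

# Hypothetical values: the real threshold is not stated in this post.
EXTERNAL_INCIDENT_THRESHOLD = 0.50
MIN_ATTEMPTS = 1000  # hypothetical floor so tiny samples do not trigger alerts


@dataclass
class WindowCounts:
    """Delivery-attempt counts for one provider in one time window."""
    delivered: int
    deferred_4xx: int
    bounced_5xx: int

    @property
    def attempts(self) -> int:
        return self.delivered + self.deferred_4xx + self.bounced_5xx


def deferral_rate(w: WindowCounts) -> float:
    """Fraction of attempts that ended in a 4xx temporary failure."""
    return w.deferred_4xx / w.attempts if w.attempts else 0.0


def is_external_incident(w: WindowCounts) -> bool:
    """True when the window has enough traffic and the 4xx rate meets the threshold."""
    return w.attempts >= MIN_ATTEMPTS and deferral_rate(w) >= EXTERNAL_INCIDENT_THRESHOLD


# Illustrative counts only: a baseline window (1.5% deferred) and a 67% window.
baseline = WindowCounts(delivered=9800, deferred_4xx=150, bounced_5xx=50)
outage = WindowCounts(delivered=3100, deferred_4xx=6700, bounced_5xx=200)
print(is_external_incident(baseline))  # False
print(is_external_incident(outage))    # True
```

The traffic floor matters for low-volume destinations, where a handful of deferrals could otherwise look like an outage.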
Our queue growth during the outage
The visible impact in PowerMTA queues:
9:00 AM UTC: 12K messages queued for this provider. Normal operational level.
9:30 AM UTC: 28K messages queued. Growth visible but not yet concerning.
10:00 AM UTC: 67K messages queued. Growth accelerating as provider 4xx rate climbed.
10:30 AM UTC: 140K messages queued. Crossed our “elevated queue” threshold for this provider.
11:00 AM UTC: 285K messages queued. Triggered automated alerts to operations team.
11:30 AM UTC: 410K messages queued. Provider 4xx rate plateaued so growth rate slowed.
12:00 PM UTC: 530K messages queued. Growth from new submissions plus continued failed delivery attempts.
1:00 PM UTC: 760K messages queued. Significant queue size.
2:00 PM UTC: 1.1M messages queued. Approaching the upper range of our capacity planning.
3:00 PM UTC: 1.6M messages queued. Recovery had started but retries were not yet succeeding at the recovered rate.
4:00 PM UTC: 2.4M messages queued. Peak. The provider was accepting mail again, but our backlog still had to be processed.
7:00 PM UTC: Queue drained to baseline levels.
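The growth rate between readings is easier to see when computed directly. The sketch below uses only the queue sizes listed above; the function name and output format are illustrative choices.

```python
from datetime import datetime

# (time in UTC, queued messages) taken from the timeline above.
SAMPLES = [
    ("09:00", 12_000), ("09:30", 28_000), ("10:00", 67_000), ("10:30", 140_000),
    ("11:00", 285_000), ("11:30", 410_000), ("12:00", 530_000), ("13:00", 760_000),
    ("14:00", 1_100_000), ("15:00", 1_600_000), ("16:00", 2_400_000),
]


def hourly_growth(samples):
    """Return (interval_end, messages_per_hour) for each consecutive pair of samples."""
    rates = []
    for (t0, q0), (t1, q1) in zip(samples, samples[1:]):
        delta = datetime.strptime(t1, "%H:%M") - datetime.strptime(t0, "%H:%M")
        hours = delta.total_seconds() / 3600
        rates.append((t1, (q1 - q0) / hours))
    return rates


for end, rate in hourly_growth(SAMPLES):
    print(f"up to {end}: {rate:+,.0f} messages/hour")
```

By this calculation, growth dips from 290,000 messages/hour (10:30 to 11:00) to 250,000 messages/hour (11:00 to 11:30) as the 4xx rate plateaus, and the fastest growth, 800,000 messages/hour, falls in the hour before the 4:00 PM peak.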
The decisions we faced
The outage produced several decisions that required judgment.
Decision 1: Wait or migrate?
The customer queues were growing. Some customers had time-sensitive transactional mail (password resets, account confirmations) that was significantly delayed.
Option A: wait for the provider to recover. Continue normal retry pattern. Messages would deliver once the provider recovered.
Option B: migrate the affected mail to alternative providers (where applicable). Some customers had backup sending paths through other providers.
We chose Option A for most customers. The reasoning:
The provider’s outage was visibly resolving. Status updates suggested recovery within hours, not days.
Migration to alternative providers introduces reputation risk for the new path. The receivers would see a sudden surge of mail from an unfamiliar source.
Some mail had customer-specific receiver routing that did not have alternative paths.
The migration work itself takes time. By the time a broad migration was complete, the provider would likely have recovered.
For two specific customers with critical transactional requirements and pre-existing backup paths, we did migrate. The migrations completed in the second hour of the outage and reduced their customer-facing impact.
Decision 2: Reduce retry aggressiveness?
PowerMTA’s default retry pattern keeps retrying with escalating delays. During an extended outage, retries can contribute to the provider’s recovery load.
Option A: keep default retry behavior. Continue trying to deliver. Some messages will succeed during the outage even if most fail.
Option B: dynamically adjust retry behavior. Increase the retry interval significantly so we attempt fewer deliveries per unit time. Reduces our contribution to the provider’s recovery load.
Option C: pause sending to this provider entirely. Hold messages in queue without retry attempts. Resume when provider recovers.
We chose Option B. The reasoning:
Some messages were succeeding even during the outage. Stopping retry entirely would prevent those successes.
Aggressive retries from many senders were likely adding to the provider’s load. Cutting our retry rate by roughly 80% would be a small but additive contribution to easing it.
Customer mail volume was being preserved (PowerMTA holds the messages). The reduction in retry aggressiveness did not lose any mail.
We dynamically adjusted retry interval from default (15-30 minutes) to 90-120 minutes during the outage. The adjustment ramped back to default as recovery progressed.
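The roughly 80% figure follows from the interval change. A quick check, assuming retries are spread evenly across each range so that the midpoint is representative (that assumption is ours for this sketch, not a statement about PowerMTA’s scheduling):

```python
def attempts_per_hour(min_minutes: float, max_minutes: float) -> float:
    """Retry attempts per message per hour, using the midpoint of the interval range."""
    midpoint = (min_minutes + max_minutes) / 2
    return 60 / midpoint


default_rate = attempts_per_hour(15, 30)    # midpoint 22.5 minutes
incident_rate = attempts_per_hour(90, 120)  # midpoint 105 minutes
reduction = 1 - incident_rate / default_rate

print(f"default:  {default_rate:.2f} attempts/hour per message")
print(f"incident: {incident_rate:.2f} attempts/hour per message")
print(f"reduction: {reduction:.0%}")  # 79%
```

Comparing the endpoints instead of the midpoints gives a range of 75% (30 to 120 minutes) to 83% (15 to 90 minutes).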
Decision 3: Customer communication
The outage was external (mailbox provider, not us). The visible impact at our customers was delayed delivery for some recipients.
Option A: communicate proactively to all affected customers about the outage.
Option B: communicate only to customers experiencing significant impact.
Option C: wait for customer inquiries and respond reactively.
We chose Option A. The reasoning:
Proactive communication respects the customer’s need to know what is happening with their infrastructure.
Customers who learn from us are better positioned than customers who learn from their own customers asking why password resets are slow.
The communication overhead is bounded (one notification to all affected customers).
We sent an initial communication at 10:30 AM UTC explaining the external outage, the impact, and our response. Updates went out at 12:00 PM, 1:30 PM, 3:00 PM, and 7:00 PM as the situation evolved.
Customer feedback on the communication was overwhelmingly positive. Several customers explicitly noted appreciation for being informed proactively.
Decision 4: Queue capacity management
The queue grew to 2.4M messages. PowerMTA can handle large queues, but performance degrades as the queue grows. The growth was approaching the upper range of our capacity planning.
Option A: scale infrastructure horizontally. Add more PowerMTA capacity to handle the larger queue.
Option B: triage the queue. Identify low-priority mail and delay or drop it to preserve capacity for higher-priority mail.
Option C: wait through the peak. The queue would drain once provider recovered.
We chose Option C. The infrastructure was capable of holding the queue size. Performance was degrading but acceptable. Adding capacity for a temporary peak that would drain naturally would have been over-engineering.
The decision tested our infrastructure capacity assumptions. The peak queue size was within our capacity planning but we had not previously seen actual peak load at this level.
What surprised us
Several aspects of the incident surprised us in operational terms.
Surprise 1: Customer reactions varied widely
Some customers were highly concerned and contacted us proactively despite our communications. Others did not notice the outage at all (their mail volumes are low enough that delivery delays did not produce visible business impact).
The variation reflects the different operational profiles. Customers with time-sensitive transactional mail noticed immediately. Customers sending newsletters with flexible delivery windows did not notice.
The implication: customer communication should be tailored. Some customers want detailed real-time updates; others want occasional summaries. We learned to ask customers their communication preferences explicitly.
Surprise 2: Queue size affected reporting tools
Our internal monitoring dashboards were not optimized for queue sizes above 1M messages. Several reports took significantly longer to load. One query timed out and required manual intervention.
The tools assumed operational queue sizes (typically 50K-500K). The 2.4M peak revealed limitations we had not encountered before. The post-incident work included tool improvements for handling larger queue sizes.
Surprise 3: Retry behavior interactions
Our manual reduction of retry aggressiveness produced unexpected interactions with PowerMTA’s automated backoff logic. PowerMTA was already extending retry intervals based on observed failure rate. Our manual extensions stacked with PowerMTA’s automatic extensions, producing longer-than-intended intervals.
We have since refined the manual intervention process. Instead of layering manual adjustments on PowerMTA’s automatic behavior, we now override PowerMTA’s automatic adjustments with explicit values for the duration of incident response.
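The difference between layering and overriding can be shown with a toy model. This is not PowerMTA’s backoff algorithm: the multiplicative factors, the 20-minute base, and the function itself are hypothetical, chosen only to show why stacked adjustments overshoot while an explicit value does not.

```python
from typing import Optional


def effective_interval(base_min: float, auto_factor: float,
                       manual_factor: float = 1.0,
                       override_min: Optional[float] = None) -> float:
    """Retry interval in minutes under a toy model of layered adjustments."""
    if override_min is not None:
        # Explicit value wins; automatic and manual factors are ignored.
        return override_min
    return base_min * auto_factor * manual_factor


# Layered: a manual 5x extension on top of an automatic 3x extension.
stacked = effective_interval(base_min=20, auto_factor=3, manual_factor=5)
# Overridden: the interval is pinned regardless of the automatic factor.
pinned = effective_interval(base_min=20, auto_factor=3, override_min=105)

print(stacked)  # 300
print(pinned)   # 105
```

In the layered case the operator intended roughly 100 minutes and got 300; in the overridden case the result is the value the operator chose.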
Surprise 4: Recovery rate slower than expected
When the provider’s outage ended, the recovery rate of our queue was slower than we expected. The provider had restored normal acceptance but their capacity was constrained for some time as their infrastructure recovered.
Our queue did not drain at the provider’s normal acceptance rate; it drained at the provider’s reduced processing capacity in the period after recovery. This produced a longer-than-expected drain time.
Future outage planning includes “recovery period” in capacity calculations. Recovery is not instantaneous even when status pages indicate the incident is resolved.
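A drain-time estimate that accounts for constrained post-recovery capacity can be sketched as follows. The 2.4M backlog comes from this incident; the acceptance and inflow rates are hypothetical numbers chosen for illustration, and the function name is ours.

```python
def drain_hours(backlog: int, accept_per_hour: float, inflow_per_hour: float) -> float:
    """Hours to clear a backlog when the provider accepts faster than new mail arrives."""
    net = accept_per_hour - inflow_per_hour
    if net <= 0:
        raise ValueError("queue cannot drain: acceptance does not exceed inflow")
    return backlog / net


BACKLOG = 2_400_000  # peak queue size from this incident
INFLOW = 300_000     # hypothetical new submissions per hour

# Hypothetical planning assumption: provider at full capacity right after recovery.
print(drain_hours(BACKLOG, accept_per_hour=1_500_000, inflow_per_hour=INFLOW))  # 2.0
# Hypothetical constrained capacity during the recovery period.
print(drain_hours(BACKLOG, accept_per_hour=1_100_000, inflow_per_hour=INFLOW))  # 3.0
```

The second case matches the shape of what we saw: a 2.4M peak at 4:00 PM draining to baseline by 7:00 PM implies a net drain of about 800,000 messages per hour.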
What we changed in our practices
The incident produced several procedural updates.
Updated outage response runbook
Our pre-incident runbook for external mailbox provider outages was minimal. The runbook now covers:
- Detection criteria and thresholds
- Investigation steps to confirm external vs internal cause
- Customer communication templates with variable severity levels
- Decision criteria for retry adjustment
- Decision criteria for traffic migration
- Recovery monitoring procedures
- Post-incident reporting procedures
The runbook ensures that future similar incidents have clearer decision frameworks rather than ad-hoc judgment.
Capacity planning for large queues
Our infrastructure can handle queue sizes up to several million messages. The performance characteristics at high queue sizes are now documented. Future infrastructure planning includes capacity for peak queue scenarios.
Tool improvements for large queues
The internal dashboards that struggled at high queue size were improved. Several reports were optimized. Query patterns that scaled poorly were rewritten. The tools now handle 5M+ message queues without performance issues.
Customer communication standardization
The communication during this incident was relatively ad-hoc. We now have templates for different incident types and severity levels. Customers receive consistent communication structure across incidents.
Provider relationship documentation
We documented the specific behaviors of each major mailbox provider during outages we have observed. The documentation supports future incident response by giving operators historical context on each provider’s pattern.
Customer-specific severity assessment
We documented per-customer severity considerations. Some customers have time-sensitive operations that require aggressive response; others have flexible operations that can tolerate delay. The documentation supports tailored response rather than uniform response.
What we learned about PowerMTA specifically
The incident produced specific PowerMTA insights.
Queue scaling
PowerMTA handles queues up to several million messages without architectural issues. Performance degrades gradually rather than failing catastrophically. The scaling behavior is favorable for incident response.
Manual override of automatic backoff
PowerMTA’s automatic backoff is generally appropriate but can be overridden manually during incidents. The override process is covered in PowerMTA’s documentation but requires familiarity to execute reliably during an incident.
The override mechanism uses configuration changes that PowerMTA can reload without restart. Changes take effect within minutes.
Per-destination queue management
PowerMTA’s per-destination queue management proved valuable during the incident. We could focus retry adjustments on the affected destination specifically without affecting traffic to other destinations.
The per-destination granularity is one of PowerMTA’s strengths during incidents like this.
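The scoping idea can be illustrated with a lookup that falls back to defaults for every destination not under incident handling. This models the concept only; it is not PowerMTA configuration syntax, and the domain name and function are hypothetical.

```python
from typing import Dict, Tuple

# Default retry range in minutes, as quoted earlier in this post.
DEFAULT_RETRY_MIN: Tuple[int, int] = (15, 30)

# Hypothetical incident overrides keyed by destination domain.
incident_overrides: Dict[str, Tuple[int, int]] = {
    "affected-provider.example": (90, 120),
}


def retry_range(destination: str) -> Tuple[int, int]:
    """Return the retry range for a destination, using the default unless overridden."""
    return incident_overrides.get(destination.lower(), DEFAULT_RETRY_MIN)


print(retry_range("affected-provider.example"))  # (90, 120)
print(retry_range("other-provider.example"))     # (15, 30)
```

Removing the entry from `incident_overrides` returns the destination to default behavior, which mirrors ramping back as recovery progresses.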
Accounting log analysis at scale
Our accounting log analysis tools handled the elevated volume during the incident. The hourly aggregations remained accurate. The per-message logs continued recording properly.
The infrastructure for log analysis was tested at scale during the incident. The tests revealed minor issues that we have since addressed but no fundamental problems.
The broader operational pattern
The incident illustrates a recurring operational pattern: external incidents produce internal decisions that require judgment under uncertainty.
The key elements:
The cause is external. Our infrastructure is functioning normally. The remediation is at the external party (in this case, the mailbox provider).
The impact is internal. Our queues grow, our customers wait, and our capacity is consumed. The cost is borne by us even though we cannot fix the underlying cause.
The duration is uncertain. We do not know whether the outage will resolve in 30 minutes or 24 hours. Decisions need to account for the duration uncertainty.
The customer expectations are real. Customers chose us for reliable delivery. External outages affect that reliability even when the cause is not ours.
The communication is the operational deliverable. We cannot fix the provider’s outage. We can communicate professionally with customers about it. The communication quality is what differentiates good operations from average operations.
What we tell customers about external incidents
Following this incident, we have a clearer message for customers about external mailbox provider outages.
External outages happen. Mailbox providers have outages at varying frequency. We have no control over their infrastructure or their incident response.
We monitor for external incidents continuously. Our monitoring detects provider outages typically within 15-30 minutes of onset.
We communicate proactively. Customers receive notification when we detect significant external incidents, with updates as the situation evolves.
We make operational decisions to optimize the customer outcome. Retry behavior, traffic management, and capacity allocation are managed actively during incidents rather than passively.
We document incidents in post-mortem form. Customers can review the documentation to understand our response and what we learned.
The honest framing: we are not immune to external incidents. We respond to them well. Customer expectations should be set accordingly.
What we expect over time
External mailbox provider outages will continue happening. The frequency varies by provider. The duration varies by incident type. The customer impact varies by the timing and the specific affected functionality.
Our improvements over time:
The detection improves. Each incident teaches us about new patterns and tightens our monitoring.
The response improves. Each incident produces runbook updates and procedural refinements.
The communication improves. Each incident teaches us about customer communication preferences and timing.
The capacity planning improves. Each incident validates or challenges our infrastructure assumptions.
The relationships improve. Customers who experience our incident response well develop deeper trust over time.
The incidents themselves are not improvements. They are operational costs. What we do with them determines whether they are pure cost or partial investment in future operations.
The honest summary
The July 17 outage cost us roughly 6 hours of operational time across our team. The customer impact was bounded by our response (most customers had minimal visible impact; some had meaningful but not catastrophic impact).
The post-incident work (runbook updates, tool improvements, documentation) cost approximately 20 hours. The investment is justified by the improvements to future incident response.
The customer relationships came through the incident unscathed and in some cases strengthened. The customers who experienced our communication and response well noted it explicitly.
For other operators reading this with “what do we do when our upstream has an outage?” thoughts: the answer is a combination of monitoring, communication, judgment, and procedural discipline. The specific decisions depend on your specific situation. The general pattern is bounded and learnable.
External outages are part of operating email infrastructure. They will happen. The work to prepare for them is bounded. The cost of being unprepared during them is also bounded but higher. The investment in preparation pays off across multiple future incidents.
We expect to handle the next external outage better than we handled this one. That is the operational arc: incidents produce learning, learning produces better preparation, preparation produces better outcomes during the next incident. The customers benefit. Our team gets more confident. The operational maturity grows over time.