The incident began on a Wednesday in late September. A customer’s transactional email to Gmail recipients started failing at a rate near 100% within an hour of our scheduled DKIM key rotation. Other recipients were largely unaffected. The customer’s user-facing flows that depended on transactional email (account confirmation, password reset, payment receipts) broke for the duration.
Total impact: 48 hours from incident start to full remediation. Estimated business impact for the customer: €18,000 in lost transactions, delayed customer onboarding, and support response load. Our internal cost: roughly 35 hours of engineering time across the team.
This post-mortem documents the incident, the cause analysis, the remediation, and what we changed about our rotation procedures afterward. Customer name anonymized; the operational specifics are real.
Background: the DKIM rotation schedule
The customer is a mid-size SaaS sending transactional email at approximately 80K messages per day across multiple sending domains. We operate their PowerMTA infrastructure on dedicated servers with our managed services covering deliverability operations.
Our standard practice includes quarterly DKIM key rotation for all customers. The rotation produces new private keys, new public keys, new DNS records, and a gradual transition from old to new keys over a 7-14 day overlap window. The transition is intended to be invisible to the customer and to receiving mailbox providers.
We have performed approximately 200 DKIM rotations across our customer base over the past two years. The procedure has been reliable. This incident was the first significant failure.
The procedural sequence we follow
Our DKIM rotation procedure (the procedure that existed before this incident):
Day 1: New key generation
Generate new RSA-2048 key pair. Create new DKIM selector identifier (typically the current month/year identifier).
Publish new public key in DNS. The new key is now resolvable but PowerMTA is still signing with the old key.
Verify new key DNS record is propagated and resolves correctly across major DNS resolvers. Standard 4-6 hour wait for propagation.
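The Day 1 steps can be sketched in Python as follows. This is a minimal illustration, not our production tooling: the selector naming scheme, the function names, and the size threshold are assumptions made for the example. It only checks the format of a record value that has already been fetched; it does not perform the DNS lookups across resolvers.

```python
import base64
import binascii
from datetime import date


def selector_for(day: date) -> str:
    # Hypothetical naming scheme: "s" + year + month, e.g. "s202501".
    return f"s{day:%Y%m}"


def dkim_record_name(selector: str, domain: str) -> str:
    # DKIM public keys are published at <selector>._domainkey.<domain>.
    return f"{selector}._domainkey.{domain}"


def parse_tags(txt: str) -> dict[str, str]:
    # Split a "tag=value; tag=value" TXT record into a dict.
    tags = {}
    for part in txt.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            # Key data may contain whitespace when split across strings.
            tags[key.strip()] = "".join(value.split())
    return tags


def looks_publishable(txt: str, min_key_bytes: int = 256) -> bool:
    # Rough sanity check: right version, RSA key type, valid base64,
    # and enough bytes to hold a 2048-bit modulus (256 bytes).
    tags = parse_tags(txt)
    if tags.get("v", "DKIM1") != "DKIM1" or tags.get("k", "rsa") != "rsa":
        return False
    try:
        key = base64.b64decode(tags.get("p", ""), validate=True)
    except binascii.Error:
        return False
    return len(key) > min_key_bytes


# Example with an arbitrary date and the anonymized domain from this post.
name = dkim_record_name(selector_for(date(2025, 1, 15)), "mail.customer.com")
print(name)
```

A check like this catches truncated or mangled key data before any mail is signed, but it says nothing about which signing domain the MTA will use.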
Day 2-3: Begin signing with new key
Configure PowerMTA to sign with the new key on a small percentage of outbound mail (typically 10%). The remaining 90% continues signing with the old key.
Monitor for signature verification issues. Postmaster Tools, SNDS, and customer-reported feedback are reviewed.
Increase new key percentage to 50% if no issues observed.
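One way to picture the percentage split is a deterministic hash bucket per message. The sketch below is illustrative only: it is not how PowerMTA selects keys, and the function name and selector labels are hypothetical.

```python
import hashlib


def pick_selector(message_id: str, new_pct: int, old: str, new: str) -> str:
    # Hash the message ID into a stable bucket from 0 to 99.
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Buckets below the rollout percentage sign with the new key.
    return new if bucket < new_pct else old


# Roughly 10% of messages should land on the new selector.
ids = [f"msg-{i}@mail.customer.com" for i in range(10_000)]
share = sum(pick_selector(m, 10, "old", "new") == "new" for m in ids) / len(ids)
print(f"new-key share: {share:.1%}")
```

Because the bucket depends only on the message ID, a retried message keeps the same selector, which makes bounce analysis per key easier.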
Day 4-7: Full transition to new key
Increase new key percentage to 100%. Old key is no longer signing new mail but the DNS record is still published.
Monitor for delayed mail that was signed with the old key but is only now being delivered (typical 24-48 hour window).
Day 7-14: Old key wind-down
After the overlap window where mail signed with the old key may still be in transit, remove the old key DNS record.
Document the rotation completion. Mark calendar for next quarter.
What happened during this rotation
The September rotation began following the standard procedure. Day 1 produced the new key and DNS publication without issues. Day 2 began signing 10% with new key. Within an hour, the customer’s monitoring alerted: Gmail bounce rate jumped from baseline 0.05% to nearly 100% for the transactional email volume.
Within the first 30 minutes, our team was investigating. Several hypotheses were tested rapidly:
The new DKIM key was checked. The DNS record was correct. The key format matched specifications.
The PowerMTA configuration was reviewed. The signing configuration was correct.
The signed mail was inspected. Mail messages were signing properly with the new key.
The Gmail bounce responses were examined. The error was “550 5.7.26 - This message does not have authentication information or fails to pass authentication checks (SPF or DKIM)”.
This was confusing. The DKIM signing looked correct. Authentication should have been passing.
The discovery
After 40 minutes of investigation, the cause became clear. The customer’s DMARC record specified adkim=s (strict alignment) for DKIM. Strict DKIM alignment requires the signing domain to exactly match the From header domain.
The customer’s From header used notifications@mail.customer.com. The old DKIM signing was using mail.customer.com as the signing domain (exact match, strict alignment passes). The new DKIM signing was using customer.com as the signing domain (the parent domain, which satisfies relaxed alignment but not strict alignment).
The cause: during new key generation, the signing domain configuration was set to customer.com instead of mail.customer.com. The change was inadvertent, made during a configuration update earlier that day for a different customer issue.
The signing was technically valid (the signature itself verified correctly). The alignment failed at the DMARC level. Gmail with the customer’s adkim=s DMARC policy refused the mail.
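The difference between strict and relaxed DKIM alignment can be shown in a few lines of Python. This is a simplified sketch: the function names are ours, and the organizational domain is approximated as the last two labels, whereas a real implementation would consult the Public Suffix List.

```python
def org_domain(domain: str) -> str:
    # Assumption for illustration: the organizational domain is the last
    # two labels. Real DMARC evaluators use the Public Suffix List.
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def dkim_aligned(from_domain: str, signing_domain: str, adkim: str = "r") -> bool:
    f = from_domain.lower().rstrip(".")
    d = signing_domain.lower().rstrip(".")
    if adkim == "s":
        # Strict: the d= domain must equal the From domain exactly.
        return f == d
    # Relaxed: sharing the organizational domain is enough.
    return org_domain(f) == org_domain(d)


# The old configuration, then the misconfigured one, under adkim=s.
print(dkim_aligned("mail.customer.com", "mail.customer.com", "s"))  # True
print(dkim_aligned("mail.customer.com", "customer.com", "s"))       # False
# The same misconfiguration would have passed under relaxed alignment.
print(dkim_aligned("mail.customer.com", "customer.com", "r"))       # True
```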
The customer’s other major destinations (Microsoft, Yahoo) evaluated the same adkim=s policy but enforced DMARC less aggressively during the incident. Some mail still landed in the inbox there during the failure period. Gmail’s stricter enforcement caused near-complete rejection.
The immediate remediation
Once the cause was identified, the fix was clear: restore the signing domain to mail.customer.com to match the strict DMARC alignment requirement.
The fix was applied within 5 minutes of identification. Mail signing resumed with the correct domain. New mail signed correctly and passed Gmail’s DMARC check.
The remediation completion took longer than the technical fix:
Queued mail had been failing for the 45 minutes from incident start to fix. Approximately 6,000 messages were in PowerMTA’s retry queue when the fix went live. These messages would be re-attempted with correct signing on their next retry. The retry schedule meant some would be retried within minutes, others would wait hours.
For messages already returned to the application as failures, the application would treat them as permanent bounces. The customer’s application needed to re-send these. We coordinated with the customer to identify which messages had been treated as bounced and required re-sending.
For some messages, the customer’s application logic did not auto-retry but instead surfaced the failure to end users (e.g., “we could not send your password reset email”). These users had a degraded experience and required separate communication.
The full impact remediation took an additional 47 hours. Some of this was waiting for retry attempts to clear. Some was active coordination with the customer about which specific messages needed manual intervention. Some was the customer’s own customer-facing communication and support response.
The €18,000 estimate
The €18,000 estimated impact breaks down approximately as:
Lost transactions during the failure window: customers attempting actions that required transactional email confirmation either gave up, called support, or returned the next day. Some fraction of these never converted. The customer estimated €11,000 in lost transactional revenue based on their typical conversion patterns.
Delayed onboarding: new customers signing up during the failure window could not complete onboarding (email confirmation required). About 30% of these returned later; 70% were lost. Estimated impact €4,500 in delayed/lost onboarding.
Support response load: customer support had to handle elevated ticket volume from the failure. The cost included support team overtime and the opportunity cost of support resources unavailable for other customer issues. Estimated €1,500.
Customer-facing communication: the customer sent apology emails and offered some make-good incentives to affected users. The communication and incentive cost approximately €1,000.
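As a quick arithmetic check, the four components above can be summed and expressed as shares of the total. The dictionary keys are our own shorthand labels.

```python
# Impact components in euros, as estimated by the customer.
impact_eur = {
    "lost_transactions": 11_000,
    "delayed_onboarding": 4_500,
    "support_load": 1_500,
    "customer_communication": 1_000,
}

total = sum(impact_eur.values())
shares = {name: round(100 * value / total, 1) for name, value in impact_eur.items()}

print(total)   # 18000
print(shares)  # percentage share of each component
```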
The customer absorbed the full €18,000 cost. Our managed services contract did not specify deliverability availability SLAs that would have shifted financial responsibility. We provided 6 weeks of service credit as goodwill but the financial impact was the customer’s.
What we changed in our procedures
The post-mortem produced specific changes to our DKIM rotation procedure.
Pre-flight verification of DMARC alignment requirements
Before any rotation, we now verify the customer’s current DMARC alignment requirements. The check covers:
- DMARC adkim setting (relaxed vs strict)
- DMARC aspf setting (relaxed vs strict)
- Current DKIM signing domain vs From header domain
- Whether strict alignment is in effect
This verification is documented in the rotation runbook and required before proceeding.
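A pre-flight check along these lines can be sketched in Python. It is a minimal illustration under stated assumptions: the function names and the report keys are hypothetical, the DMARC record is passed in as a string rather than looked up in DNS, and the example record is illustrative rather than the customer's actual record. The defaults of relaxed alignment when adkim or aspf is absent follow the DMARC specification.

```python
def parse_dmarc(txt: str) -> dict[str, str]:
    # Split "v=DMARC1; p=...; adkim=s" into a tag dictionary.
    tags = {}
    for part in txt.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def preflight(dmarc_txt: str, from_domain: str, signing_domain: str) -> dict:
    tags = parse_dmarc(dmarc_txt)
    adkim = tags.get("adkim", "r").lower()  # relaxed when the tag is absent
    aspf = tags.get("aspf", "r").lower()    # relaxed when the tag is absent
    exact_match = from_domain.lower() == signing_domain.lower()
    return {
        "adkim": adkim,
        "aspf": aspf,
        "strict_dkim": adkim == "s",
        "exact_match": exact_match,
        # Under strict alignment, anything but an exact match must stop
        # the rotation. Relaxed mode still needs an organizational-domain
        # check, which this sketch does not perform.
        "blocks_rotation": adkim == "s" and not exact_match,
    }


# Illustrative record; only adkim=s is taken from the incident.
report = preflight("v=DMARC1; p=reject; adkim=s", "mail.customer.com", "customer.com")
print(report["blocks_rotation"])  # True
```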
Explicit signing domain confirmation
The signing domain configuration is now explicitly verified by two team members before the rotation begins. The configuration is documented in the rotation work order. The verification confirms the signing domain matches the customer’s strict alignment requirements.
Staged percentage rollout with verification gates
Previously, we started at 10% and ramped to 100% over days. Now we have explicit verification gates at each percentage step:
10% → wait 30 minutes → verify bounce rates, complaint rates, and Postmaster Tools data → only proceed if metrics are normal.
50% → wait 2 hours → same verification.
100% → wait 24 hours → final verification.
If any gate shows abnormal metrics, the rotation is paused and investigated before proceeding.
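The gate schedule can be expressed as data plus a small decision function. This sketch only encodes the percentages and wait times listed above; the names and return strings are hypothetical, and deciding whether metrics are "normal" is left to the operator or to monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    new_key_pct: int   # share of mail signed with the new key
    wait_minutes: int  # observation time before verification


# 10% for 30 minutes, 50% for 2 hours, 100% for 24 hours.
GATES = [Gate(10, 30), Gate(50, 120), Gate(100, 1440)]


def next_action(gate_index: int, metrics_normal: bool) -> str:
    # Any abnormal metric pauses the rotation, regardless of the step.
    if not metrics_normal:
        return "pause and investigate"
    if gate_index + 1 < len(GATES):
        return f"proceed to {GATES[gate_index + 1].new_key_pct}%"
    return "rotation verified"


print(next_action(0, True))   # proceed to 50%
print(next_action(1, False))  # pause and investigate
print(next_action(2, True))   # rotation verified
```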
Faster rollback capability
The rotation runbook now includes an explicit rollback procedure. Within 5 minutes of detecting a problem, we can revert PowerMTA to signing with the prior key.
Previously, rollback required reconfiguration that took 15-20 minutes. The streamlined rollback minimizes the impact window if a rotation produces unexpected problems.
Customer notification before rotation
Previously, DKIM rotations were operationally transparent to customers. Now, we notify customers 48 hours before a planned rotation with:
- The rotation timing
- The expected technical impact (none, normally)
- The escalation contact for any deliverability concerns during the rotation window
The notification produces minor customer overhead but allows the customer’s team to be prepared for any issues.
Monitoring threshold tightening
Our monitoring during rotations now alerts on smaller deviations from baseline. Previously, bounce rate increases needed to reach 2% to alert. During rotations, the threshold is 0.5%. The tighter threshold catches issues 15-30 minutes faster than the previous threshold.
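The threshold switch can be sketched as follows. This assumes the thresholds apply to the absolute bounce rate over a monitoring window; the function and constant names are hypothetical.

```python
NORMAL_THRESHOLD = 0.02     # 2% outside rotation windows
ROTATION_THRESHOLD = 0.005  # 0.5% while a rotation is in progress


def should_alert(bounced: int, attempted: int, in_rotation: bool) -> bool:
    # No traffic in the window means there is nothing to evaluate.
    if attempted == 0:
        return False
    rate = bounced / attempted
    threshold = ROTATION_THRESHOLD if in_rotation else NORMAL_THRESHOLD
    return rate >= threshold


# A 1% bounce rate alerts during a rotation but not otherwise.
print(should_alert(10, 1000, in_rotation=True))   # True
print(should_alert(10, 1000, in_rotation=False))  # False
```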
Post-rotation review
Within 24 hours of rotation completion, we now do a formal post-rotation review including:
- Bounce rate before and after
- Complaint rate before and after
- Postmaster Tools reputation indicators
- Customer-reported issues if any
- Operational notes for future rotations
The review documentation is reviewed during the next quarter’s rotation planning to capture institutional learning.
What this incident revealed about our broader practice
Beyond the specific rotation procedure changes, the incident produced broader practice changes.
Configuration changes outside scheduled windows
The signing domain misconfiguration happened because an engineer made a configuration change for a different customer during an unrelated investigation. The change inadvertently affected the rotation procedure.
We now require all configuration changes (not just rotations) to follow a change management process. Changes outside scheduled maintenance windows require explicit authorization and documentation.
The change management process is heavier-weight than our previous “make changes as needed” approach. Some operational agility is lost. The agility loss is acceptable given the incident demonstrated the risk.
Customer-facing communication during incidents
The customer’s experience during the 48 hours included some unclear communication on our side. They were not always sure what we knew, what we were doing, or what to expect. The unclear communication amplified their stress and made their internal coordination harder.
We now have a clear customer communication protocol during incidents:
- Initial notification within 15 minutes of confirmed incident detection
- Hourly updates during active investigation
- Customer-facing summary at incident resolution
- Detailed post-mortem within 7 days
The customer can choose their preferred communication channel (email, Slack, Telegram, phone) and frequency. The structured communication eliminates the previous ambiguity.
Documentation visibility
Our DKIM rotation runbook existed but was internal documentation. The customer in this incident was not aware of our rotation schedule, did not know what the rotation was supposed to look like, and had no way to assess whether the symptoms they were seeing matched expected rotation behavior.
We now share customer-relevant portions of operational documentation with customers. The customer can review their rotation schedule, the expected behavior, and the rollback procedures. The shared documentation enables better customer-side preparation.
Insurance and liability discussion
The incident produced a discussion about service level agreements and insurance. Our previous contracts did not specify deliverability availability with associated financial responsibility. Customers absorbed direct business impact from incidents like this.
After this incident, we updated our contracts to include:
- Deliverability availability target (99.5% measured monthly)
- Service credit calculations for downtime below target
- Customer responsibility for business impact recovery (insurance, internal coordination)
- Annual review of liability allocation
The contractual updates do not eliminate the risk of incidents. They make the financial responsibility for impacts clearer to both parties.
What we would have done differently
Looking back at the specific incident, several decisions could have produced better outcomes.
Verifying DMARC alignment requirements before rotation
The fundamental error was not verifying that the customer’s DMARC was in strict alignment mode before changing the signing domain. A pre-rotation check would have caught the misconfiguration before any mail was sent.
This is now standard procedure. It would have prevented the entire incident.
Lower percentage initial rollout
Starting at 10% was the standard. But 10% of 80K daily messages is 8K messages per day from a major sender. Even a brief failure produces a meaningful volume of bounced mail.
For customers with very high transactional volumes, starting at lower percentages (1-2%) might produce earlier signal of problems with less impact volume.
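The exposure at each starting percentage is simple arithmetic on the customer's approximate daily volume. The function name is ours, and the figures are daily totals, not per-hour rates.

```python
DAILY_VOLUME = 80_000  # approximate messages per day for this customer


def exposed_per_day(pct: float) -> int:
    # Messages per day signed with the new key at a given rollout share.
    return round(DAILY_VOLUME * pct / 100)


for pct in (1, 2, 10):
    print(f"{pct}% rollout: {exposed_per_day(pct)} messages/day")
```

At 1-2%, the new key touches 800 to 1,600 messages per day instead of 8,000, so the same failure would reach roughly a tenth to a fifth of the mail.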
Faster escalation
After the 10% rollout, about 40 minutes passed between the start of the failures and identifying the cause. The escalation pattern was reasonable but could have been faster with more aggressive automated checks.
The monitoring threshold tightening addresses some of this. Additional automated checks (like comparison of signing domain to DMARC requirements before each rotation step) would have caught the issue automatically.
Better customer pre-notification
The customer was not aware the rotation was happening that day. The lack of awareness meant they could not have any team members on standby during the rotation window. The 48-hour pre-notification we now use addresses this.
The customer relationship post-incident
The customer relationship continued positively despite the incident. Several factors contributed to the positive outcome:
Honest communication. We acknowledged the incident immediately and provided clear cause analysis. No attempts to minimize the impact or blame external factors.
Substantive remediation. The service credit was meaningful (6 weeks of service). The procedure changes were documented and shared. The customer could see we were taking the incident seriously.
Continued operational quality. After the incident, we ran another DKIM rotation 2 months later. The rotation completed without issues. The customer saw the procedural improvements working in practice.
Transparency about the broader practice. We shared the incident details and procedure changes more broadly than just with this customer. The customer felt their incident produced systemic improvements that benefited other customers, not just remediation for them specifically.
The customer is still with us. The relationship is stronger than before the incident, not weaker. The strength comes from the honest handling rather than from any attempt to minimize the incident.
What other operators should learn from this
The specific incident details may not apply to other operators directly. The pattern lessons do apply broadly.
DMARC alignment is a configuration variable that interacts with DKIM signing in ways that can produce surprising failures. Changes to either DKIM signing or DMARC enforcement need verification against each other.
Strict DMARC alignment (adkim=s or aspf=s) is more aggressive than relaxed alignment and correspondingly less forgiving of configuration mistakes. Operators with strict alignment have less margin for error.
DKIM rotation procedures need verification gates at each step. Continuous progression from 10% to 100% over days is acceptable; jumping percentages without verification is risky.
Customer communication during incidents matters as much as technical remediation. Customers who feel informed are easier to work with than customers who feel kept in the dark.
Post-mortems should produce procedure changes, not just documentation. The incident response value is the procedural improvement that prevents the next incident.
Service credits and liability allocations should be in contracts before incidents happen. Reactive credit negotiations during an incident are harder than proactive credit calculations from existing terms.
Our current rotation track record
Since the incident in September, we have performed 47 DKIM rotations across our customer base using the updated procedures. Zero incidents. Zero customer-affecting bounces. Zero rollback events.
The updated procedures are operationally heavier than the prior procedures. The verification gates, customer notifications, configuration confirmations, and monitoring tightening all add work. The work is acceptable given the alternative.
For other operators running similar DKIM rotation operations, the cost-benefit of more rigorous procedures is favorable. The incident cost was €18,000 in customer impact plus 35 hours of our engineering time. The ongoing procedural cost is roughly 2-3 hours per rotation in additional verification time. By our estimate, the investment pays for itself if it prevents one incident of this size every several years.
We continue rotating DKIM keys quarterly. The schedule continues. The procedures are now more reliable. The customer relationships continue. The next incident may come from a different cause, but it will not come from the cause that produced this incident.