I started writing this post because last month we rotated DKIM keys for a customer and watched their authentication pass rate drop from 99.6% to 73% for about six hours before we caught and corrected the issue. The root cause was a propagation delay we should have anticipated and a verification step we should have included earlier in the sequence. This is the kind of operational mistake that happens to teams that are doing rotation correctly in principle but missing one detail in execution.

DKIM key rotation is treated as a routine operational task in most documentation. The documentation is technically correct and operationally incomplete. What follows is the actual sequence we use, what we have learned from doing this across many customer environments, and the specific failure modes that produce the kind of outage I described above.

If you have not rotated DKIM keys in production before, this post is a detailed procedure you can follow step by step. If you have, the value is probably in the failure modes section, because the things that go wrong are not the things you read about in vendor documentation.

Why rotate at all

The argument for rotation has three parts. The first is cryptographic hygiene. A DKIM signature is a proof that the message was signed by the holder of the private key matching the published public key. If the private key is compromised, an attacker can sign mail that appears to come from your domain, indistinguishable from genuine mail. Rotation limits the window during which a compromise is exploitable. The longer a key is in use, the longer the exposure window if it leaks.

The second part is forward compatibility. DKIM key sizes have been climbing slowly. In the 2010s, 1024-bit keys were the standard. Microsoft started flagging 1024-bit signatures in 2022 and is widely expected to begin rejecting them by 2026. The recommended size is now 2048-bit. Some implementations support 4096-bit but the marginal benefit is small and signature sizes become unwieldy. If your operational practice includes regular rotation, upgrading key sizes is a routine event. If your operational practice is “set it once and forget it,” upgrading is a crisis.

The third part is operational hygiene. The act of rotating regularly forces you to keep your DKIM tooling working. We have onboarded customers whose previous team had the rotation tooling but it had not been used in three years. By the time it was needed, nobody remembered the command sequence, the PowerMTA configuration file had drifted, and the DNS provider had changed without the deployment scripts being updated. Regular rotation, even when not strictly required, keeps the muscle memory and the tooling fresh.

The M3AAWG recommendation, which is the de facto industry guidance, is rotation every six months for keys at 2048-bit and every three months for keys at 1024-bit. We rotate every four to six months for most customers. We rotate more aggressively (every two to three months) for customers with high deliverability stakes who want shorter exposure windows.

The actors involved in a rotation

Before walking through the sequence, it helps to understand what is moving and what stays still during a rotation.

The private key lives on the sending infrastructure. For PowerMTA-based setups, the private key is in a file the MTA process reads at startup, typically in /etc/pmta/keys or similar. For ESP-based setups, the private key is held by the ESP and not accessible to the customer. For self-hosted Postfix or other open-source MTAs, the private key is in a file the relevant module (opendkim, milter, or built-in) reads.

The public key lives in DNS. It is published as a TXT record at selector._domainkey.yourdomain.com, where selector is a label you choose. Common selectors are descriptive (mail, news, default), date-based (2024q1, may2023), or completely arbitrary. We recommend descriptive selectors that include the year, like email2024, because they make rotation history visible at a glance.

The receiver caches the public key after the first lookup. The cache duration is determined by the DNS TTL on the TXT record. Common TTLs are 3600 (one hour), 14400 (four hours), or 86400 (one day). Receivers also have their own behavior on top of the TTL. Some respect it strictly. Some hold cached records longer for performance reasons. Some refresh more aggressively.

In flight, you have mail that has been signed but not yet delivered. This is the category that makes rotation tricky. When a receiver verifies a signature, it looks up the public key matching the selector specified in the signature header. If the signature was made with the old private key and signed with the old selector, the receiver needs to find the old public key in DNS. If you have already removed the old public key from DNS, the verification fails.

The sequence in detail

Here is the full rotation sequence we use for a customer environment running PowerMTA. The same logic applies to other MTAs with different commands, but the steps are conceptually identical.

Day 0: Pre-rotation audit

Before generating any new key, we audit the current state. We confirm the existing selector is correct in DNS. We confirm PowerMTA is signing with the expected private key. We capture the current DKIM signature pass rate from the last 30 days of aggregate reports as a baseline. We identify any unusual signing patterns, like mail that was being signed by a key we did not know about. We document the current TTL on the existing _domainkey record.

This audit takes thirty to ninety minutes. It surfaces surprises about half the time. The most common surprise is that the customer has more selectors published than they realized, sometimes because a previous rotation left an old selector in DNS that nobody removed. We clean these up at the audit stage so we are not making changes against a confused baseline.

Day 1: Generate the new key pair

We generate the new key pair on the management host, not on the production MTA. The reason is that we want the key creation under version control and audit trail before it touches production. The command, for OpenSSL, is straightforward:

openssl genrsa -out private-newselector.pem 2048
openssl rsa -in private-newselector.pem -pubout -outform DER 2>/dev/null | openssl base64 -A > public-newselector.txt

The output is a private key file and a single-line public key that we will publish in DNS. The new selector name is chosen at this stage. We use a date-based name like email-202305 to make the rotation history transparent in DNS.

Day 2: Publish the new public key in DNS

The new selector goes into DNS as a TXT record at email-202305._domainkey.yourdomain.com. The TXT record value is the standard DKIM format: “v=DKIM1; k=rsa; p=” followed by the base64 public key from public-newselector.txt. We use a TTL of 3600 (one hour) for the new record. Short TTLs let us roll back faster if something goes wrong, but make caching slightly less efficient. The trade-off is worth it during rotation windows.
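One detail worth sketching here: a single string inside a TXT record is limited to 255 characters, and a 2048-bit key produces a value longer than that, so the value has to be published as several quoted strings that receivers concatenate. The helper names below are ours for illustration, and the key material is a placeholder, not a real key. Some DNS providers do this splitting for you, so treat it as a sketch of what ends up in the zone rather than a required step.

```python
def dkim_txt_value(public_key_b64: str) -> str:
    """Build the TXT record value for an RSA DKIM public key."""
    # Drop any whitespace or line breaks left over from copy and paste.
    key = "".join(public_key_b64.split())
    return f"v=DKIM1; k=rsa; p={key}"


def split_txt_strings(value: str, limit: int = 255) -> list[str]:
    """Split a TXT value into strings of at most `limit` characters."""
    return [value[i:i + limit] for i in range(0, len(value), limit)]


if __name__ == "__main__":
    # Placeholder only: 392 base64 characters is the typical length of a
    # 2048-bit RSA public key in the format produced by the openssl commands.
    placeholder_key = "A" * 392
    for chunk in split_txt_strings(dkim_txt_value(placeholder_key)):
        print(f'"{chunk}"')
```

The printed lines are the quoted strings as they would appear, in order, in a zone file entry for the selector.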

We publish the new record but do not yet change anything on the MTA. The old selector and old key are still being used for signing. The new selector exists in DNS but nothing references it. We wait.

Day 4: Verify propagation

We wait at least 48 hours after publishing the new TXT record before we switch the MTA to start signing with the new key. The 48 hours is to ensure that the new public key has propagated to every authoritative and secondary DNS server, and that any receivers who happened to look up the selector during the window have it cached.

We verify propagation by querying the new selector from multiple geographically dispersed resolvers. We use a combination of Cloudflare’s public resolver (1.1.1.1), Google’s (8.8.8.8), Quad9 (9.9.9.9), and several open resolvers in different regions. We expect all of them to return the new public key. If any return NXDOMAIN, the propagation is incomplete and we wait longer.
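A minimal sketch of that check, assuming the dig command is installed on the management host. The function names and the resolver list are illustrative, not our production tooling, and the sketch ignores edge cases such as selectors published through a CNAME.

```python
import subprocess

# The public resolvers named above; add regional resolvers as needed.
RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def query_txt(name: str, resolver: str) -> str:
    """Return the TXT value one resolver gives for `name`, or "" if none."""
    result = subprocess.run(
        ["dig", "+short", "TXT", name, f"@{resolver}"],
        capture_output=True, text=True, timeout=10, check=False,
    )
    # dig prints long TXT records as several quoted strings; rejoin them.
    return result.stdout.strip().replace('" "', "").replace('"', "")


def published_key(txt_value: str) -> str:
    """Extract the p= tag from a DKIM TXT value, or "" if it is missing."""
    for tag in txt_value.split(";"):
        name, _, value = tag.strip().partition("=")
        if name.strip() == "p":
            return "".join(value.split())
    return ""


def lagging_resolvers(answers: dict[str, str], expected_key: str) -> list[str]:
    """List resolvers whose answer does not carry the expected public key."""
    return [r for r, answer in answers.items()
            if published_key(answer) != expected_key]


def check_propagation(name: str, expected_key: str,
                      resolvers: list[str] = RESOLVERS) -> list[str]:
    """Query every resolver and return the ones that are still behind."""
    answers = {r: query_txt(name, r) for r in resolvers}
    return lagging_resolvers(answers, expected_key)
```

Calling check_propagation("email-202305._domainkey.yourdomain.com", key) with the contents of public-newselector.txt should return an empty list before the MTA is switched. Any resolver in the result means waiting longer.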

This is the step that was skipped in the customer rotation I mentioned at the start of this post. The new record was published and signing was switched the next morning, eighteen hours later. Some downstream resolvers had not yet picked up the new record, and signatures from the new key were failing verification at receivers using those resolvers. The fix was to wait longer next time, which we now do.

Day 4 (after verification): Switch MTA to sign with new key

This is the actual change that takes effect. In PowerMTA, the configuration looks something like:

domain-key newselector,yourdomain.com,/etc/pmta/keys/private-newselector.pem

The configuration is reloaded with pmta reload. The MTA picks up the new key without dropping connections or restarting. Within minutes, outbound mail is being signed with the new key under the new selector.

We immediately verify by sending a test message from the MTA to a receiver we control (a Gmail account, a Microsoft 365 account, a Yahoo account). We pull the message source. We confirm the DKIM-Signature header contains the new selector. We confirm the signature validates when we manually check the public key in DNS.
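The header check can be scripted once the message source has been saved. This is a small sketch using Python's standard email module. The function name and the sample message are made up for illustration, and it only reads the d= and s= tags; it does not verify the signature itself.

```python
from email import message_from_string


def dkim_selectors(raw_message: str) -> list[tuple[str, str]]:
    """Return (domain, selector) for every DKIM-Signature header found."""
    message = message_from_string(raw_message)
    found = []
    for header in message.get_all("DKIM-Signature", []):
        tags = {}
        for part in str(header).split(";"):
            name, sep, value = part.partition("=")
            if sep:
                # Folded headers contain line breaks and tabs; strip them.
                tags[name.strip()] = "".join(value.split())
        found.append((tags.get("d", ""), tags.get("s", "")))
    return found


# Illustrative message source with placeholder hash and signature values.
SAMPLE = (
    "DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yourdomain.com;\n"
    "\ts=email-202305; h=from:to:subject; bh=placeholder; b=placeholder\n"
    "From: sender@yourdomain.com\n"
    "To: test@example.net\n"
    "Subject: rotation check\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(dkim_selectors(SAMPLE))
```

If the selector in the output is still the old one after the reload, the MTA did not pick up the new configuration and the rollback described next is the safe move.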

If anything looks wrong at this point, we roll back to the old key by reverting the PowerMTA config and reloading. The old key is still valid because we have not yet deprecated it.

Day 4-11: Soak period with both keys valid

For at least seven days after switching signing to the new key, we leave the old public key published in DNS. This is the in-flight mail consideration. Mail signed with the old key shortly before the switch may be delivered seconds later or, if it sits in a retry queue, many hours later. In every case the receiver needs the public key that matches the selector in the message’s DKIM-Signature header. Receivers verify on delivery, and the time between signing and delivery can be anywhere from milliseconds to many hours depending on retries.

The seven-day soak is conservative but it covers essentially all realistic delivery delays. Some sending MTAs keep retrying deferred mail for as long as 72 hours before giving up. Beyond 72 hours of in-flight mail, the volume is small enough that we accept some authentication loss in exchange for cleaner DNS.

During the soak period, we monitor the DKIM pass rate from aggregate reports. We expect to see the new selector showing up in the reports as messages with the new selector start being received. We expect the old selector pass rate to taper off as the old in-flight mail clears. After 5-7 days, the old selector should be essentially absent from new reports.
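To make "essentially absent" concrete, here is one way the taper could be checked, assuming you already have per-day message counts keyed by selector. The 0.1% threshold and the two quiet days are illustrative assumptions, not numbers from our procedure, and the function names are hypothetical.

```python
def selector_share(counts: dict[str, int], selector: str) -> float:
    """Fraction of one day's DKIM-verified messages that used `selector`."""
    total = sum(counts.values())
    return counts.get(selector, 0) / total if total else 0.0


def old_selector_drained(daily_counts: list[dict[str, int]],
                         old_selector: str,
                         threshold: float = 0.001,
                         quiet_days: int = 2) -> bool:
    """True when the old selector stayed under `threshold` on each of the
    most recent `quiet_days` days of reports."""
    recent = daily_counts[-quiet_days:]
    if len(recent) < quiet_days:
        return False
    return all(selector_share(day, old_selector) < threshold for day in recent)


if __name__ == "__main__":
    # Made-up daily counts, oldest first.
    days = [
        {"default": 900, "email-202305": 100},
        {"default": 5, "email-202305": 9995},
        {"email-202305": 10000},
    ]
    print(old_selector_drained(days, "default"))
```

A check like this is only as good as the report coverage behind it, so we would still read the reports rather than trust a single boolean.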

Day 11+: Deprecate the old selector

Once the old selector is no longer appearing in aggregate reports for any meaningful share of incoming verifications, we remove the old TXT record from DNS. The deprecation is permanent. If we ever wanted to re-use the same selector name in the future, we would need to be aware that the old public key is gone and any mail signed against it (which should not exist) would fail.

We do not delete the old private key file from the management host immediately. We keep it for an additional 30 days as a safety in case we need to investigate any anomaly in the aggregate reports that references the old key.
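The timeline above reduces to a few date additions. The sketch below computes the earliest safe time for each later step from the moment the new record is published, using the 48-hour, 7-day, and 30-day figures from this post as defaults. The class and function names are illustrative. These are minimums: the propagation and report checks still decide whether to actually advance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RotationSchedule:
    publish_new_key: datetime
    earliest_switch: datetime
    earliest_old_record_removal: datetime
    earliest_old_private_key_deletion: datetime


def plan_rotation(publish_at: datetime,
                  switch_delay: timedelta = timedelta(hours=48),
                  soak: timedelta = timedelta(days=7),
                  key_retention: timedelta = timedelta(days=30)) -> RotationSchedule:
    """Derive the earliest time for each step from the DNS publish time."""
    switch = publish_at + switch_delay
    removal = switch + soak
    return RotationSchedule(publish_at, switch, removal, removal + key_retention)


if __name__ == "__main__":
    schedule = plan_rotation(datetime(2023, 5, 2, 9, 0))
    for name, when in vars(schedule).items():
        print(f"{name}: {when:%Y-%m-%d %H:%M}")
```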

The failure modes we have seen

Across roughly fifty rotations we have done, here are the things that have actually gone wrong.

Failure 1: TTL mismatch on the old record

We had a customer whose old DKIM record had a TTL of 86400 (one day) when we started the rotation. Some receivers had cached the old key for the full day. We waited 48 hours for the new key to propagate but did not wait for the old key’s cache to age out. We then switched signing. Some receivers were still operating on the old cached key and were now seeing signatures from a different key than expected. The verification failures lasted about 12 hours for a small percentage of mail before clearing.

The lesson is that during a rotation, all relevant TTLs need to be aligned with the rotation schedule. We now reduce the TTL on the old DKIM record from whatever it was (often 14400 or 86400) to 3600 about 48 hours before we begin the rotation. This forces receivers to refresh their cached version of the old key within an hour, so by the time we switch signing, the cached records are short-lived and any inconsistency clears quickly.

Failure 2: Multiple selectors signing without operator awareness

We onboarded a customer whose MTA was correctly signing with the selector they thought it was using. They also had a marketing automation service sending under their domain that signed with a different selector via a CNAME. The marketing service’s selector was three years old, used a 1024-bit key, and the customer had no record of who had set it up.

We discovered this only because aggregate reports surfaced the third-party selector. The rotation we did for the customer’s primary key was clean. The third-party selector remained at 1024-bit until we tracked down the marketing service, identified the team responsible, and arranged a rotation on their side.

The lesson is that DKIM is not single-source. Any third party signing under your domain has its own rotation lifecycle that may or may not be aligned with yours. You discover these through aggregate reports. You manage them through periodic audits, not just rotation events.

Failure 3: The configuration that lost its signing rule

A customer’s PowerMTA configuration had been edited by a previous administrator in a way that put the domain-key directive inside a scope that did not apply to all outbound mail. Specifically, the directive was inside a <virtual-mta> block that only applied to one of several VMTAs. Mail going through the other VMTAs was being sent unsigned.

The customer had not noticed because most of their volume went through the VMTA with the signing rule, and the receivers were not aggressive about rejecting unsigned mail when other authentication was in place. After we did a rotation for them, the aggregate reports showed a long tail of unsigned mail we had not predicted. The remediation was to move the domain-key directive out of the VMTA scope to the global scope, where it applies to all outbound mail regardless of routing.

The lesson is that the rotation surfaces drift in your existing configuration. Use the rotation as an opportunity to audit not just the key material but the surrounding configuration.
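A rough sketch of how this kind of drift could be flagged automatically. It scans a configuration file for domain-key lines that sit inside a <virtual-mta> block. It assumes one directive per line and # for comments, and the sample configuration is made up for illustration, so treat it as a starting point rather than a PowerMTA parser.

```python
import re

OPEN_VMTA = re.compile(r"^\s*<virtual-mta\s+([^>\s]+)\s*>", re.IGNORECASE)
CLOSE_VMTA = re.compile(r"^\s*</virtual-mta\s*>", re.IGNORECASE)
# Tolerates an optional leading "<" in case the file uses a bracketed form.
DOMAIN_KEY = re.compile(r"^\s*<?domain-key\b", re.IGNORECASE)


def scoped_domain_keys(config_text: str) -> list[tuple[int, str]]:
    """Return (line number, VMTA name) for each domain-key line that
    appears inside a <virtual-mta> block instead of the global scope."""
    findings = []
    current_vmta = None
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        opened = OPEN_VMTA.match(line)
        if opened:
            current_vmta = opened.group(1)
        elif CLOSE_VMTA.match(line):
            current_vmta = None
        elif current_vmta and DOMAIN_KEY.match(line):
            findings.append((number, current_vmta))
    return findings


# Illustrative configuration: only vmta1 signs, vmta2 sends unsigned.
SAMPLE = """\
<virtual-mta vmta1>
    smtp-source-host 192.0.2.10 mail1.yourdomain.com
    domain-key email2024,yourdomain.com,/etc/pmta/keys/private-email2024.pem
</virtual-mta>
<virtual-mta vmta2>
    smtp-source-host 192.0.2.11 mail2.yourdomain.com
</virtual-mta>
"""

if __name__ == "__main__":
    print(scoped_domain_keys(SAMPLE))
```

A finding is not automatically a bug, since scoping a key to one VMTA can be deliberate. It is a prompt to confirm that every other VMTA is covered by some signing rule.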

Failure 4: The selector that picks up legacy mail

We had a customer who had used the selector name “default” historically. When we rotated, we wanted to move away from “default” and onto a date-based name. We generated the new key, published the new selector, and switched signing. The customer also had a forgotten mail queue on a secondary MTA that was still using the old configuration with selector “default” and the old private key.

The secondary MTA still held some emails in its retry queue that had been signed years before with the old key. When it retried delivery, the signatures verified correctly because the old “default” selector was still in DNS during the soak period. After we deprecated the old selector, the retries started failing authentication. The volume was tiny, fewer than thirty messages, but it was a surprise.

The lesson is that any infrastructure that touches your sending domain needs to be in the rotation plan, including dormant or backup MTAs that might have queued mail.

Failure 5: The receiver that cached for longer than the TTL

DNS specifications say receivers should respect TTLs. In practice, some receivers cache for longer than the TTL specifies, either deliberately for performance reasons or because of bugs in their resolver implementation. We have observed cache durations of up to 48 hours on records with TTLs of 3600.

This bit us once when we shortened a rotation cycle to test whether we could move faster. We did everything within 24 hours from publishing the new record to switching signing. Some receivers had still cached the old public key for longer than the published TTL and were verifying against the now-stale cached key when we switched signing.

The lesson is that the safe rotation window has to be longer than the longest realistic cache duration, not the published TTL. We use 48 hours as the minimum gap between publishing new records and changing what we sign with. We have not had a related failure since.

The aggregate reports as the primary feedback channel

I keep mentioning aggregate reports because they are the single most useful diagnostic tool during a rotation and most teams underuse them.

A DMARC aggregate report is an XML document the receiver sends to the email address you specified in the rua= tag of your DMARC record. The report covers a window (typically 24 hours) and lists, for each source IP that sent mail under your domain, how many messages were seen, how many passed SPF, how many passed DKIM, how many aligned with the From domain, and the policy that was applied.

During a rotation, the relevant fields are the DKIM result, the DKIM domain, and the DKIM selector, which the report format carries as an optional field alongside the other two. After switching signing to the new key, you expect the new selector to start appearing in the reports as new mail comes in and gets verified. After the soak period, you expect the old selector to be essentially absent. If the old selector is still showing up at any meaningful volume after seven days, you have a source of mail that did not switch to the new key when you expected it to.
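For a sense of what reading those fields looks like, here is a minimal sketch that tallies message counts per selector and result from one report, using only the standard library. It assumes the report has already been decompressed and does not declare an XML namespace (some receivers add one). The function name and the sample report are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def dkim_counts_by_selector(report_xml: str) -> Counter:
    """Tally message counts per (selector, result) in one aggregate report."""
    counts = Counter()
    root = ET.fromstring(report_xml)
    for record in root.iter("record"):
        messages = int(record.findtext("row/count", default="0"))
        for dkim in record.iterfind("auth_results/dkim"):
            # The selector element is optional, so allow for its absence.
            selector = dkim.findtext("selector", default="(not reported)")
            result = dkim.findtext("result", default="unknown")
            counts[(selector, result)] += messages
    return counts


# Trimmed, illustrative report with two source IPs.
SAMPLE_REPORT = """\
<feedback>
  <record>
    <row><source_ip>192.0.2.10</source_ip><count>120</count></row>
    <auth_results>
      <dkim><domain>yourdomain.com</domain><selector>email-202305</selector><result>pass</result></dkim>
    </auth_results>
  </record>
  <record>
    <row><source_ip>192.0.2.11</source_ip><count>3</count></row>
    <auth_results>
      <dkim><domain>yourdomain.com</domain><selector>default</selector><result>fail</result></dkim>
    </auth_results>
  </record>
</feedback>
"""

if __name__ == "__main__":
    for (selector, result), n in dkim_counts_by_selector(SAMPLE_REPORT).items():
        print(selector, result, n)
```

Summing these counters across a day of reports from all receivers gives the per-selector picture described above.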

The reports lag by 24-48 hours. They are not real-time. For real-time monitoring during a rotation, we send a test message from each known sending source after the switch and check its DKIM headers. The aggregate reports catch what the test messages miss.

If you do not have aggregate reports configured, the first step before any rotation is to publish a DMARC record with rua= pointing at a mailbox you can actually parse. The reports are XML and need either tooling or patience to read at scale. There are open-source parsers (parsedmarc is the standard one) that turn the XML into something more usable.

What I would do differently if I were designing this from scratch

We have built our rotation tooling over years of doing this. The current state is fine but reflects a lot of accumulated history. If I were starting from scratch today, here is what I would do differently.

I would parameterize the entire rotation as a state machine. Each step is a state. Each state has entry checks and exit checks. The state machine cannot advance until exit checks pass. This is more rigorous than the manual sequence we currently use, where the operator decides whether to advance based on their judgment.
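A minimal sketch of what I have in mind, not something we run today. The step names mirror the sequence in this post. The class, its methods, and the example check are hypothetical, and real checks would call out to DNS, the MTA, and the report pipeline.

```python
from enum import Enum
from typing import Callable, Optional

Check = Callable[[], bool]


class Step(Enum):
    AUDIT = 1
    GENERATE_KEY = 2
    PUBLISH_DNS = 3
    VERIFY_PROPAGATION = 4
    SWITCH_SIGNING = 5
    SOAK = 6
    DEPRECATE_OLD = 7
    DONE = 8


class Rotation:
    """Linear state machine that refuses to advance while a check fails."""

    def __init__(self,
                 entry_checks: Optional[dict[Step, list[Check]]] = None,
                 exit_checks: Optional[dict[Step, list[Check]]] = None):
        self.step = Step.AUDIT
        self.entry_checks = entry_checks or {}
        self.exit_checks = exit_checks or {}

    def advance(self) -> Step:
        if self.step is Step.DONE:
            return self.step
        next_step = Step(self.step.value + 1)
        # Exit checks of the current step run first, then entry checks
        # of the step we want to move into.
        pending = (self.exit_checks.get(self.step, [])
                   + self.entry_checks.get(next_step, []))
        failed = [getattr(c, "__name__", repr(c)) for c in pending if not c()]
        if failed:
            raise RuntimeError(
                f"cannot leave {self.step.name}: failed checks {failed}")
        self.step = next_step
        return next_step
```

The point of the design is that the operator's judgment moves into the checks, where it can be reviewed, instead of living in the decision to press on.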

I would build a rotation history database. Every rotation we have done would be a row in the database, with the keys (public, not private), the selectors, the dates, the customers affected, and any anomalies observed. This would make it possible to spot patterns across rotations that we currently spot only through experience.
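As a sketch of how small this could start, here is a single SQLite table with one row per rotation. The column names and helper functions are hypothetical, and only the public key is stored, never the private one.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS rotations (
    id             INTEGER PRIMARY KEY,
    customer       TEXT NOT NULL,
    domain         TEXT NOT NULL,
    old_selector   TEXT,
    new_selector   TEXT NOT NULL,
    new_public_key TEXT NOT NULL,
    published_at   TEXT NOT NULL,  -- ISO 8601 timestamps
    switched_at    TEXT,
    old_removed_at TEXT,
    anomalies      TEXT
)
"""


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_rotation(conn: sqlite3.Connection, customer: str, domain: str,
                    new_selector: str, new_public_key: str, published_at: str,
                    old_selector: Optional[str] = None,
                    anomalies: Optional[str] = None) -> int:
    """Insert one rotation at publish time and return its row id."""
    cursor = conn.execute(
        "INSERT INTO rotations (customer, domain, old_selector, new_selector,"
        " new_public_key, published_at, anomalies) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (customer, domain, old_selector, new_selector,
         new_public_key, published_at, anomalies),
    )
    conn.commit()
    return cursor.lastrowid
```

The later timestamps (switched_at, old_removed_at) would be filled in by updates as the rotation progresses, which also gives you the soak duration for free.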

I would integrate with DNS providers directly via API rather than treating DNS changes as manual. We have done both manual and API-based DNS in different customer environments. The API-based environments are cleaner because the change is atomic, version-controlled, and timestamped. The manual environments have more drift.

I would build aggregate report ingestion into the rotation workflow. Currently we check the aggregate reports separately from the rotation tooling. If they were integrated, the rotation could detect anomalies automatically rather than relying on the operator to notice them.

The case for rotating even when not strictly required

I want to close with the argument I made at the beginning more pointedly. The reason to rotate regularly is not just to limit cryptographic exposure. It is to keep your operational practice sharp.

We have onboarded customers whose previous administrators had not touched DKIM in five years. The DNS records were in place. The keys were 1024-bit. The signatures were verifying because receivers had not yet broadly rejected 1024-bit. The customer’s deliverability was fine. There was no immediate problem.

The problem was that when we needed to rotate because of the impending Microsoft enforcement, nobody knew how the keys had been generated, where the private keys lived, who had access to them, or what would happen if they were rotated. The institutional knowledge had decayed to zero. Doing the first rotation in five years was a multi-week project that involved tracking down old administrators, reading commit logs from forgotten repositories, and reconstructing the deployment process from scratch.

If the same customer had been rotating every six months, they would have all of that infrastructure live and working. The rotation would have been a routine evening’s work, not a multi-week archaeology project. The cost of rotating regularly when not strictly required is a few hours every six months. The cost of needing to rotate after years of neglect is much higher.

This is the argument I make to customers who ask whether they really need to rotate every six months when their key is fine. The answer is that you need to rotate every six months not because the key is broken but because rotation is a competence that decays without practice. Keep the competence sharp.

What to take from this if you are about to rotate

If you are about to rotate DKIM and you have not done it before, the most important takeaways are these.

Wait at least 48 hours between publishing the new public key and switching signing. The biggest failure mode is propagation delay you did not allow for.

Keep both selectors valid for at least 7 days after switching signing. The second-biggest failure mode is in-flight mail signed with the old key being verified after the old key is gone.

Monitor aggregate reports continuously through the rotation. The third-biggest failure mode is a sending source you did not know existed signing with an unexpected selector.

Test the rotation in a non-production sending domain before doing it in production. Most customers do not have this luxury, but if you do, it is worth a few hours of work to catch issues against a low-stakes target.

Document every rotation. Future you will not remember what selector you used last time or why you chose it. Future you will not remember which day in the soak you saw the last traffic on the old selector. Write it down.

The procedure I described is conservative. There are faster procedures. We choose conservative because the failure mode of being too fast is mail that does not deliver, and the cost of being a few days slower is essentially zero. Pick conservative until you have done enough rotations to know when you can safely cut corners.

If you are running our SMTP infrastructure, we handle this for you on a regular cadence. If you are running your own and you have questions, the contact page is the way to get them answered.