The structural problem with DNS migration is that
changes do not propagate atomically. When you change
the NS records at your registrar, the parent zone
updates within minutes, but resolvers worldwide are
still caching the old NS records, each on its own
TTL window. Some resolvers pick up the change
within minutes. Others wait until their cached NS
records expire, which can take hours. During the
intermediate window, queries route to either the old
or new provider depending on which resolver the
requesting client is using. If the old and new
provider serve identical responses for every record,
this intermediate window is invisible to customers.
If the old and new provider differ on any record,
some customers see the new behaviour while others
see the old behaviour, for 4-12 hours.
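The timing described above can be sketched with a small model. This is an illustrative sketch only: the Resolver class, its field names, and the assumption that each resolver simply keeps its cached NS set until the TTL expires are hypothetical simplifications, not measurements of real resolver behaviour.

```python
from dataclasses import dataclass


@dataclass
class Resolver:
    name: str
    cached_at: float  # seconds relative to the NS change (negative = before it)
    ns_ttl: float     # TTL of the NS record set the resolver cached, in seconds


def provider_at(resolver: Resolver, t: float) -> str:
    """Return which provider the resolver queries t seconds after the NS change."""
    expires_at = resolver.cached_at + resolver.ns_ttl
    # Until the cached NS set expires, the resolver keeps using the old provider.
    return "old" if t < expires_at else "new"


def mixed_state_ends(resolvers: list[Resolver]) -> float:
    """Seconds after the change at which the last resolver moves to the new provider."""
    return max(0.0, max(r.cached_at + r.ns_ttl for r in resolvers))
```

In this model, a resolver that cached a 12-hour NS set one minute before the change stays on the old provider for almost the full 12 hours, while one whose cache was about to expire switches within minutes.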
For email infrastructure specifically, the mixed-state
window is more consequential than for web. A web
visitor whose resolver returns the old IP just gets
served a cached or older version of the site (rarely
fatal). An email message whose recipient MX lookup
returns inconsistent results may be deferred with a
temporary delivery failure, held in the sender's queue,
retried after 4-15 minutes, retried again, and
eventually either delivered or bounced as a permanent
failure. Worse,
SPF lookups during the mixed-state window can return
inconsistent SPF policies, which causes some recipient
mail servers to mark inbound mail as SPF-failed and
others to pass. DKIM lookups likewise can return
inconsistent or missing public keys during the
window. The combination produces the worst-case
deliverability scenario: random recipients see your
mail as authenticated, random others see it as
unauthenticated, and the inconsistency itself looks
to ISP filtering algorithms like a spoofing attack.
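To make the SPF divergence concrete, here is a minimal sketch of how two different policies judge the same sending IP. It is a deliberately simplified evaluator, not a full SPF implementation: it models only ip4 terms and the all term, and the function name spf_result and the example addresses are assumptions for illustration.

```python
import ipaddress


def spf_result(policy: str, sender_ip: str) -> str:
    """Evaluate a deliberately simplified SPF policy (ip4 and 'all' terms only)."""
    ip = ipaddress.ip_address(sender_ip)
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    for term in policy.split()[1:]:  # skip the "v=spf1" version tag
        qualifier = "+"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        if term.startswith("ip4:"):
            matched = ip in ipaddress.ip_network(term[4:], strict=False)
        elif term == "all":
            matched = True
        else:
            continue  # include:, a, mx and other terms are not modelled here
        if matched:
            return results[qualifier]
    return "neutral"


# Hypothetical drift: the new provider's record lists a different range.
old_policy = "v=spf1 ip4:192.0.2.0/24 -all"
new_policy = "v=spf1 ip4:198.51.100.0/24 -all"
```

With these two policies, a message sent from 192.0.2.10 passes at a recipient whose resolver reaches the old provider and fails at one whose resolver reaches the new provider.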
The parity-then-flip pattern removes the mixed-state
problem by construction. Before any NS records change
at the registrar, the new provider must already serve
identical responses to every query the old provider
serves. During the window where some resolvers are
using the old provider and others are using the new
provider, every resolver returns the same answers
because both providers have identical zone data. When
the last cached NS record expires worldwide,
everyone is using the new provider, and the migration
is complete. The customer never sees an inconsistency
because no inconsistency exists at any point during
the transition.
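The argument can be checked with a toy model. This is a sketch under stated assumptions: zones are plain dictionaries keyed by (name, type), and distinct_answers is a hypothetical helper, not part of any DNS library.

```python
def distinct_answers(zones: dict, resolver_providers: list, query: tuple) -> set:
    """Collect the distinct answers clients see across resolvers on different providers."""
    return {zones[provider].get(query) for provider in resolver_providers}


query = ("example.com.", "MX")
zone_data = {query: ("10 mail.example.com.",)}

# Parity: both providers hold identical zone data.
parity = {"old": dict(zone_data), "new": dict(zone_data)}

# Drift: the new provider answers differently for the same query.
drift = {"old": dict(zone_data), "new": {query: ("10 mx.example.net.",)}}

mixed_resolvers = ["old", "new", "old", "new"]
```

Under parity, distinct_answers returns a single answer however the resolvers are split between providers; under drift, it returns two.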
Building parity is the work this engagement actually
does. The TTL pre-staging is preparation that makes
the post-flip window short. The zone export and
import is mechanical. The DNSSEC coordination is
specific but rare. The most labour-intensive phase
is parity verification: for every hostname in the
zone, we run dig queries against both the old
authoritative nameservers and the new authoritative
nameservers and compare the responses character by
character. Mismatches happen for predictable reasons.
DKIM public keys longer than 255 characters must be
split across multiple TXT strings, and different
providers split them at different points. SPF records
with multiple include statements can hit the RFC 7208
Section 4.6.4 ten-lookup limit at different points
depending on how providers structure their records.
CNAME chains
terminate differently if one provider auto-resolves
and the other does not. Geographic load balancing
built into the old provider does not transfer
directly to a new provider with different geographic
coverage. Each mismatch gets flagged, investigated,
and resolved before delegation flips.
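A minimal sketch of the comparison step follows. It assumes the answers from each provider have already been collected (for example from dig output) into dictionaries keyed by (name, type). The helper names are hypothetical. Joining split TXT strings before comparing is one way to keep differently split DKIM keys from being flagged as mismatches, and the lookup counter covers only the top-level record, not nested includes.

```python
import re

# Terms that RFC 7208 Section 4.6.4 counts toward the limit of 10 DNS lookups.
LOOKUP_TERMS = {"include", "a", "mx", "ptr", "exists", "redirect"}


def normalize_txt(rdata: str) -> str:
    """Join the quoted strings of a TXT record so split points do not matter."""
    chunks = re.findall(r'"((?:[^"\\]|\\.)*)"', rdata)
    return "".join(chunks) if chunks else rdata.strip()


def parity_mismatches(old: dict, new: dict) -> list:
    """Return the (name, type) keys whose answers differ between providers."""
    def canon(zone: dict, key: tuple) -> list:
        values = zone.get(key, [])
        if key[1] == "TXT":
            return sorted(normalize_txt(v) for v in values)
        return sorted(values)

    return sorted(k for k in set(old) | set(new) if canon(old, k) != canon(new, k))


def count_spf_lookups(spf: str) -> int:
    """Count lookup-triggering terms in one SPF record (nested includes not followed)."""
    count = 0
    for term in spf.split()[1:]:  # skip the "v=spf1" version tag
        name = re.split(r"[:/=]", term.lstrip("+-~?").lower(), maxsplit=1)[0]
        if name in LOOKUP_TERMS:
            count += 1
    return count


old_zone = {
    ("sel._domainkey.example.com.", "TXT"): ['"v=DKIM1; p=AAAA" "BBBB"'],
    ("example.com.", "MX"): ["10 mail.example.com."],
}
new_zone = {
    ("sel._domainkey.example.com.", "TXT"): ['"v=DKIM1; p=AA" "AABBBB"'],
    ("example.com.", "MX"): ["10 mx.example.net."],
}
```

Here the DKIM record is split at different points but joins to the same key, so only the MX record is reported as a mismatch.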
The 7-day overlap window is the safety net. After
NS delegation flips at the registrar, the old
provider zone stays active and identical to the new
provider zone for 7 days. Any resolver still caching
the old NS records keeps getting correct answers.
Any unexpected mismatch discovered post-flip can be
rolled back by reverting the NS records at the
registrar (which takes effect within about 5 minutes
for resolvers honouring the pre-staged 300-second TTLs).
At day 7, after worldwide propagation has completed
and no rollback has been needed, the old provider
zone gets decommissioned and TTLs are raised back to
normal operational values (3,600-86,400 seconds) to
reduce ongoing query volume. The whole sequence is
documented in writing before execution; no
improvisation during cutover.
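The gating and timing rules in this sequence can be written down as a short checklist in code. This is a sketch only: the 300-second TTL and the 7-day overlap come from the text, while the function names and the shape of the inputs are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

PRESTAGED_TTL = 300          # seconds, the pre-staged TTL named in the text
OVERLAP = timedelta(days=7)  # overlap window named in the text


def ready_to_flip(record_ttls: dict, mismatches: list) -> bool:
    """Flip only when every TTL is pre-staged and parity verification is clean."""
    return not mismatches and all(ttl <= PRESTAGED_TTL for ttl in record_ttls.values())


def decommission_at(flip_at: datetime) -> datetime:
    """Earliest time the old provider's zone may be decommissioned."""
    return flip_at + OVERLAP


# Hypothetical flip time, used only to show the arithmetic.
flip = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
```

With this example flip time, decommission_at(flip) falls on 8 January 2024 at 09:00 UTC. A rollback during the overlap would restart the sequence rather than shorten it.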