On June 15 at 09:47 UTC, our upstream transit provider in Bucharest had a BGP session reset that lasted about forty seconds. The session re-established cleanly. Eleven hours later, at 20:53 UTC, it happened again. The second flap lasted twenty-three seconds before the session came back up. Both flaps were caused by an issue in the provider’s edge router that they later attributed to a memory pressure incident on a route-processor blade.
For most workloads, this kind of brief BGP instability is invisible. Routes converge, traffic finds new paths, in-flight TCP sessions retry. We have customer infrastructure on that pop that has flapped at the BGP level many times over the years without anyone noticing.
For email, this kind of instability has a long tail of consequences that took us about three weeks to fully unwind. The deliverability impact at peak was severe enough that we were running triage on Saturday and Sunday, working with three customers to recover their inbox placement, and ultimately changing how we structure our edge routing across all seven of our jurisdictions.
This is the post-mortem. The incident was not catastrophic and nobody lost mail permanently. The interesting parts are what the flap actually did to deliverability that nobody was expecting, and why our incident response was slower than it should have been.
What happened on the wire
The BGP session in question was between our upstream provider’s edge router in Bucharest and one of their transit suppliers. The flap caused approximately 40 seconds of partial unreachability for our IP ranges on the first event and 23 seconds on the second. During those windows, mail outbound from our RO1 pop to receivers on certain network paths failed to establish SMTP connections.
The receivers’ behavior during a connection failure is to retry. SMTP has retry logic with exponential backoff. Gmail will retry an unestablished connection in five minutes, then thirty minutes, then an hour, then four hours, and continue with increasing intervals up to roughly 72 hours before giving up and bouncing the message. Yahoo’s retry intervals are similar but not identical. Microsoft retries less aggressively in some scenarios.
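To make the schedule concrete, here is a minimal sketch that lays out the elapsed time at each retry for the four intervals quoted above. The intervals beyond the fourth retry are not given in the text, so the sketch stops there. The function names are hypothetical and this is not any provider's actual retry logic.

```python
from datetime import timedelta

# Intervals quoted in the text; later intervals are not specified there.
RETRY_INTERVALS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=1),
    timedelta(hours=4),
]


def cumulative_retry_offsets(intervals):
    """Return the elapsed time since the first failed attempt at each retry."""
    offsets = []
    elapsed = timedelta(0)
    for interval in intervals:
        elapsed += interval
        offsets.append(elapsed)
    return offsets


def first_retry_after(outage, intervals=RETRY_INTERVALS):
    """Return the 1-based index of the first retry scheduled after the outage ends.

    Returns None if every listed retry still falls inside the outage.
    """
    for index, offset in enumerate(cumulative_retry_offsets(intervals), start=1):
        if offset > outage:
            return index
    return None
```

Under these assumptions, a 40-second outage is already over by the time the first retry fires at the five-minute mark, which is why a brief flap mostly shows up as a short delay rather than as lost mail.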
What this meant operationally was that approximately 4,000 outbound messages from our RO1 pop encountered connection failures during the two flap windows combined. Of those, about 95% retried successfully within the next thirty minutes and were delivered with only a small delay. The remaining 5% retried later and were delivered, with one or two getting into the multi-hour retry territory. The mail itself was not lost.
The mail being delayed was not the problem. The problem was what the brief unreachability did to our IPs’ reputation at the receivers, and what happened next.
The reputation signal we did not expect
Reputation scoring at major mailbox providers is not a single number. It is a multi-dimensional model with inputs that include authentication pass rates, complaint rates, engagement, content scoring, sending volume patterns, and a handful of network-layer signals. The network-layer signals include things like connection success rates, TLS negotiation behavior, and consistency of routing.
What we learned, the hard way, is that some receivers treat sudden changes in connection success rate as a signal that something has changed about the sender. The question the receiver appears to be asking is this: the IP normally connects successfully 99.9% of the time, it suddenly shows a 12% failure rate over a 40-second window, and then it returns to normal. What does that mean?
From the receiver’s perspective, there are several plausible interpretations. The sender’s infrastructure could be unstable. The sender could be experiencing an attack. The sender could be a botnet whose nodes are coming online and offline. The sender could be a legitimate operator whose datacenter had a momentary issue. The receiver does not know which of these is true. The receiver’s response is to treat the IPs with slightly more skepticism for some period, applying additional content filtering and slightly higher latency on connections from those IPs.
We watched this happen in real time at Microsoft. For the next several hours, connections from our RO1 IPs to Outlook.com took 200-400ms longer to establish. Delivery rates stayed about the same, but the connection profile changed in a way that affected throughput. Yahoo was less dramatic but showed a similar pattern.
Gmail did something we did not anticipate. About six hours after the second flap, the Postmaster Tools compliance status for one of our customers running on RO1 changed from compliant to a warning state. The warning specifically mentioned “consistency of sending source” as a contributing factor. The customer’s actual sending behavior had not changed. What changed was that Gmail was reading the network-layer signal from the flap and applying it as a reduction in domain trust.
The cascade through customer reputation
Once the reputation signal kicked in, the customer-side effects started compounding.
A customer running cold outreach campaigns from RO1 sent their normal morning batch on June 16. The batch had been planned for two weeks and was identical in audience, content, and volume to dozens of prior batches that had delivered cleanly. Approximately 11% of the messages to Gmail recipients were placed in the spam folder. The historical baseline for the same campaign was 1.5%.
A customer running a B2C newsletter sent their weekly broadcast on June 16. The broadcast went to 180,000 subscribers, with the usual mix of about 60% Gmail, 20% Microsoft, 15% Yahoo, and 5% other. Open rates dropped from a typical 28% to 19% on the broadcast. Most of the drop was on Gmail, where the open rate fell from 32% to 18%. Microsoft and Yahoo opens were slightly down but within normal variance.
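The claim that most of the drop came from Gmail can be checked with a little arithmetic. The sketch below uses only the figures quoted above and assumes that each open rate is computed over all recipients in its segment, so that the overall rate is the share-weighted average of the per-provider rates. The function name is hypothetical.

```python
# Figures quoted in the text above.
SUBSCRIBERS = 180_000
GMAIL_SHARE = 0.60
OVERALL_BEFORE, OVERALL_AFTER = 0.28, 0.19
GMAIL_BEFORE, GMAIL_AFTER = 0.32, 0.18


def drop_breakdown():
    """Split the overall open-rate drop into a Gmail part and a non-Gmail part."""
    overall_drop = OVERALL_BEFORE - OVERALL_AFTER
    # Gmail's contribution is its audience share times its own rate drop.
    gmail_part = GMAIL_SHARE * (GMAIL_BEFORE - GMAIL_AFTER)
    other_part = overall_drop - gmail_part
    return {
        "overall_drop": round(overall_drop, 4),
        "gmail_part": round(gmail_part, 4),
        "gmail_share_of_drop": round(gmail_part / overall_drop, 3),
        # Average drop across the non-Gmail 40% of the list.
        "avg_non_gmail_drop": round(other_part / (1 - GMAIL_SHARE), 4),
        "lost_opens": round(SUBSCRIBERS * overall_drop),
    }
```

Under that assumption, Gmail accounts for 8.4 of the 9 points lost, or roughly 93% of the drop. The remaining 0.6 points spread across the non-Gmail audience works out to about 1.5 points per recipient segment, which is consistent with Microsoft and Yahoo being slightly down but within normal variance.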
A customer running transactional mail for an e-commerce platform did not see a drop in open rates because transactional opens are universally high. They saw something else. The customer’s support inbox started getting tickets from users saying they had not received their order confirmations. The volume was small, maybe a dozen tickets across the day, but each ticket required investigation. In every case, the order confirmation had been sent from RO1, had reached the receiver, but had been filed in the receiver’s spam folder rather than the inbox.
None of these effects would have been catastrophic in isolation. The cold outreach customer’s complaint rate did not spike. The B2C newsletter’s complaint rate stayed below 0.1%. The transactional customer’s bounce rate did not change. The mail was being delivered. It was just being delivered to a slightly worse folder for a slightly larger fraction of recipients than usual.
The trouble is that “slightly worse folder placement” is a leading indicator of further reputation degradation. If the cold outreach campaign continues at 11% spam placement, complaint rates rise as recipients in spam folders mark the few that they see as spam. If the newsletter open rate stays at 19%, engagement scoring drops over the following weeks, which becomes a multiplier on placement. The acute incident on June 15 was over in minutes. The reputation cascade had a half-life measured in weeks.
The triage we ran on June 16
By 04:00 UTC on June 16, our monitoring had flagged the elevated 4xx rates from RO1 and the connection time changes at Microsoft. The on-call engineer woke me up. We initially attributed the connection time increase to upstream congestion and the 4xx rates to a coincidental customer issue. We did not connect the two until 08:30 UTC, when the customer running cold outreach pinged us through Telegram with their morning batch numbers.
The triage sequence we ran, in order, was the following.
First, we confirmed the BGP flaps had occurred. Our upstream provider’s status page acknowledged the event. We verified our own logs showed the corresponding connection failures during the flap windows. We had the raw data.
Second, we assessed the immediate operational state. All BGP sessions were currently up. All transit was nominal. There was no ongoing instability. The acute incident was over.
Third, we needed to assess the deliverability impact. This is the part where we were slow. We had Postmaster Tools data for some customers, but not in real time. We had morning batch results, but only for one customer. We did not have an automated way to roll up inbox placement signals across the whole customer base on RO1. We were assembling the picture manually by talking to customers and pulling individual receiver-side metrics.
Fourth, we needed to decide on remediation. The options were: do nothing and let reputation recover naturally over a few days, reduce sending volume from RO1 to allow faster reputation recovery, shift customer volume to a different pop temporarily, or some combination. We picked a combination. We reduced RO1 outbound volume to 60% of normal for 48 hours and shifted the excess to BG1 in Bulgaria. We notified affected customers about the temporary routing change.
Fifth, we needed to communicate. We posted a status update at 10:30 UTC explaining the BGP event, our assessment of impact, and the temporary mitigation. We followed up with affected customers individually with the specific implications for their setup.
The triage took about six hours from the first signal to the customer notifications going out. In hindsight, the triage should have started at 04:00 UTC when the monitoring first fired, not at 08:30 UTC when the customer reached out. The reason we were slow is that we had not built the connection between network-layer signals and reputation impact into our runbooks. A 4xx rate increase from one pop was not on our list of things that automatically escalate.
What we changed afterward
The post-mortem identified five things to change. We worked on these over the following two months.
The first change was network-layer signal monitoring. We now track connection success rate from each pop to a curated set of major receivers continuously. A two-standard-deviation drop in success rate for any pop-to-receiver pair pages the on-call engineer immediately. This catches not just BGP events but TLS issues, route changes that affect specific paths, and partial reachability problems that would otherwise be invisible.
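The two-standard-deviation rule can be sketched in a few lines. This is an illustration under stated assumptions, not our production alerting code: it assumes a list of recent success-rate samples per pop-to-receiver pair, uses the sample standard deviation, and all names and example values are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous_drop(history, current, sigmas=2.0):
    """Return True if `current` is more than `sigmas` standard deviations
    below the mean of `history` (a list of recent success rates in [0, 1])."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any drop at all is outside the band.
        return current < baseline
    return current < baseline - sigmas * spread


def pairs_to_page(histories, latest, sigmas=2.0):
    """Return the (pop, receiver) pairs whose latest sample should page on-call."""
    return sorted(
        pair
        for pair, current in latest.items()
        if is_anomalous_drop(histories.get(pair, []), current, sigmas)
    )
```

With a history hovering around 99.9%, a sample of 88% (the 12% failure rate discussed earlier) falls far outside the band, while an ordinary 99.9% sample does not. One design caveat: a very stable history produces a very tight band, so a real deployment would likely want a minimum absolute drop as well to avoid paging on noise.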
The second change was Postmaster Tools data ingestion. We now pull compliance status and spam rate data from Postmaster Tools daily for every customer that has it enabled. The data is in a dashboard that highlights any negative change within 24 hours of it occurring. The dashboard has prevented at least three subsequent reputation incidents from compounding because we caught them inside the first day.
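The core of such a dashboard is a day-over-day comparison. The sketch below assumes the Postmaster Tools data has already been fetched into plain dictionaries; it does not call any real API. The status labels come from the incident description above, while their ranking, the field names, and the spam-rate threshold are assumptions made for illustration.

```python
# Status labels from the text; the ordering (higher is worse) is an assumption.
STATUS_RANK = {"compliant": 0, "warning": 1}


def negative_changes(previous, current, spam_rate_delta=0.001):
    """Return customers whose status worsened or whose spam rate rose by more
    than `spam_rate_delta` between two daily snapshots.

    Each snapshot maps a customer id to {"status": str, "spam_rate": float}.
    """
    flagged = []
    for customer, today in current.items():
        yesterday = previous.get(customer)
        if yesterday is None:
            continue  # no baseline yet, nothing to compare against
        worse_status = STATUS_RANK[today["status"]] > STATUS_RANK[yesterday["status"]]
        higher_spam = today["spam_rate"] - yesterday["spam_rate"] > spam_rate_delta
        if worse_status or higher_spam:
            flagged.append(customer)
    return sorted(flagged)
```

A customer that moves from compliant to warning is flagged even if the spam rate is flat, which matches the Gmail behavior we saw during the incident.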
The third change was pop diversity for high-stakes customer traffic. Several customers had been sending exclusively from RO1 because the latency to their primary audience was best there. We now run their volume across two pops with active-active routing, so a network event at one pop affects half of their traffic instead of all of it. The latency increase from this is 10-30ms depending on the customer’s geographic distribution, which is acceptable for the resilience benefit.
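One simple way to get an even active-active split is to hash each message ID and pick a pop from the result. The sketch below is an assumption about how such a split could work, not a description of our routing layer; the function name and the health-set parameter are hypothetical, and the pop names are only examples.

```python
import hashlib


def choose_pop(message_id, pops, healthy=None):
    """Pick a pop for a message by hashing its ID.

    `pops` is the ordered list of pops assigned to the customer. `healthy`,
    if given, is the set of pops currently considered usable; unhealthy pops
    are skipped so that traffic fails over to the remaining one.
    """
    candidates = [pop for pop in pops if healthy is None or pop in healthy]
    if not candidates:
        raise ValueError("no healthy pop available")
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket selector.
    bucket = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[bucket]
```

With both pops healthy, roughly half of the messages land on each, so a network event at one pop touches about half of the customer's traffic. If one pop is removed from the healthy set, everything flows to the other. Note that such a failover is itself a step change in volume, which is what the next change addresses.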
The fourth change was the routing change protocol. When we shift volume between pops for any reason, we now do it gradually rather than as a step function. A 40% reduction in sending volume from a pop done suddenly is itself a signal to receivers that something has changed. We now ramp such changes over six to twelve hours so the volume signature evolves smoothly. This is a counterintuitive lesson: the response to a reputation incident should not itself look like another reputation incident.
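A gradual ramp can be expressed as a list of hourly volume targets. The sketch below assumes a linear ramp in equal hourly steps; the actual ramp shape is not specified above, and the function name is hypothetical.

```python
def ramp_schedule(start_fraction, end_fraction, hours):
    """Return the target volume fraction at the end of each hour for a
    linear ramp from `start_fraction` to `end_fraction` over `hours` steps."""
    if hours < 1:
        raise ValueError("hours must be at least 1")
    step = (end_fraction - start_fraction) / hours
    # Round to keep the targets readable and free of float noise.
    return [round(start_fraction + step * hour, 4) for hour in range(1, hours + 1)]
```

For example, `ramp_schedule(1.0, 0.6, 8)` moves a pop from full volume to 60% in eight steps of five percentage points each, instead of a single 40% drop.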
The fifth change was the runbook for upstream network events. We have a documented sequence for what to monitor, what to mitigate, what to communicate, and on what timeline when an upstream issue causes any reduction in connection success rate. The runbook reduces the cognitive load on the on-call engineer during an incident.
What the customers learned
The post-incident conversations with the three affected customers were interesting in different ways.
The cold outreach customer wanted to know whether they should have been on a different pop in the first place. The honest answer was that our routing decision for them was reasonable given the latency to their audience, but we had not given enough weight to the resilience aspect. We offered them a routing change at no cost. They took it. Their delivery now goes through RO1 primary and HK1 secondary, which is more expensive for us but produces a more resilient experience for them.
The B2C newsletter customer wanted to understand why a momentary network event affected their open rates twelve hours later. We walked them through the reputation cascade. The conversation took an hour and was largely educational for them. They had not understood that reputation at major mailbox providers is a continuous signal rather than a static score. They appreciated the explanation and did not push for any compensation beyond the routing change we offered.
The transactional customer was the most frustrated. Their support tickets continued for another two days as users discovered confirmations in spam folders and reached out. They wanted us to commit to specific deliverability SLAs. We declined. Deliverability is a property of the receiver’s behavior, not ours. We control the inputs and we control the operations, but we do not control the output. We were transparent about the limitation. The customer accepted it but with some friction. They have stayed with us, but the conversation was uncomfortable.
What the upstream provider did
Our upstream provider in Bucharest acknowledged the BGP flap, attributed it to a memory pressure issue on their edge router, replaced the affected blade within 72 hours, and credited our account for the affected hours. The credit was a small dollar amount, certainly less than the cost of our triage and remediation. We did not push for more.
The interesting question is whether the upstream provider’s response was adequate. From a network-engineering standpoint, yes. Their root-cause analysis was sound, their remediation was prompt, their credit was contractually appropriate. From a deliverability standpoint, they have no awareness of the downstream impact and no responsibility for it. The fact that their 40-second BGP flap cost our customer three weeks of reduced campaign performance is not something they care about, and it is not in their commercial model to care.
This is one of the cases where the layers above the network make assumptions about the network layer that the network layer is not designed to honor. SMTP and reputation systems have evolved on top of network infrastructure that prioritizes statistical availability rather than the kind of continuous availability that reputation models effectively require. The mismatch is not anyone’s fault. It is structural. The remediation is to engineer around the mismatch at the layer that experiences the consequences, which is our layer.
What this generalizes to
The specific event was a BGP flap in Bucharest. The general pattern is that brief network-layer instabilities can produce long-tail reputation effects that are out of proportion to the duration of the instability itself. This generalizes to other event types: brief routing changes, brief packet loss, brief reachability issues from specific receivers, brief TLS issues. Any acute event that causes connection failures or quality changes can leave a reputational shadow.
The implication for operators is that monitoring needs to span layers. Watching only the network layer is insufficient. Watching only the application layer is also insufficient. The connection between them, the place where receivers form judgments about what they see, is where the consequences emerge.
We have not had another event like this in the year since. We have had other events, smaller in scale, that we caught in the first hour and remediated before any customer-visible impact. The change in monitoring and runbooks paid for itself within three months of the original incident.
If you operate any sending infrastructure at scale, the takeaway is to assume that your network layer will eventually have brief incidents that you will not predict, and that the reputational consequences may not become visible for hours after the network has recovered. Build the monitoring to catch the gap. Build the runbooks to act on it. The reputational cascade is real and it propagates faster than you would intuit.
We have published this post-mortem partly because we promised affected customers we would and partly because we think the lessons are useful to other operators. The full timeline, with all the timestamps and the customer-side data, is in our internal records. We do not publish customer-identifying data. The aggregate picture is what we share here.