Migrating an A/B Testing API to Google Cloud: Zero Downtime and a Non-Blocking Cache
In 2019 I led the migration and performance work for our company-wide A/B testing platform — the system product teams use to run experiments and read experiment assignments in production. This post is a concise case study: why downtime was unacceptable, what we found in pre-migration review, how we fixed blocking cache refreshes with Java and Guava, and how we rolled out to Google Cloud without taking the API offline.
Product context
The platform has two main pieces:
- REST API — on the order of 112 million requests per day from product teams across the company.
- Admin tool — where product managers configure experiments, set traffic splits (for example 50% / 50%), and manage rollouts and rollbacks.
If the API is unavailable, teams cannot run experiments or trust measurement in production. That set a high bar for the migration.
The challenge
We needed to move this API from on-premises infrastructure to Google Cloud with zero tolerance for downtime: a hard outage blocks experimentation and reporting for everyone.
During pre-migration code review, I found a serious performance problem tied to our cache policy. Every 30 minutes the service refreshed configuration from Oracle with blocking queries. During those windows, servers could become unresponsive, latency spiked, and we occasionally failed to return experiment configuration to callers. That behavior was incompatible with a safe cloud cutover at scale.
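The problem looked roughly like the following sketch (class and method names here are illustrative, not the real code). With Guava's `expireAfterWrite`, an expired entry is evicted, so the next caller pays the full database round-trip inline, and concurrent callers for the same key queue up behind it — exactly the stall we saw every 30 minutes:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class BlockingConfigCache {
    // Hypothetical stand-in for the Oracle query; sleep simulates DB latency.
    static String loadFromOracle(String experimentKey) {
        try { Thread.sleep(200); } catch (InterruptedException e) { }
        return "config-for-" + experimentKey;
    }

    static final LoadingCache<String, String> CACHE = CacheBuilder.newBuilder()
            .expireAfterWrite(30, TimeUnit.MINUTES) // entry evicted after 30 min
            .build(new CacheLoader<String, String>() {
                @Override public String load(String key) {
                    // After expiry, the next request blocks HERE on the
                    // database call; concurrent requests for the same key
                    // wait for it to finish.
                    return loadFromOracle(key);
                }
            });

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        String v = CACHE.getUnchecked("checkout-banner");
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(v + " loaded in ~" + ms + " ms (caller paid the DB cost)");
    }
}
```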
Technical solution
I implemented a Guava-based asynchronous cache loader with three ideas:
- Stale-while-revalidate — Serve cached data immediately (it may be slightly stale) while a background refresh runs against the database.
- Non-blocking requests — Incoming traffic never waits on the refresh; only the async loader pays the Oracle cost.
- Local cache per instance — Each node held its own cache. We discussed Redis as a shared layer, but roadmap and delivery pressure pointed to local caching first.
This aligns with the pattern I wrote about earlier on avoiding blocking calls when loading a cache in Java — here it was not theoretical; it was a prerequisite for a reliable migration.
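The three ideas above map onto Guava's `refreshAfterWrite` plus an asynchronous `reload`: a read that crosses the refresh threshold returns the existing (possibly stale) value immediately and schedules a background reload, so no request thread ever waits on Oracle. A minimal sketch, with illustrative names standing in for the real service code:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListenableFutureTask;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NonBlockingConfigCache {
    static final ExecutorService REFRESH_POOL = Executors.newFixedThreadPool(2);

    // Hypothetical stand-in for the Oracle query; sleep simulates DB latency.
    static String loadFromOracle(String experimentKey) {
        try { Thread.sleep(200); } catch (InterruptedException e) { }
        return "config-for-" + experimentKey;
    }

    static final LoadingCache<String, String> CACHE = CacheBuilder.newBuilder()
            .refreshAfterWrite(30, TimeUnit.MINUTES) // trigger refresh, don't evict
            .build(new CacheLoader<String, String>() {
                @Override public String load(String key) {
                    // Only the very first request for a key blocks.
                    return loadFromOracle(key);
                }
                @Override public ListenableFuture<String> reload(String key, String oldValue) {
                    // Stale-while-revalidate: readers keep getting oldValue
                    // while this task refreshes the entry off the request path.
                    ListenableFutureTask<String> task =
                            ListenableFutureTask.create(() -> loadFromOracle(key));
                    REFRESH_POOL.execute(task);
                    return task;
                }
            });
}
```

The key design choice is overriding `reload`: Guava's default `reload` simply calls `load` synchronously, so handing the work to an executor is what makes the refresh non-blocking.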
Migration strategy
We used an incremental rollout over about a week:
- Day 1: 10% of traffic to Google Cloud, 90% still on-premises.
- Deployments at 6:00 AM in the lowest-traffic window.
- Daily increases in the cloud share, with Datadog dashboards as the gate for the next step.
- Close work with DevOps, who owned the Google Cloud footprint, so networking, capacity, and runbooks stayed aligned.
We did not advance the percentage until metrics looked healthy at the current slice.
Trade-offs
Redis vs. local cache — A shared Redis cache would have been a clean central abstraction. We chose local Guava caches because the team was product-led, time was tight, and we wanted less operational surface during a risky migration. We still got a large win: no blocking refresh and predictable request paths.
Stale data — Experiment configuration could tolerate a ~30 minute freshness window (our policy already implied that). Trading a bounded staleness for no blocking and better availability was the right call for this use case.
Org context
This was technical debt inside a product org: the PM was rightly focused on customer-facing roadmap. I had to negotiate explicit time for infrastructure work so we did not ship the migration on top of a known performance foot-gun.
Impact
- Zero downtime across the migration.
- Removed the refresh-induced stalls that had been hurting the fleet.
- Migrated the full 112M requests/day workload to Google Cloud.
- Sustained availability for teams running experiments (our bar was no customer-visible regression from the migration itself).
- A repeatable pattern for later high-traffic API moves: review before cutover, fix the hot path, then ramp traffic with observability gates.
Stack
| Area | Technologies |
|---|---|
| Backend | Java, Guava Cache |
| Database | Oracle |
| Cloud | Google Cloud |
| Monitoring | Datadog |
| Delivery | Incremental rollout with DevOps |
This post is based on a real migration and optimization effort; figures and timelines reflect the project as I remember it.