The 3 Dumbest Mistakes We Made Migrating from Google Analytics to Snowplow (Part 3)
Let our pain be your gain
Now that we’ve gotten past the problems and the solution, let’s talk about mistakes: Snowplow’s, mine, Condé’s.
Worst execution error: data loss during our biggest event
We went down hard during our biggest event and permanently lost data.
Ouch.
As I mentioned in Part 2, you can pay Snowplow to manage your back-end cloud infrastructure for you. That’s what we did at Condé Nast.
For the most part it worked out great: we had fine-grained control of our data collection, we could route data wherever we wanted and let Snowplow Inc. worry about Kinesis streams malfunctioning in the middle of the night.
…except when our Snowplow infra went down on Condé’s biggest night of the year: the Met Gala.
If you’re not chronically online, the Met Gala is an exclusive annual event where ~600 snobs, er, carefully selected people walk a red carpet and party at the Metropolitan Museum of Art at $75K a seat.
The event is livestreamed on Condé’s websites (and elsewhere), plus fashion looks and celebrity sightings are liveblogged. It’s by far Condé’s biggest traffic day and most important cultural event of the year.
Normal traffic on a non-Met Gala day to Condé’s portfolio of sites would be around 300 million Snowplow payloads (not pageviews: payloads, which include pageviews but also custom events, page pings, essentially anything sent to the collector). But on May 6, 2024, the evening of the Met Gala, we’d receive 1.5 billion payloads, 5x the normal traffic.
At least, that’s what we were able to verify, because our Snowplow-managed Snowplow pipeline crashed during the event.
What happened
We needed to upsize our pipeline to handle the 5x surge in traffic. We communicated that to Snowplow well in advance of the event in a Zendesk ticket.
However, the Snowplow DevOps engineer in charge that day opted to rely on autoscaling instead of manually upsizing our components in AWS.
Bad move.
Kinesis, the AWS service used to stream high-volume requests, has a service limit that only allows upscaling 5x within a single 24-hour period. We quickly hit that limit, the Kinesis stream backed up (specifically: PutRecords calls failed due to insufficient shard capacity), and the servers that receive payloads could no longer write to the stream. Overwhelmed, they started to reject new payloads and we started to lose data.
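For anyone planning around a similar surge: the manual pre-scaling we’d asked for isn’t much code, it just has to happen days in advance. Below is a rough sketch using the AWS SDK for JavaScript; the stream name, region, and shard counts are invented for illustration and are not Condé’s or Snowplow’s actual configuration.

```typescript
// Hypothetical sketch: pre-scale a Kinesis stream ahead of a known traffic spike
// instead of trusting autoscaling to keep up during the surge.
import {
  KinesisClient,
  DescribeStreamSummaryCommand,
  UpdateShardCountCommand,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "us-east-1" });

async function preScaleForEvent(streamName: string, targetShards: number) {
  // See how many shards the stream has today.
  const summary = await kinesis.send(
    new DescribeStreamSummaryCommand({ StreamName: streamName })
  );
  const current = summary.StreamDescriptionSummary?.OpenShardCount;
  if (!current) throw new Error(`Could not read shard count for ${streamName}`);
  console.log(`${streamName}: ${current} open shards, target ${targetShards}`);

  // UpdateShardCount can only roughly double a stream per call and is itself
  // rate-limited over a rolling 24-hour window, so step up gradually over
  // several days rather than during the event.
  const nextStep = Math.min(targetShards, current * 2);
  await kinesis.send(
    new UpdateShardCountCommand({
      StreamName: streamName,
      TargetShardCount: nextStep,
      ScalingType: "UNIFORM_SCALING",
    })
  );
  console.log(`Requested resharding of ${streamName} to ${nextStep} shards`);
}

// e.g. run a week out, then again the next day, until the target is reached
preScaleForEvent("snowplow-raw-good", 256).catch(console.error);
```

The point isn’t the code, it’s the calendar: because of the doubling and 24-hour limits, resharding has to happen in steps, well before the traffic arrives.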
After some frantic calls to AWS we were able to get the limit increased but by then the event was over and the crush of traffic had already passed.
That’s not all. In the midst of it I tried to organize a Zoom with the Snowplow DevOps engineer but was told they were logging off and to wait until the next guy was online, which would be soon.
A forensic analysis afterward estimated we “only” lost 1 million payloads out of 1.5 billion, but the real damage was to Snowplow’s reputation at the company. The system was still relatively new and not fully trusted, and it went down during our most important event.
The good news
Snowplow Inc. got a lot better afterward at managing tentpole events. They created special tickets just for them. They created written playbooks. They did dry runs and manually upsized the infra to the specified size a week ahead of time to see if any problems emerged. They promised hourly updates on traffic levels. In 2025 the Snowplow portion of the Met Gala went totally fine.
(Of course, while Snowplow worked this past May Google randomly decided to de-index Condé articles during the Met Gala. I was told Anna Wintour herself had to call Google CEO Sundar Pichai to intervene, and that Google later claimed it was a mistake. If it’s not one thing, it’s another.)
Every engineering org has outages. Last month Google triggered a global GCP outage with a null pointer exception. I don’t think any IT professional has forgotten 2024’s CrowdStrike outage. This was just an execution error, not a structural one, and as far as I’m concerned Snowplow learned from the mistake and is better for it.
My biggest failure: subdomains
The biggest screwup I personally committed was following Snowplow’s advice and creating subdomains to receive Snowplow payloads instead of using a CDN reverse proxy.
Let me explain.
When Snowplow collects data from a user it needs to send that payload somewhere.
Part of that payload will include cookies, aka files on your local computer that store useful information like consent, user IDs, etc. Generally cookies are domain-specific: if edwarddistel.com sets a cookie, then Chrome, Safari and Firefox won’t allow badguy.com or any other website to read that cookie or the data contained within it.
To make sure Snowplow could collect those cookies, the Snowplow team recommended setting up separate subdomains for each brand website. So payloads for www.vogue.com would be sent to c.vogue.com. Lower environments would use qc.vogue.com. Since they’re on the same root domain (vogue.com), one subdomain (c) would be allowed to read cookies from the other (www).
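From the page’s point of view, that setup looks roughly like the sketch below, using Snowplow’s browser tracker; the endpoint, app ID, and option values are illustrative, not Condé’s actual configuration.

```typescript
// Minimal sketch of the subdomain setup from the browser side.
import { newTracker, trackPageView } from "@snowplow/browser-tracker";

newTracker("sp", "https://c.vogue.com", {
  appId: "vogue-web",
  // Scope the tracker's first-party cookies to the root domain so that both
  // www.vogue.com (the site) and c.vogue.com (the collector) can see them.
  cookieDomain: ".vogue.com",
  cookieSameSite: "Lax",
  cookieSecure: true,
});

trackPageView();
```

The collector behind c.vogue.com does the mirror image on the server side: its Set-Cookie response header is configured to scope its own ID cookie to .vogue.com as well.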
Snowplow offered to create all the records for us but that would require ceding control of our DNS to them. Sure sure, let me do that — and while I’m at it, would you like control of all my bank accounts? My email? Maybe the keys to my apartment?
No. Instead I created over 500 DNS records for subdomains myself. That was a big mistake, and I should have known better.
Apple doesn’t like your cross-domain footsie
The main problem was Apple’s Intelligent Tracking Prevention (ITP). Introduced in 2017 to limit the effect of third-party cookies, it has been updated 10 times over the years (2.7 is the most recent), but 2.1 was the most seismic: if you don’t revisit a website within 7 days, the client-side cookies associated with that site get wiped out.
Sidebar on cookies:
Client-side cookies can be read by the browser (aka the client), meaning any JavaScript running on the page can read whatever data is stored in them. That includes third-party scripts loaded onto the site, so they’re considered insecure.
Server-side cookies (confusingly named, since they physically reside on your machine) are set by the server, typically with the HttpOnly flag, and can only be read by the server, not the browser. When your web browser requests a web page it transmits these cookies to the site, but once the data is returned and the site is rendered locally on your machine, your browser forbids page scripts from reading those values.
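If that distinction feels abstract, here’s a toy Node sketch (not Snowplow’s collector, just an illustration): the HttpOnly flag is what tells the browser to keep a cookie away from page scripts.

```typescript
// Toy illustration of the client-side vs. server-side cookie distinction.
import { createServer } from "node:http";
import { randomUUID } from "node:crypto";

createServer((req, res) => {
  // A "server-side" cookie: the browser sends it back on every request to this
  // origin, but HttpOnly means document.cookie can never read it.
  res.setHeader(
    "Set-Cookie",
    `server_uid=${randomUUID()}; Max-Age=31536000; Path=/; Secure; HttpOnly; SameSite=Lax`
  );
  res.end("ok");
}).listen(8080);

// A "client-side" cookie, by contrast, is written by JavaScript on the page
// and is readable by any script running there:
//   document.cookie = "client_uid=abc123; max-age=31536000; path=/";
```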
Snowplow generates several user IDs, including a client-side one and a server-side one. But due to Apple’s ITP rules these user ID cookies were regularly getting wiped out in Safari:
Expiration of cookies set with Set-Cookie HTTP response headers is 7 days at most, if the response originates from a subdomain that has a CNAME alias to a cross-site origin, or if the subdomain is configured with A/AAAA records where the first half of the IP address does not match the first half of the IP address of the website the user is currently browsing.
In my case I had set up “A alias” records (which you can do in AWS), but the first two octets of the IPv4 addresses behind the two subdomains (www.vogue.com and c.vogue.com) did not match, even though both were served through AWS CloudFront. AWS does not guarantee that different distributions resolve to the same PoP, even for the same user in the same location.
In English: because I routed Snowplow payloads to a separate subdomain instead of the same brand domain (e.g. www.vogue.com) the highly valuable Snowplow user ID cookies were getting vaporized after a week for Safari users (something like 40% of the total audience).
So I created a new behavior in CloudFront off the main distribution and used it as a reverse proxy. That way payloads generated on www.vogue.com would be transmitted to a route on www.vogue.com itself, and the user ID cookies would persist for a year instead of a week.
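In AWS CDK terms, the fix looks roughly like this; the path pattern and hostnames are placeholders, not Condé’s real setup.

```typescript
// Rough sketch: add a behavior on the main www distribution that reverse-proxies
// a dedicated path to the Snowplow collector, so its cookies are first-party
// to www.vogue.com.
import { App, Stack } from "aws-cdk-lib";
import * as cloudfront from "aws-cdk-lib/aws-cloudfront";
import * as origins from "aws-cdk-lib/aws-cloudfront-origins";

const app = new App();
const stack = new Stack(app, "WwwStack");

// The existing www distribution, simplified to a single default origin here.
const distribution = new cloudfront.Distribution(stack, "WwwDistribution", {
  defaultBehavior: { origin: new origins.HttpOrigin("www-origin.example.com") },
});

// The reverse-proxy behavior: e.g. the tracker posts to https://www.vogue.com/sp/...
distribution.addBehavior("/sp/*", new origins.HttpOrigin("collector.example.com"), {
  allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,
  cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,
  // Pass cookies, query strings, and headers through to the collector untouched.
  originRequestPolicy: cloudfront.OriginRequestPolicy.ALL_VIEWER,
});
```

On the tracker side, the collector URL then points at www.vogue.com itself, with the JS tracker’s postPath option steering event POSTs under the proxied path.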
Returning user IDs are valuable because the more a site understands user behavior, the better it can slot that user into a more lucrative ad segment.
Biggest overall failure: lack of data modeling
The first mistake was an execution error, the second a technical one; both, however, got fixed. The third and most significant mistake still hasn’t (at least, as of my layoff 3 weeks ago).
You can implement all the fancy technical hocus pocus you want when collecting and processing data from your website but at the end of the day it needs to have an impact on running the business to matter.
Enter data modeling.
We counted how many individual dimensions (data points) existed on every single Snowplow record: ~2,200. Granted, not every field was populated for every record, but that’s still a ton of data to sort through.
Condé’s biggest existential threat right now, like that of most digital publishers, is Google’s AI Overviews, which are killing traffic. Why click through to a result when Google just answers it for you?
There’s even a rumor circulating that Condé may be sold to Jeff Bezos.
During my time at Condé I could never successfully convince executives to invest in data modeling. They hired an agency from Bangladesh but that produced nothing. They promoted an analyst into a product role and she quit after a few months. Instead the company took all those 2,200 fields I mentioned and just threw them into two tables:
a “gold” pageviews table
a “silver” core events table
(bronze-silver-gold refers to the Databricks medallion model)
Each of these tables had hundreds of columns. How was any analyst or business user supposed to go from 2,200 columns (90% of which had little value in isolation) to finding signals in traffic and engagement patterns to develop a strategy in an era when Google traffic is plummeting?
Only super analysts deeply knowledgeable about advanced SQL and all the intricacies of each individual field (what it is, how it’s collected, how accurate it is) could hope to produce valuable insights, and even for them the process was glacially slow.
Why was Condé never able to model its data? Three reasons, IMHO:
Bottom-up project plan left competing tools in place
When launching a new tool it’s rarely a good idea to do a hard cutover from the old tool. Usually you launch the new tool in parallel, confirm it’s working to your satisfaction, then decommission the old tool.
Makes sense. The right choice most of the time. But not in this case.
Google Analytics and the homegrown Sparrow tool weren’t the only games in town. Many editorial staff loved Parse.ly, yet another data collection and reporting tool. Specifically, they loved its “real-time” content performance analytics, something Parse.ly put a lot of engineering effort into.
I saw it at NBC News too — editors spend days, weeks or months writing a story and when they hit publish they will frantically hit reload on their browser every 5 minutes to see how it performs. And why shouldn’t they? The first 24 hours after publish is when a story typically gets the most pageviews and social traffic.
We never built a dashboard to display similar metrics so editors had no reason to switch over.
I once created an internal facsimile of the Parse.ly dashboard with Snowplow data to show people we could give editors the data they wanted in the format they wanted, but I couldn’t get anyone interested. Why?
Over-investment in existing BI tools
Condé used Qlik, which it white-labeled internally as DASH. The BI team had used it for years and had done a lot of modeling on top of what was in the data warehouse to make reports intelligible.
Having spent so many years using Qlik and having built so much logic on top of the largely un-modeled data, the BI team didn’t want to pivot to new tools or give up control of Qlik.
Some Qlik reports were popular (especially the affiliate revenue dashboard) but most were not; analysts often just circumvented the platform entirely. The ads group moved a bunch of their data into Omni. Another analyst had a homegrown Rube Goldberg process for manually extracting data and doing some transformations before emailing it around; when he went on vacation, someone had to explain to the executive leadership team why the reports had suddenly disappeared.
I created a north-star architecture arguing that a big chunk of our business analytics should be moved into a dedicated OLAP database. I said let’s make the data available headlessly and let people build their own reports and visualizations, but I couldn’t get momentum or buy-in beyond vague agreement.
Corporate structures and Conway’s Law
Analysts were spread throughout the 8,000-person organization. There were ones embedded inside the technology group, ones inside the revenue group, ones inside the brands, analysts everywhere.
Conway’s Law basically says the output of an organization will match the org’s structure. The classic example: put together 3 teams to build a compiler and you’ll get a 3-pass compiler.
If one group had owned analytics under a single executive, it could simply have decreed what the model should look like; dispersed across executives, it was impossible to herd the cats.
Takeaways
My advice to all others looking to migrate analytics tools: start in reverse.
Step 1: What business insights are you trying to get from your analytics reporting?
Step 2: What data model would you need to construct to support that?
Step 3: How can you collect and process that data?
You’ll have outages. You’ll make bad technical decisions that need to be reversed or fixed. But getting value out of your data is the single most important thing you can do with it, and it should be the starting place of your journey, not the end.
Next week
No more Snowplow or data analytics for now. I’m starting a new, quasi-monthly series called Horrible Bosses, where I’ll describe some of the, um, interesting management styles I’ve experienced over the years.