Event-Level Data and Data Team Alignment with Snowcat Cloud’s João Correia
Discussing Open Snowcat, the evolution of event-level data, and lessons from building modern infrastructure
João Correia is the founder of Snowcat Cloud, a platform that gives companies full control of their behavioral event data through Open Snowcat, a fork of the analytics tool Snowplow. I’d encountered him a few times over the years in the Snowplow space and found him to have thoughtful opinions on data collection, data processing, and where the market is right now.
In this interview we discuss the evolution of data from derived statistics in black boxes to exposing raw event-level data, the implications of Snowplow closing its license, the challenges of unifying customer identity across devices, and how AI is reshaping the value of behavioral data. João also shares lessons on data team alignment, leadership from the C-suite, common mistakes in analytics strategy, and where he sees the future of modern data infrastructure.
Why don't you introduce yourself? You're an entrepreneur from Portugal living in San Diego. How did you get here?
João: Portugal is a really small market. One day I went on a trip with my wife, and as we were landing in San Francisco I looked out the window and asked her, "Why don't we live here?" On a whim I sent out two CVs — one to an East Coast company and another to a West Coast company, Blast Analytics and Marketing. I did something interesting to catch their attention: I used UTM parameters when I visited their sites that said, "This is João, I want to work there." They saw it and interviewed me via Skype — an interview with the owner and the analytics guy. At some point they're like, "Yeah, we like you." But at the time it was really small, like 12 people.
I said, “Well, now I'm gonna go there just to meet you guys and then we'll see.” It was Roseville, next to Sacramento. They said, "OK, we like you, come work here." But wait a minute, there's a necessary thing called a work visa, and mine had expired. The whole process took a year, and I started working at Blast in 2012. Blast mainly serviced Google Analytics 360 customers — it was called Google Analytics Premium at the time. It also did search engine optimization and other things. I started working with Tableau.
As soon as I saw Snowplow — and this was 2013 — I was hooked. I'm like, "What is this thing where I have access to the underlying data? I can run whatever queries I need in SQL? This is amazing." But it was 2013. Most of the digital analytics world didn't even know SQL. It was just people who knew the interface — marketers, obviously, non-technical people — and everyone who dealt with Google Analytics was just not geared toward dealing with raw data. But I found that really interesting, because you could dive deep into the questions and understand the actual definitions that were a little obscure in Google Analytics.
I contacted them, talked with Snowplow for a while, gave them some customers. Fast forward: I left Blast, went to work client-side, then tried an agency. That didn't work out very well. I was fired in nine months.
Nine months at an agency is more than some people make it.
João: Oh, man, I could not go back to derived analytics after working with raw event-level data, SQL, and other sources of data. I could not go back to the world of Google Analytics and Adobe anymore. It was just incompatible. The level of maturity was so low that I didn't have patience for that shit. Honestly I was so glad when I was fired. I literally thanked them.
“Honestly I was so glad when I was fired. I literally thanked them.” - João Correia
Yeah, one of my former employers had hundreds of ex-Google Analytics analysts. That was one of the great mismatches with the Snowplow data: those analysts often didn't have the technical skills, knowledge or sometimes even the desire to interface with raw event-level data. I tried to convince internal teams to build a product on top of Snowplow for them but the BI team feared losing control.
João: Yeah, I find that that has to come from the top, from the C-suite. It doesn't matter what other people want. This is a little bit of a fixed mindset, but here it goes: CEOs either have it or they don't. If they don't, you shouldn't even bother. It's like trying to feed beef to a baby with no teeth. It's not going to happen. You can't rely on potential. It's the same with romantic relationships: you can fall in love with potential, but potential is nothing. You have to deal with reality. You cannot deal with potential.
So you got fired from the agency. Then what?
João: I went to work at private equity. It was interesting because I was part of a group: I was on the data side, another person was on the marketing side, and another on engineering. Our role was to go into portfolio companies and help them up their game in those respective areas. The problem is private equity is very risk-averse. They don’t want to spend money, they want to save money. So anything that has to do with investment and change is not looked upon favorably.
I’ve interviewed with private equity and other finance firms. They sit on these vast piles of money, tens of millions of dollars, sometimes more. Yet you try to get them to modernize — say, "Maybe you shouldn’t be running your system on a COBOL program from 40 years ago?" and they look at you like you’re a leper. They’re like, “I’m not spending money on that. Everything’s in a spreadsheet. It’s fine.”
João: Exactly. They still make a shit ton of money because the time horizon is shorter: they acquire a company, stay for a couple of years, then package it and sell it again. It doesn’t really matter. It’s not like they want to keep the company for 10+ years. And they’re not the founder — so the spirit is completely different from an entrepreneurial spirit. My core traits don’t align well with that kind of environment. I’m more of a risk-taker. Otherwise, I wouldn’t be here.
Right. So how’d you go from private equity to Snowcat Cloud?
João: Well, during this whole time — way before 2017, even after leaving Blast — I started a consulting company. But consulting was very difficult because I had a job, and you can’t do a job and consulting at the same time; it's just too much. I focused on a niche, Snowplow, which at the time was only used by large enterprise customers. I ran that consultancy and had a couple of projects; it was not too bad. But then someone asked, “Hey, I’d like to run Snowplow but I don’t want to manage services.” I looked at the landscape and saw dbt democratizing access and knowledge and pushing people to learn SQL, and Snowflake making compute and storage cheaper; with dbt on top, I could see where this was going. So I said, "OK, there’s going to be more demand for raw event-level data. Let’s make this more accessible to people and create multi-tenant infrastructure." I set up a really ugly page, got the first customer, and it built from there. Lots of learnings, but that was the beginning of Snowcat Cloud in late 2019, and we've been growing ever since.
How would you describe Snowcat Cloud for people who’ve never heard of it?
João: It’s evolved quite a bit now, because Snowplow is no longer open source — they closed the license in 2024. We forked it right away and called it Open Snowcat. So if you want full control of your behavioral event-level data, you have options: you can run Open Snowcat yourself, or you can use our hosted version, which provides exactly the same thing and delivers the data into your warehouse without the hassle, at a fraction of the cost of running your own pipeline.
Got it. What are your backends? Snowplow is the collection tool on the front end, but if I recall correctly you’ve got a graph database/Neo4j backend. Is that the only one? Do you support multiple backends like Snowflake or Databricks? What does Snowcat Cloud’s backend look like?
João: It’s very similar to Snowplow. Essentially infrastructure on AWS. Obviously we have some mechanisms for routing events since it's a multi-tenant infrastructure. On the back end we have infrastructure to manage billing on top.
I understand the collectors, the Kinesis streams, the S3 buckets — you need that stuff to process the data. I'm thinking more of the data warehousing side: Neo4j, Databricks, Snowflake, etc. Once the data has been collected and stored, what's the lifecycle after that?
João: We deliver data into the customer’s preferred data warehouse. It could be an S3 bucket, it could be Snowflake, it could be Redshift — it doesn’t matter. Today any data warehouse or database you run should be able to ingest data directly from a Kafka stream, an S3 bucket, a Google Cloud Storage bucket, or an Azure blob. We deliver the data to our customers and expire it from our infrastructure after 48 hours, so we don’t retain anything.
Do you have a preference or opinions on these backend warehouse solutions? You’ve worked with Redshift, Databricks, Neo4j. What have you learned streaming the data to all these different warehouse options?
João: That's a good question. I see a lot of Snowflake, obviously, because it’s relatively inexpensive if it's run efficiently. But in picking one of these platforms, the first thing to understand is: what is the default stack of the company? In our case we use AWS, and we use Snowflake on top of AWS. But if you run GCP or Azure, you should stick with the platforms available for your cloud infrastructure so you don't disperse yourself technologically, which increases complexity. I would say: stay within your cloud and use whatever technology and platform you already have specific know-how for. From a data warehouse perspective, with Azure we see a lot of Databricks. With AWS, Redshift is increasingly rare. Most of our customers use Snowflake.
When I first spoke with you a couple of years ago you were big on graph databases and Neo4j as a backend, that it would eliminate the analyst's need for joins. Are you still pursuing that?
João: From a graph perspective, I was hopeful a couple of years ago. I was hopeful graphs would grow, because they answer very specific questions. But what I've seen instead is an abstraction layer of platforms, products, and services that sit on top of graphs and provide the answers people are looking for, because adopting a graph requires specific know-how and investment in software. I don't know why it hasn't taken off the way I was hoping it would.
I always thought it was an interesting value proposition because you're right that so much of the churn and burn on the analyst side is around joins. Like, "Where is the table? What are these multiple tables? What state are they in? How do I inner join them?" They keep evolving these SQL queries until they're so complex they make your eyes bleed. Eventually only two people at the company understand them and the queries get handed around like some secret prayer from a lost civilization. They're so complicated nobody knows what they actually mean.
The promise of graphs was, "Look, we're going to figure it out once. We're going to have no joins. Everything is just baked into the model and instead focus on the value." But you're right, I think people were afraid of implementing Neo4j. I think that it's a specialist skill. And I think, ultimately, the value isn't in the underlying technical infrastructure, it's in the questions that are being answered about the business. So the people who are trying to run the business don't really care if it's a graph or if it's Databricks or if it's Snowflake. They just want to know, "How am I making money? How am I losing money? What should I do?" That's my perspective. But is that what you think? What's your point of view?
João: Yes, I agree. I've actually seen a relatively new company pop up called PuppyGraph, which builds a graph from your relational database following a declarative model in YAML. You just spin it up, it spins up the graph, you write your queries in Cypher or Gremlin, ask the business questions that are better suited to a graph, and then shut it down. Very cost-efficient compared with Neo4j. And if you want to run it full time, in real time, 100% of the time, you can do that too. It keeps a mirror of your relational database in graph format, following the model you've declared, which is really interesting.
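The appeal here — asking multi-hop questions without stacking self-joins — can be sketched in a few lines of plain Python. This is a toy illustration with made-up customer/device rows, not how PuppyGraph's engine actually works: mirror the relational rows into adjacency lists once, then traverse instead of joining.

```python
from collections import defaultdict

# Hypothetical relational rows, as a CDP might store them.
device_links = [("c1", "d1"), ("c1", "d2"), ("c2", "d3")]
referrals = [("c1", "c2")]  # customer c1 referred customer c2

# Mirror the rows into a graph once, per a declared model.
graph = defaultdict(set)
for cust, dev in device_links:
    graph[cust].add(dev)
for src, dst in referrals:
    graph[src].add(dst)

def neighborhood(node, hops=2):
    """Everything reachable within `hops` steps. Each extra hop would be
    another self-join in SQL; here it is just a walk."""
    seen, frontier = set(), {node}
    for _ in range(hops):
        frontier = {n for cur in frontier for n in graph[cur]}
        seen |= frontier
    return seen

print(sorted(neighborhood("c1")))  # ['c2', 'd1', 'd2', 'd3']
```

The "declare once, walk many times" shape is the point: the two-hop question ("c1's devices, plus the devices of anyone c1 referred") never mentions a join.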
Yeah, when I worked on Databricks there was this desire to unify some of the customer data. The team tried running GraphFrames — I don't know if you've had exposure to this — the library developed jointly between Databricks and MIT that takes everyone's favorite graph algorithms from computer science and implements them in Spark.
The algorithms, of course, are time-tested and absolutely work (breadth-first search, depth-first search); they've been verified mathematically. The problem was the team would run them and get this output as a graph, but they were too afraid to store it in Neo4j because they didn't have any database experience, much less graph database experience. So they'd take the output, turn it into a table, and use joins. Maybe that graph was accurate on the one day you ran GraphFrames, but how are you going to incrementally add nodes if you have to rebuild the graph from scratch every day? I couldn't convince them they needed to store state in a graph.
João: Yeah, that's what PuppyGraph does. If you leave the graph engine up, it keeps rebuilding the graph, adding to it as you add new rows of data to your database. Pretty cool.
Another cool thing — when I first reached out to you, you had a Shopify plugin. I was looking to add Snowplow to Shopify, and at the time they didn't have a solution. I didn't want teams of developers adding a custom implementation, because we had too many Shopify sites managed by non-technical teams. People often choose Shopify because it's a turn-key solution — you don't need to know anything technical to run a store — but all of a sudden here's a technical task, and those teams are not gonna do it. I said, I am not gonna be training non-technical teams to implement behavioral analytics. I'll never get any sleep. That's when I first saw your Shopify plugin. How many platforms have you expanded Snowcat Cloud to? Is it just Shopify? Are there other platforms you feel are worth your time?
João: Yes, we created an app for Shopify and another for Magento. It's not that we get a lot of direct customers from them, but we feel it's important for adoption of the platform by serious retailers that wanna own their behavioral data — whether they use Snowplow, Snowcat, or Open Snowcat, it doesn't matter, because those same applications are compatible with Snowplow too. So we're putting ourselves out there as facilitating the implementation of Snowplow or Open Snowcat on a given e-commerce website.
Very cool. Yeah. I gotta ask you about AI. How is AI changing your business?
João: Currently it is not affecting our business, but I do believe that for our customers, and for anyone who collects behavioral data, running AI on top of that data is extremely valuable. You can automate the detection of patterns for recommendation engines — which you could before — but now you can use AI to further customize the customer experience based on behavior. So it's good news for us: AI is not a disruptor of our business, it makes our business even more important. If you wanna run AI, you need behavioral data.
I do think there's a big challenge ahead, which is tying multiple devices to a single individual. With increased privacy regulation, and with technology like Apple's iCloud Private Relay and browsers blocking or expiring cookies, it's increasingly hard to consolidate behavior across multiple devices and tie it to an individual. And I think this is where a lot of platforms miss: they're not actually doing customer-oriented activities, they're doing device-based activities, because they haven't fully thought out how to join all of those breadcrumbs from different devices into a single customer.
Totally agree. At a previous employer I tried to guide the team into doing ID unification, but the team wanted credit before they ever accomplished anything. Instead they took all these huge data sets, some of which were semi-anonymous or completely anonymous, and started atom-smashing them together with inner joins. I said, "Guys, this is the worst way to go about this, because not every data set is equal. There's gonna be noise. And if you have a 3% error rate in a big data set and you do an inner join, the errors will be multiplying. Your errors are gonna be having children. Your final data set will be so messy and noisy you can't use it." But they did it anyway, and the data was so error-ridden that internal stakeholders refused to use it. Leadership didn't want to hear about failure because they thought it would reflect poorly on them. Even though it didn’t work, we just had to sit there and pretend that everything was amazing.
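The "errors having children" intuition is easy to put numbers on. A back-of-envelope sketch, assuming each dataset independently mislabels about 3% of its records and that a joined row is only clean when every input row is:

```python
# Assumed for illustration: each joined dataset has an independent
# 3% per-record error rate; a joined row is clean only if all inputs are.
per_dataset_error = 0.03

for k in (1, 2, 4, 8):
    tainted = 1 - (1 - per_dataset_error) ** k
    print(f"{k} dataset(s) joined: {tainted:.1%} of rows touched by error")
```

With eight sources inner-joined, more than a fifth of the surviving rows carry at least one upstream error — and that's before counting the legitimate rows an inner join silently drops when identifiers fail to match.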
But speaking of ID unification I saw on your website you still offer some fingerprinting, but I think when we talked last, you talked about how Apple is -
João: Yeah, we're actually going to discontinue that, because we haven't had the time to develop it. I've even thought about open-sourcing all of that stuff. It has some serious backend magic happening: it uses cosine similarities, does searches, has a server-side component and a client-side component. I'm thinking about what to do with it, actually, because currently I don't think it's even working — browsers keep evolving, and so fast. I think we got it to a point where it was working, but then after a few months Safari changes something, and we'd always have to be constantly adapting. We just haven't invested much in fingerprinting at this point.
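João mentions cosine similarity as the core of the matcher. As an illustration of the idea only — the feature encoding here is invented, not Snowcat Cloud's implementation — encode each visit's fingerprint signals as a vector and score pairs of visits by the angle between them:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical fingerprint vectors: each slot marks the presence (1) or
# absence (0) of a signal -- a font, a canvas quirk, a codec, etc.
# Features must be encoded so no single dimension dominates the angle.
visit_a = [1, 0, 1, 1, 0, 1]
visit_b = [1, 0, 1, 1, 0, 0]  # likely the same device, one signal changed
visit_c = [0, 1, 0, 0, 1, 1]  # likely a different device

print(round(cosine_similarity(visit_a, visit_b), 3))  # 0.866
print(round(cosine_similarity(visit_a, visit_c), 3))  # 0.289
```

This also shows why the approach is fragile in the way João describes: when Safari changes what signals are observable, the vectors shift and whatever similarity threshold separated "same device" from "different device" has to be re-tuned.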
I find the customers are shifting. Back in 2021, 2022, Covid pushed so many companies to do more — we were seeing smaller shops wanting to create their own CDPs with dbt, Snowflake, Open Snowcat, all that stuff. Now we're seeing software-as-a-service built on top of our infrastructure — on top of Open Snowcat, on top of Snowcat Cloud — doing all of that, the fingerprinting, the identity resolution, everything in the backend, and just delivering results for e-commerce owners. Because they don't want to own a data team, or they don't have the scale to own a data team. It's like a Formula One team — what is it, McLaren Mercedes? They don't make the engines; they use somebody else's engines because they don't have the in-house capacity to build engines at that level. So we're seeing more and more of that, rather than smaller clients trying to do this by themselves, because it's just too expensive. And the other thing is, because people have done this so many times, it's been productized. Identity stitching — it's been productized. Attribution — it's been productized. You do it once, you sell it to multiple customers, you have a SaaS.
“Identity stitching, it's been productized. Attribution, it's been productized.” - João Correia
What's the biggest mistake your customers make, and on the flip side, what's the best thing they're doing?
João: Our customers are very mature. So I don't see them making big mistakes because they are mature, they have a data team. They know what they are doing. What I see is some potential customers wanting to adopt this without the data team. And I'm usually very honest and say, "If you do not have a data team, this is just not for you." I can recommend them to some of our customers that use our infrastructure to productize things that they would be doing from scratch.
I totally agree with that. Even when you have a data team, if they don't have all the right skills it won't work. Where I often saw gaps was on the data modeling and BI side. Engineers are very good at data collection and processing, but once you try to connect it to the end user, they struggle to connect with business analysts. The skill set that helps them write PySpark notebooks in Databricks isn't necessarily the same skill set that lets them ask a business analyst, "Hey, what do you need? What answers your questions?"
Business analysts end up circumventing engineers because they feel like they're not getting the support they need. Then they write these crazy queries or do their own data processing, which is even worse. There's some Rube Goldberg machine on someone's computer somewhere generating reports for the CEO. I mean, that's not a hypothetical — that was a real thing I experienced. An analyst was out for a week and suddenly the CEO couldn't get his bespoke report. So I totally agree with that; I think that's a really important insight.
João: The one thing I would say, and I've seen this throughout my career, even with companies that have data teams, is not having a CEO that provides proper direction. I see CEOs as the captains of a sailboat. When they say tack, everyone knows exactly what to do and everyone tacks. But what I see is some companies have these siloed teams that work alone and not in alignment with what the leadership is doing or what the leadership goals are.
I've also seen data teams become reporting monkeys, where, "Hey, we have a data team! Yay, now we can ask a gazillion questions to these guys. What is the angle of the moon with the sun when it rains? What is the impact on sales? Yay, let's spend two weeks having meetings and discussing this."
“I've also seen data teams become reporting monkeys, where, 'Hey, we have a data team! Yay, now we can ask a gazillion questions to these guys. What is the angle of the moon with the sun when it rains? What is the impact on sales? Yay, let's spend two weeks having meetings and discussing this.'” - João Correia
I agree. Suddenly there's this infinite series of side quests to find a bunch of signals that look good on a deck but don't actually answer core business questions. There really is a key need for that person who understands the business, understands strategy, and then can talk to the data engineers and say, "Look, this is what we're trying to understand. I'm going to need these values in this format."
I agree with you on the silos too. There's Conway's Law, which says the output will match the corporate structure: if you've got three teams working on a compiler, you get a three-pass compiler. That's absolutely what I saw. We had analyst teams spread across the company, and each just did their own thing. There was no unity of purpose. It was very difficult to service all those different groups of varying levels of influence demanding different things.
João: Yeah, I've been on a journey. This is something I'd actually like to work on in the near future: providing a framework for alignment between executives — the CEOs, the captains — and their teams, and for how they can use a data team, a data provider, or a consulting firm to support that and check that everyone is rowing in the right direction.
Thanks for your time.