
Democratizing data: From enablement to data debt

Making data accessible to everyone sounds awesome, right? That’s what it means to democratize data in an org - to enable & empower every team member to make data-driven decisions without needing to be a tech wizard. For example, someone from finance & operations should be able to access user consumption metrics to understand how they drive revenue, without worrying about where the data comes from, how to fetch it, or which complex tools to use to view it. Sounds amazing until it doesn’t.

💡 Data literacy ≠ data democracy, although the two overlap.

So what exactly is the problem with this & when does it arise?

Let’s take a deeper look. If you intend to democratize data, it is very likely that you plan to use a data warehouse with a query engine, commonly known as an analytical database. And with this, you most probably also intend to enable self-serve BI. Essentially, you want data to be available to every team, at all times, to build those dashboards that refresh daily & show the latest trends. And this is exactly where the mismanagement of data begins. We believe we’re democratizing data & making the best use of it, but under the hood, while moving fast, we’re creating a massive rotting ground of thousands of dashboards, tables, views & pipelines that, soon enough, no one will be using. Basically, we’re creating data debt.

🌡️ Hot take - Investing in analytics & BI without investing in data teams is like ideation without execution (a mere delusion).

Additionally, if your org is building rapidly and chasing ideal values of north-star metrics, it is very likely that you’ll have multiple databases & microservices, thousands of events & lakhs (hundreds of thousands) of tables stored in your warehouses. Of course, we always want access to all the data our users / services generate, & we store it because, as a data-first company, we’d want to consume it “sometime”. But do we end up using it at all?

Realization: data debt

“Do we use all the data we store?” boils down to “do we need it at all?” As the data platform team (read: the horizontal “data” team) supporting a couple of rapidly scaling products, we realized that we actually used (& needed) only 20% of the data we stored - and that covered the needs of all end consumers: data science, analytics, product, experimentation, growth & financial decision making.

And so, we realized we had been accumulating data debt for as long as the products existed. We’re not even talking about the quality of data here, just plain redundant storage & compute. To top it off, we were using BigQuery’s physical storage billing to store our data - user-generated data, events data, analytical data, model training data, “temp” data & what not. For BI, we were using multiple BI tools - mostly open-source - so everyone was used to creating & abandoning dashboards. In just 2-3 years of democratizing data, we were deep in data debt, paying for what we didn’t need or use at all: orphaned tables, views & dashboards nobody cared about anymore.

So what did we do? We had to start somewhere, & so our journey towards “clearing data debt” began.

Clearing data debt: The unspoken responsibility of data teams

The first step to clearing data debt is simply getting rid of the stuff we don’t need anymore. Data quality, observability, literacy & cataloguing are still secondary. (More on this soon.)

Oh, did you know that most data teams have to take on the role of garbage collectors and cleaners, apart from the occasional data-hero moment? I won’t ever loudly acknowledge that it accounts for >70% of my workload because, ✨ why should I when I can hide it under the mask of cost savings ✨

Here’s what we did, and are doing:

  1. First things first - Some analysis

    We identified which tables were not being used, & for how long, by leveraging warehouse metadata (a minimal sketch of this query - & of the deletion in step 2 - follows this list). We conducted alignment sessions with end consumers to understand what data they use on a daily basis, what data they intend to use in the future, & what data they store for “just in case” edge cases. Everything outside this was not needed.

  2. Horizontal approach - Delete ALL

    We deleted everything unaccessed. After identifying unaccessed tables, views & dashboards, we deleted them all - a one-time cleanup, so a one-time effort plus immediate cost savings.

  3. Creating systems & policies

    A one-time cleanup & some org-wide education do not necessarily ensure suboptimal practices won’t return in the future. So we worked on storage policies defining retention periods for data stored as tables & views, & enforced them vertically - dataset by dataset, team by team (see the retention sketch below). For teams that needed extended data retention, we chose GCS cold storage, which is much cheaper than BQ physical storage.

  4. Creating automated processes so we don’t have to be garbage pickers daily

    We created a bunch of automated processes that systematically & regularly delete unaccessed data. By purging unnecessary data, we could streamline our data management, optimize resource allocation, and ultimately also improve data quality. We call these automated processes BQ bots & watchers 😆 (the last sketch below shows the pattern).
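To make steps 1 & 2 concrete, here’s a minimal sketch of the kind of metadata analysis we mean, using the google-cloud-bigquery client. The project id, the US multi-region, and the 90-day “unused” window are placeholder assumptions, not our exact setup - and note that INFORMATION_SCHEMA.JOBS only retains ~180 days of history, so it can’t see older access patterns.

```python
# A sketch of steps 1 & 2, assuming the google-cloud-bigquery client,
# a US multi-region warehouse, and placeholder names/thresholds throughout.
from google.cloud import bigquery

PROJECT = "my-analytics-project"  # hypothetical project id
LOOKBACK_DAYS = 90                # unreferenced for this long = "unused"
DRY_RUN = True                    # flip only after consumers sign off

client = bigquery.Client(project=PROJECT)

# Anti-join the full table inventory (TABLE_STORAGE) against every table
# referenced by any job in the lookback window (JOBS.referenced_tables).
UNUSED_SQL = f"""
WITH recently_used AS (
  SELECT DISTINCT rt.dataset_id, rt.table_id
  FROM `{PROJECT}`.`region-us`.INFORMATION_SCHEMA.JOBS,
       UNNEST(referenced_tables) AS rt
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(),
                                      INTERVAL {LOOKBACK_DAYS} DAY)
)
SELECT s.table_schema AS dataset_id,
       s.table_name   AS table_id,
       ROUND(s.total_logical_bytes / POW(1024, 3), 2) AS logical_gib
FROM `{PROJECT}`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
LEFT JOIN recently_used AS u
       ON u.dataset_id = s.table_schema AND u.table_id = s.table_name
WHERE u.table_id IS NULL
ORDER BY logical_gib DESC
"""

for row in client.query(UNUSED_SQL).result():
    fqtn = f"{PROJECT}.{row.dataset_id}.{row.table_id}"
    if DRY_RUN:
        print(f"would delete {fqtn} ({row.logical_gib} GiB)")
    else:
        client.delete_table(fqtn, not_found_ok=True)  # step 2: the purge
```

Views don’t occupy storage, so they (and dashboards) need their own inventory - this sketch covers tables only.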
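For step 3, one way to encode a retention policy in BigQuery itself is a default table expiration per dataset. A minimal sketch, assuming a blanket 180-day default (real policies would differ per team), with the caveat that a default expiration only applies to tables created after it is set - existing tables need their own expiry or an explicit cleanup:

```python
# A sketch of step 3, assuming a blanket 180-day default expiration;
# real retention periods would differ per dataset & team.
from google.cloud import bigquery

PROJECT = "my-analytics-project"          # hypothetical project id
RETENTION_MS = 180 * 24 * 60 * 60 * 1000  # 180 days, in milliseconds

client = bigquery.Client(project=PROJECT)

for item in client.list_datasets():
    dataset = client.get_dataset(item.dataset_id)
    # Don't override datasets that already carry a (possibly stricter) policy.
    if dataset.default_table_expiration_ms is None:
        dataset.default_table_expiration_ms = RETENTION_MS
        client.update_dataset(dataset, ["default_table_expiration_ms"])
        print(f"set 180-day default expiration on {dataset.dataset_id}")
```

For the extended-retention teams, tables can be exported to GCS (e.g. with the client’s extract_table job) and parked in a cold storage class before they expire.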
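And for step 4, one shape such a bot can take - all names & thresholds here are hypothetical - is a two-phase sweep: tag unused tables first so owners get a grace period to object, then delete whatever stays tagged. Run daily via cron or Cloud Scheduler:

```python
# A sketch of a two-phase "BQ bot", with hypothetical names & thresholds.
# Phase 1 tags unused tables with a label (owners get a grace period to
# object); phase 2 deletes whatever stays tagged past that grace period.
import datetime
from google.cloud import bigquery

PROJECT = "my-analytics-project"  # hypothetical project id
LABEL = "marked-unused-on"        # BigQuery label keys must be lowercase
GRACE_DAYS = 14

client = bigquery.Client(project=PROJECT)

def tag_unused(candidates):
    """Phase 1: label (dataset_id, table_id) pairs, e.g. the output of
    the unused-table query from the first sketch."""
    today = datetime.date.today().isoformat()
    for dataset_id, table_id in candidates:
        table = client.get_table(f"{PROJECT}.{dataset_id}.{table_id}")
        if LABEL not in table.labels:
            table.labels = {**table.labels, LABEL: today}
            client.update_table(table, ["labels"])

def sweep_tagged():
    """Phase 2: delete tables tagged more than GRACE_DAYS ago.
    (A full scan per run - fine for a daily cron, slow at huge scale.)"""
    cutoff = datetime.date.today() - datetime.timedelta(days=GRACE_DAYS)
    for ds in client.list_datasets():
        for item in client.list_tables(ds.dataset_id):
            table = client.get_table(item.reference)
            tagged_on = table.labels.get(LABEL)
            if tagged_on and datetime.date.fromisoformat(tagged_on) <= cutoff:
                client.delete_table(table, not_found_ok=True)
```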

What we achieved:

  • Storage & compute cost savings
  • Better, cleaner, faster compute
  • Data management policies in place
  • Peace of mind

TL;DR

✅ Democratize data, but keep controls & checks in place right from the moment you democratize.

✅ Prioritize data management. Because if you don’t now, it’ll come back to haunt you.

✅ Know that useful data gets used; if it’s not getting used, it’s not required.

This piece is just about unused data, the biggest form of data debt. Eliminating unused data is a crucial step in enhancing data quality and reducing operational costs. Dormant data not only fails to contribute to business value but also burdens IT resources, increases storage demands, and can even lead to misguided decisions. In our experience, we could effectively archive or discard upwards of 20% of our data without compromising decision-making.