One of the biggest blockers for any team trying to successfully leverage AI is data quality (garbage in → garbage out). If you feed poor-quality data into your ML models, you're likely to end up with an incoherent AI producing results well below your expectations. Data quality often matters more than any hyperparameter tuning when it comes to improving your ML models. But we often don't have enough resources, time, bandwidth & capacity to clean data, which in turn keeps adding to the ever-growing data debt.
Building at scale, for Bharat, we faced data quality issues too. We realized that our ML models and recommender systems can't be trusted if our data can't be trusted.
So much data that nobody trusts!
Here's an example of one of the many data quality issues we dealt with.
One of our key user action tables had a high percentage of duplicates. While the data producers knew about the duplicate percentage, there was no straightforward way to identify the duplicates, so downstream consumers did not account for them. After a couple of teams raised it, the producers added a UUID column to identify duplicate rows, without fixing the underlying issue. 👽
Downstream consumers were still not aware that for about 2 years, they'd been monitoring north-star metrics on base data with a 2-5% duplicate rate per day, which fluctuated day to day. 💀
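To make this concrete, here's a rough sketch of how a consumer could at least measure the damage once that UUID column exists. The table and column names (user_actions, event_uuid, event_date) are made up for illustration, not our actual schema:

```python
import pandas as pd

def daily_duplicate_rate(df: pd.DataFrame,
                         uuid_col: str = "event_uuid",
                         date_col: str = "event_date") -> pd.Series:
    """Return the % of duplicate rows per day, keyed by date."""
    total = df.groupby(date_col).size()
    unique = df.groupby(date_col)[uuid_col].nunique()
    return ((total - unique) / total * 100).round(2)

# Example: spot the days where the duplicate rate drifts past 5%
# user_actions = pd.read_parquet("user_actions.parquet")   # hypothetical source
# rate = daily_duplicate_rate(user_actions)
# print(rate[rate > 5])
```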
After much debugging, here’s what we understood:
- Most engineering teams look at data as a "by-product" of their builds. And they have very little visibility into "where" the data is being used, by "who", and for "what".
- Analytics, product & data teams often filter data, i.e. they have their own cleaning / pre-processing methods (hacks) to fill any gaps in the data and ensure they're getting the results they expect.
- As these hacks grow more complex and larger in size, the data debt keeps increasing. Some details here.
- When a data quality issue is identified as being above a certain severity (SEV), it is fixed. But if the issue is below that SEV, it's put in the backlog for future fixes. Soon enough, context is lost, and data quality worsens day by day.
- Since there's no agreement between data producers and consumers, both are free to operate however they deem correct. For example, consumers will build additional layers of SQL filters to work around an apparent quality issue rather than pushing producers to fix it (a sketch of such a filter layer follows this list).
- As a result, nobody trusts data.
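Here's the kind of consumer-side "filter layer" that last point refers to. The column names (event_uuid, user_id, amount, country) are made up for illustration; the point is how these hacks quietly pile up far away from the producer:

```python
import pandas as pd

def consumer_side_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    """The 'hack layer' consumers bolt on instead of fixing the source."""
    df = df.drop_duplicates(subset="event_uuid")       # hide producer-side duplicates
    df = df[df["user_id"].notna()]                     # silently drop rows with missing users
    df = df[df["amount"].between(0, 1_000_000)]        # clamp away "impossible" outliers
    df["country"] = df["country"].fillna("IN")         # guess a default instead of asking why
    return df
```

Every consumer ends up with a slightly different version of this, and none of it ever makes its way back to the producer.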
How to solve this?
The very first step is to know, identify & acknowledge the issues with your data. Get all your stakeholders aligned so that everyone acknowledges the data quality issue(s).
A thorough audit of your data sources is essential. This involves not only looking at surface-level metrics but also delving deeper into the nuances of your data. Are there inconsistencies in formatting? Missing values? Outliers? Too many NULL values? A high duplicate percentage? Trust me, identifying these anomalies early on can save a lot of headaches down the road.
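A minimal audit sketch along those lines, assuming pandas and a hypothetical primary-key column (your real audit would obviously go deeper per table):

```python
import pandas as pd

def profile_table(df: pd.DataFrame, key_col: str) -> dict:
    """A first-pass audit: nulls, duplicates, and crude outlier counts."""
    report = {
        "rows": len(df),
        "duplicate_pct": round(100 * (1 - df[key_col].nunique() / len(df)), 2),
        "null_pct": (df.isna().mean() * 100).round(2).to_dict(),
    }
    # Flag numeric outliers with a simple 1.5 * IQR rule
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
        report[f"{col}_outliers"] = int(outliers)
    return report
```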
Once the issues were identified, we realized it was high time we solved most of them. So, as a one-time activity, we created a war-room for data quality with data producers, consumers & all other important stakeholders, where we tracked every existing data quality issue through to resolution.
The next crucial step is to create controls around these issues. This could involve implementing data validation checks, setting up automated processes to flag anomalies, and establishing clear protocols for data entry and maintenance. And if you have the time & energy, you should absolutely invest in establishing data contracts 👀
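For a flavour of what such a control could look like, here's a rough idea of a lightweight data contract check. The table and column names, and the thresholds, are made up for illustration; real contracts (and the tooling around them) are a much bigger topic:

```python
import pandas as pd

# A minimal "data contract": producer and consumer agree on these rules,
# and the pipeline fails loudly (or alerts) when they're violated.
CONTRACT = {
    "table": "user_actions",                 # hypothetical table name
    "primary_key": "event_uuid",
    "non_nullable": ["user_id", "event_ts"],
    "max_duplicate_pct": 0.1,
}

def enforce_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations for this batch of data."""
    violations = []
    dup_pct = 100 * (1 - df[contract["primary_key"]].nunique() / len(df))
    if dup_pct > contract["max_duplicate_pct"]:
        violations.append(f"duplicate_pct={dup_pct:.2f} exceeds {contract['max_duplicate_pct']}")
    for col in contract["non_nullable"]:
        nulls = int(df[col].isna().sum())
        if nulls:
            violations.append(f"{col} has {nulls} NULLs")
    return violations
```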
Monitoring Data Quality
Continuous monitoring is key to ensuring that your data remains clean and reliable over time. This means setting up systems that regularly scan and verify incoming data, as well as periodically reviewing and updating your data quality protocols / rules.
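In practice that can be as simple as a scheduled job (cron, Airflow, whatever you already run) evaluating each day's partition against agreed thresholds and alerting on breaches. A bare-bones sketch, with hypothetical columns and limits:

```python
import pandas as pd

THRESHOLDS = {"duplicate_pct": 1.0, "null_pct_user_id": 0.5}  # illustrative limits

def monitor_partition(df: pd.DataFrame, partition_date: str) -> None:
    """Run on each day's partition and alert when any metric breaches its threshold."""
    metrics = {
        "duplicate_pct": 100 * (1 - df["event_uuid"].nunique() / len(df)),
        "null_pct_user_id": 100 * df["user_id"].isna().mean(),
    }
    breaches = {k: round(v, 2) for k, v in metrics.items() if v > THRESHOLDS[k]}
    if breaches:
        # Swap this print for your alerting channel (Slack, PagerDuty, email, ...)
        print(f"[DQ ALERT] {partition_date}: {breaches}")
```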
This is where we encountered the classic build vs buy dilemma: should we build or buy the monitoring tool?
Building your own monitoring system gives you full control over customization and integration with your existing infrastructure. You can train anomaly detection models on your own data so they fit it perfectly and catch your specific issues. But as fancy as that sounds, it can be super time-consuming and resource-intensive. We had neither the engineering bandwidth nor the time to invest in this.
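Even a "model" as simple as a rolling z-score over daily metrics goes a long way if you do go down this path; here's a sketch of that idea (the input series is hypothetical, e.g. daily row counts or duplicate rates):

```python
import pandas as pd

def flag_anomalies(daily_metric: pd.Series, window: int = 14, z_cutoff: float = 3.0) -> pd.Series:
    """Flag days where a metric deviates sharply from its recent history."""
    rolling_mean = daily_metric.rolling(window, min_periods=window).mean().shift(1)
    rolling_std = daily_metric.rolling(window, min_periods=window).std().shift(1)
    z = (daily_metric - rolling_mean) / rolling_std
    return z.abs() > z_cutoff

# daily_rows = events.groupby("event_date").size()   # hypothetical input
# print(flag_anomalies(daily_rows))
```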
So we decided to use a pre-built solution for convenience and potentially faster implementation.
Some tools we considered:
- Great Expectations (open source) — see the quick sketch after this list
- Metaplane
- BigEye
- Monte Carlo
- GCP's Dataplex
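As a flavour of what rule-based checks look like in such a tool, here's a tiny Great Expectations example. Note that this uses the older pandas-dataset style API (ge.from_pandas), which may differ from newer GE releases, so treat it as a sketch rather than a copy-paste recipe:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical mini-table standing in for a real user-action partition
df = pd.DataFrame({"event_uuid": ["a", "b", "b"], "user_id": [1, 2, None]})

ge_df = ge.from_pandas(df)
results = [
    ge_df.expect_column_values_to_be_unique("event_uuid"),
    ge_df.expect_column_values_to_not_be_null("user_id"),
]
for r in results:
    print(r.success)  # both fail here: one duplicate UUID, one NULL user_id
```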
Which one did we end up choosing? Let’s wait for another piece for that 😛
Ultimately, the choice between building and buying depends on factors such as your budget, timeline, and technical expertise. We chose buy, but whichever route you choose, the goal remains the same: to ensure that your data is clean, consistent, and trustworthy.
Because, remember garbage in → garbage out – and in the world of AI, that’s a recipe for disaster.