Bad data. It seems this is the modern villain that keeps us from having a truly data-driven business environment. It keeps popping up in conversations, reports, planning and forecasting. It seems easy and simple to blame bad data for not achieving a goal.
According to a study in the Harvard Business Review, half (50%) of people working with data waste their time searching for data, finding and correcting errors, and looking for corroborating sources for the data they do not trust.
When we point to bad data as the reason for, or the biggest obstacle to, achieving our goals, we are pointing at a symptom that can have many causes.
What is bad data?
First, what is bad data, often described as "dirty" or "rogue"? Simply put, it is data that contains errors: spelling or punctuation mistakes, incomplete data, outdated data, duplicate instances in the database and incorrect data associations. Bad data is data that teams do not trust or, worse, data that we trust but should not.
But what causes bad data? A lot of things. Bad data is the result of a series of events. Below, we list these events, along with their causes and possible steps for a quick fix.
Incomplete data
Cause: Incomplete data comes in a few forms: completely missing or partially completed data. Incompleteness not only limits the insights we can extract from the data (such as reporting and analytics), but also limits any data-driven operations (such as AI/ML).
Solution: Implement data creation "gatekeepers" that stop the creation of incomplete data. Help customers (and your company) fill out forms, for example, with a typeahead or auto-complete feature that relies on a robust set of external reference data to populate the form. Practice governance to ensure mandatory fields are completed intelligently through data quality checks.
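A "gatekeeper" can be as simple as a check that runs before a record is created. The following is a minimal Python sketch, not a definitive implementation: the field names in REQUIRED_FIELDS and the function names are hypothetical, and a real check would follow your own governance rules for mandatory fields.

```python
# Hypothetical mandatory fields; replace with the ones your governance policy defines.
REQUIRED_FIELDS = ("company_name", "email", "country")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None, empty strings and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def accept_record(record, required=REQUIRED_FIELDS):
    """Gatekeeper: raise instead of letting an incomplete record be created."""
    missing = missing_fields(record, required)
    if missing:
        raise ValueError("incomplete record, missing: " + ", ".join(missing))
    return record
```

Calling accept_record at the point of data entry (a form handler or an import job, for example) rejects partial records before they reach reporting or AI/ML pipelines, rather than cleaning them up afterwards.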
Duplicate data
Cause: Duplicate data occurs when records unintentionally share characteristics with other records in the database. When duplicate records exist in a data ecosystem, aggregations overcount and reports and analyses show incorrect values. This wastes effort and creates confusion about the numbers. Managing the business becomes harder as the impact of duplicate data grows.
Solution: Decide which duplicates you want to keep, let go or archive. You make this decision through clustering (match/merge) techniques: bring similar versions of a record together in a cluster, then choose the best version as the main entity and treat the rest as members of that group. This is a systematic way to de-duplicate the data. Since not all duplicates are equal, you may need to keep a few (for business or legal reasons) within a manageable cluster. This is the concept of a golden record.
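To make the match/merge idea concrete, here is a small Python sketch under simplifying assumptions. The match key (a company name stripped of case, spaces and punctuation) and the survivorship rule (most complete record wins, ties go to the latest update) are illustrative choices, not a prescribed method; production match/merge engines typically use fuzzier matching and richer rules. The sample records and field names are made up.

```python
from collections import defaultdict


def match_key(record):
    """Crude match key: the name in lowercase with spaces and punctuation removed."""
    name = record.get("name", "")
    return "".join(ch for ch in name.lower() if ch.isalnum())


def completeness(record):
    """Count the fields in a record that are neither None nor empty."""
    return sum(1 for value in record.values() if value not in (None, ""))


def build_clusters(records):
    """Group records that share a match key into clusters."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record)
    return dict(clusters)


def golden_record(cluster):
    """Pick the most complete record; ties go to the latest 'updated' date."""
    return max(cluster, key=lambda r: (completeness(r), r.get("updated", "")))


# Made-up sample data: two spellings of the same company plus one distinct company.
records = [
    {"id": 1, "name": "Acme, Inc.", "phone": "", "updated": "2022-01-10"},
    {"id": 2, "name": "ACME Inc", "phone": "555-0100", "updated": "2021-06-01"},
    {"id": 3, "name": "Globex", "phone": "555-0199", "updated": "2022-03-05"},
]
clusters = build_clusters(records)
golden = {key: golden_record(members) for key, members in clusters.items()}
```

The other members of each cluster are kept alongside the golden record rather than deleted, which matches the point that some duplicates must be retained for business or legal reasons.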
Disparate source systems (data silos)
Cause: It is almost inevitable to have many different source systems. In fact, a 2021 Dun & Bradstreet study found that the average sales and marketing technology stack contains at least 10 tools. Today's complex business environment practically requires them. Managing and keeping track of all these tools can be a daunting task. While the tools may not share the same processes, their data should relate to the other data sets. After all, you want uniform data everywhere. The concepts of data warehousing, data lakes and now data meshes were devised to make managing data from different systems possible and scalable.
Solution: The obvious response is to construct a data lake, but it is not enough to house all the data in one place. If the data entering the data lake is not managed and qualified, your wonderful data lake turns into a data swamp. In addition to technically securing the flow of data through connections such as APIs, you need to manage the data in the lake, using clustering methods to house data from different sources in a common environment. When you can create a golden record by clustering similar entities, you gain a better understanding of which data overlaps and which is new. With a match/merge engine, you can better manage existing and new data sources in your data lake.
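As a rough illustration of qualifying data on its way into a common environment, the Python sketch below maps records from two source systems onto one schema, tags each with its origin, and separates overlapping records from new ones. Everything named here is an assumption made for the example: the two field maps, the source labels, and the choice of a lowercase email address as the comparison key.

```python
# Hypothetical field maps: source-specific field name -> common schema field name.
CRM_MAP = {"Email": "email", "AccountName": "name"}
MARKETING_MAP = {"email_address": "email", "company": "name"}


def normalize(record, source, field_map):
    """Rename source-specific fields to the common schema and tag the origin."""
    common = {target: record.get(src) for src, target in field_map.items()}
    common["source"] = source
    return common


def classify_incoming(existing, incoming, key="email"):
    """Split incoming records into (overlapping, new) by a case-insensitive key."""
    known = {str(r[key]).lower() for r in existing if r.get(key)}
    overlapping, new = [], []
    for record in incoming:
        value = record.get(key)
        if value and str(value).lower() in known:
            overlapping.append(record)
        else:
            new.append(record)
    return overlapping, new
```

Overlapping records are candidates for the match/merge step, while new records extend coverage; keeping the source tag on every record preserves lineage once the data sits side by side.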
Data decay
Cause: Of all business-to-business master data, contact data arguably decays the fastest. In some areas, data can decay by 34% annually. That is alarming for data-driven organizations, which derive their decision-making insights from data and depend on it more and more to run the business. The current economic situation makes it even more pressing to pay attention to data decay. Companies going out of business, supply issues and The Great Resignation all add complexity on top of the mergers, acquisitions and divestitures the market already expects. How can you make sure your data stays relevant?
Solution: Data enrichment. You need to be able to periodically enrich the data with a reliable external reference data source. As the old saying goes: don't throw the baby out with the bathwater. It is easy to write off your current data resources as underperforming, based on poor results or on anecdotes from those who depend on them. Instead, work with outside sources or third parties to provide current attributes for your existing contact data. As discussed above, we face data attrition rates of 34% or more per year. You need an effective enrichment schedule tied to your organization's threshold for data accuracy. Running enrichment ad hoc does your users a disservice because it does not scale. Get an enrichment strategy and schedule in place and communicate them to your stakeholders.
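One way to tie the schedule to an accuracy threshold is a back-of-the-envelope calculation. The Python sketch below assumes, purely for illustration, that decay compounds at a constant annual rate and that all records are accurate immediately after an enrichment run; real decay varies by data type and industry, so treat the result as a starting point. The function names are hypothetical.

```python
import math


def fraction_still_accurate(annual_decay, years):
    """Share of records still accurate after `years`, assuming constant compounding decay."""
    return (1.0 - annual_decay) ** years


def max_months_between_enrichments(annual_decay, accuracy_threshold):
    """Longest interval, in months, that keeps accuracy at or above the threshold."""
    if not 0.0 < annual_decay < 1.0:
        raise ValueError("annual_decay must be between 0 and 1")
    if not 0.0 < accuracy_threshold <= 1.0:
        raise ValueError("accuracy_threshold must be in (0, 1]")
    # Solve (1 - annual_decay) ** years = accuracy_threshold for years.
    years = math.log(accuracy_threshold) / math.log(1.0 - annual_decay)
    return 12.0 * years


# With the 34% annual decay cited above and a 90% accuracy target:
months = max_months_between_enrichments(0.34, 0.90)
```

Under these assumptions, a 34% annual decay rate and a 90% accuracy target work out to an enrichment cycle of roughly three months, which is the kind of figure you can agree on with stakeholders instead of enriching ad hoc.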
A case for data governance
These recommendations and best practices are just pieces of a larger puzzle. There is a strong need for data governance to set policies and enforce data quality standards so that the hemorrhage of substandard data is stopped. The good news is that many of the proposed solutions are feasible, and can be automated at scale, with AI and ML.
The above recommendations, in addition to understanding where, when and how to implement these steps, are crucial to your data strategy. The solution and root cause are the same: data governance. It is a function we can no longer do without. Our increasing reliance on data is the perfect example.