For many organizations, data quality is a key issue and a major challenge. In this article, we share typical lessons learned and best practices for your data quality initiatives.
Data quality is one of the key challenges for data-driven organizations
Studies have found that 41% of organizations consider inconsistent data from different systems to be one of their biggest challenges. More and more data is being generated, and everyone wants to take full advantage of its value. But if data quality is neglected, it becomes hard to make sense of that data. Failed or discarded analytics and data science projects are common consequences, as are suboptimal management decisions. Using poor-quality data can cause enormous financial damage: sales and marketing departments, for example, incur losses of €10,000 per year and many hundreds of hours of wasted working time, and sales losses can be as high as 20%. Looking ahead, with modern technologies such as AI, automation and IoT, in which almost all companies are investing, high data quality is essential.
What makes data quality such a complex challenge?
Few data topics touch as many business areas as data quality. It is relevant wherever data is produced and used: whether in production, marketing, IT or business strategy. This also leads to a key lesson for data quality initiatives: "everywhere, and preferably immediately" is not a good strategy. Like other large projects, such as migrations, a data quality initiative should be carefully planned and well supported, both technologically and by management.
So how can a company concretely improve data quality? Even determining the status quo seems to be a time-consuming task, as is the subsequent analysis and elimination of problems.
The "project" data quality initiative
Basically, the typical flow of a data quality initiative is very similar to that of other data projects. It starts with defining the scope, in this case choosing an appropriate department or area to pioneer. The next step is project initiation (in our experience a very critical step), followed by project execution and the sustained establishment of processes. The following best practices describe what needs to be considered in each of these phases.
Best practices
Step 1: Identifying a good starting point
Good starting points for DQ initiatives are areas of the company where data is very valuable and has a big impact. Usually these are systems and data sources at the beginning of the data journey, e.g. CRM systems where a customer is registered for the first time. Helpful questions to identify the right starting point are:
- Where is data generated with a high degree of manual effort?
- What data is scattered across many systems and departments?
- In which area do we see a high risk of data being incorrect / of poor quality (e.g. production systems, packaging at food companies)?
Step 2: Inventory of the entire project framework
In the next step, we recommend gaining transparency about the data and processes in the area of your choice. For this end-to-end analysis, it is crucial to get stakeholders from the entire process chain on board, from data generation through processing to the end users. For such evaluations, especially the investigation of data flows and interrelationships, specialized tools can be helpful. At this point, also take real-world implementation and human realities into account, e.g., the communication between different teams.
Step 3: Analysis of the first data package
Now select a reasonable set of specific data entries and identify the attributes of that data that are relevant. For example, the target data set could be a table of customer data; relevant attributes here could be name, phone number, gender and status. We do not recommend selecting all attributes, as this only adds effort without adding value in the end. If defining the relevant attributes proves challenging, it is often worth taking a look at the data flows: which data actually ends up being used?
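To make this more concrete, here is a minimal sketch of what selecting such a first data package could look like, assuming the customer data is available as a CSV export and pandas is used; the file name and column names are purely illustrative:

```python
import pandas as pd

# Illustrative example: load a CRM customer export and restrict the analysis
# to the attributes that are actually used downstream.
customers = pd.read_csv("crm_customer_export.csv")  # hypothetical export file

relevant_attributes = ["name", "phone_number", "gender", "status"]
sample = customers[relevant_attributes].head(1000)  # a reasonable first data package

# First impression of completeness per attribute
print(sample.isna().mean().rename("share_missing"))
```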
Step 4: Prioritize the data quality issues found
In the next step, go through the selected set of entries and identify all errors. Pick out the three to five attributes with the highest number of errors, but no more than that. Next, prioritize them in terms of estimated impact and risk. To return to the customer data example: a missing name makes it impossible to contact the customer correctly, while an incorrect gender, and thus an incorrect salutation, may leave a worse impression but is not necessarily a knock-out criterion for further purchases.
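As a rough illustration of how such an error count and ranking could be implemented, continuing with the illustrative customer attributes from the sketch above; the check rules are simplified placeholders, not production-ready validation logic:

```python
import pandas as pd

# Simplified, illustrative check rules per attribute: each returns True where an entry is faulty
checks = {
    "name": lambda s: s.isna() | (s.astype(str).str.strip() == ""),
    "phone_number": lambda s: ~s.fillna("").astype(str).str.fullmatch(r"\+?[0-9 ()/-]{6,}"),
    "gender": lambda s: ~s.isin(["female", "male", "diverse"]),
    "status": lambda s: s.isna(),
}

# Count errors per attribute in the selected data package
error_counts = {attr: int(check(sample[attr]).sum()) for attr, check in checks.items()}

# Rank attributes by error count and focus on the top three to five
ranking = pd.Series(error_counts).sort_values(ascending=False)
print(ranking.head(5))
```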
Step 5: Definition of future actions
Finally, we come to the core task: defining measures to a) correct the existing data and b) avoid future errors. This could be done, for example, through automatic checks when the data is created or by introducing monitoring. At this point, it is advisable to align the planned measures with the overall data strategy and any planned innovations, e.g. cloud migrations.
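What such an automatic check at creation time could look like is sketched below. This is a generic illustration, not tied to any specific CRM, and the validation rules are deliberately simple:

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_new_customer(record: dict) -> list[str]:
    """Return a list of data quality issues found in a new customer record."""
    issues = []
    if not record.get("name", "").strip():
        issues.append("name is missing")
    email = record.get("email", "")
    if not email:
        issues.append("email is missing")
    elif not EMAIL_PATTERN.match(email):
        issues.append("email is not valid")
    return issues

# Example: flag or reject the record before it enters the CRM
problems = validate_new_customer({"name": "Jane Doe", "email": "jane.doe@example"})
if problems:
    print("Record flagged:", problems)
```

Checks like this prevent new errors at the source; monitoring, in contrast, catches errors that still slip through or arise later in the data's life cycle.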
Tip: A data catalog as support
Historically evolved data architectures full of distributed data sources and knowledge silos are typically major hurdles in such initiatives.
Data catalogs offer a holistic approach to overcoming these challenges. By creating transparency across all existing data sets, their storage location, responsibilities and origin (data lineage), the enterprise gains a holistic overview.
Building on this, advanced data catalogs with data profiling functions enable those responsible to quickly identify where problems occur. Thanks to the documented responsibilities, the right person can be contacted directly. In addition, conclusions about the causes of problems can be drawn from the inventoried metadata. The data lineage also shows where the affected data records are used downstream, so that critical analyses, for example, can still be corrected in time.
As new data is constantly being generated, and above all as more and more data streams and sources are tapped, it is essential to build a robust infrastructure. This ensures that data quality is maintained in terms of accuracy, completeness, consistency, timeliness, validity, and uniqueness for all future initiatives. Investing in a data catalog to improve enterprise-wide data quality pays for itself very quickly.
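How such monitoring could recompute a few of these dimensions periodically is sketched below; the metrics and column names are illustrative and assume a customer table like the one used in the earlier sketches:

```python
import pandas as pd

def quality_metrics(customers: pd.DataFrame) -> dict:
    """Illustrative metrics for three of the six data quality dimensions."""
    valid_email = customers["email"].fillna("").str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+")
    return {
        "completeness_email": 1 - customers["email"].isna().mean(),
        "validity_email": valid_email.mean(),
        "uniqueness_customer_id": customers["customer_id"].is_unique,
    }

# A scheduled job could log these metrics and alert when they drop below a threshold.
```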
How to create a business case for data quality initiatives
Let's say you are a B2C company and record your customers in your CRM, but a valid email address is not required for each customer (no validation, no required field). The company gains 20,000 new customers per year, and 90% of the email addresses are entered. About 5% of those will be wrong, as there is no checking mechanism. This results in roughly 86% valid email addresses (0.90 × 0.95 = 85.5%) that the marketing department can use. In this example, two types of problems occur:
1. relevant data is missing in important fields
2. the data entered in this field is not valid
Both are data quality issues that now cost around 15% of the potential revenue of a marketing campaign (e.g., an upselling campaign), because these customers cannot be reached. Studies also show that the level of data quality correlates directly with the amount of process effort required.
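The arithmetic behind this example can be written out as a small sketch; the revenue per reachable customer is a purely hypothetical placeholder used only to show how a euro figure can be derived:

```python
new_customers_per_year = 20_000
share_email_entered = 0.90   # no required field: 10% of addresses are never entered
share_entered_valid = 0.95   # no validation: about 5% of entered addresses are wrong

reachable = new_customers_per_year * share_email_entered * share_entered_valid
print(f"Reachable customers: {reachable:.0f} ({reachable / new_customers_per_year:.1%})")
# -> 17100 reachable (85.5%), i.e. roughly 15% of new customers cannot be contacted

revenue_per_reachable_customer = 50  # hypothetical placeholder, in EUR
lost_revenue = (new_customers_per_year - reachable) * revenue_per_reachable_customer
print(f"Potential campaign revenue lost per year: EUR {lost_revenue:,.0f}")
```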
Data quality initiatives pay off
In addition to the purely quantitative benefits, increased data quality and an established data catalog provide an optimal foundation for the transformation to a data-driven enterprise. All employees can rely on correct data that is available to them in a central location. Both points are essential for data democratization.
In summary, a data quality initiative is a highly customized process. The best place to start varies by company activity and existing processes, but can be identified by answering the questions above. Proven steps for project implementation include:
- Gain visibility
- Select target data entries
- Identify relevant attributes
- Identify defects in a subset and prioritize them
- Take action
Data catalogs can be a powerful tool to ensure sustainable data quality throughout the data architecture.