This post is part 3 of a 5-part series providing insight into Edgeworth Analytics’ approach to data analytics for businesses. In our last post, we discussed “Crafting the Fundamental Business Question.” Check back soon for “Data Analytics: Developing Rigorous Analysis.”
Unfortunately for all of us, data does not come ready to use. Even identifying the relevant information within a mountain of data can be a daunting task. Yet the process of finding the pertinent data, cleaning it, and validating it is one of the most important stages of any analytics project.
Why is this, you may ask? Analysis and insights are only as good as the data behind them. A common saying from computer science is “garbage in, garbage out”: nonsense input data produces nonsense output, no matter how good the processing in between. The same holds true in analytics.
Identifying Relevant Data Sources
The first step in collecting and preparing data is identifying which data can be used to answer the business question at hand. When we work with clients, this process is guided by our understanding of the clients’ business and how the data reflects those business realities. While companies often track tremendous amounts of data, not all of that data is useful for a given project. Further, data may be kept by different teams and divisions that don’t normally work together. We work across the organization to help identify usable data from these different sources. By identifying just the relevant sources, you can limit the time and resources spent extracting, cleaning, and validating data.
Understand, Clean, and Combine
Next, data analysts must combine the relevant data sources, make sure they understand the structure and meaning of each field, and identify and correct errors and inaccuracies in the data that may affect results. During this process, we may find that certain relevant fields have missing or inaccurate values, or that the way the data is tracked changed over time. The ultimate goal is to have a single, combined source of accurate data to analyze. Understanding these intricacies allows for the most accurate and robust analysis and insights moving forward.
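To make this step concrete, here is a minimal sketch in Python of combining two sources and flagging problem fields. It is an illustration only, not our production workflow: the two extracts, their field names (customer_id, cust, invoice_date, amount), the sample values, and the helper functions are all hypothetical, and we assume the date format changed at some point in the billing data.

```python
from datetime import datetime

# Hypothetical extracts from two teams; field names and values are invented.
crm_rows = [
    {"customer_id": " 001", "region": "East"},
    {"customer_id": "002", "region": ""},
    {"customer_id": "003", "region": "West"},
]
billing_rows = [
    {"cust": "001", "invoice_date": "2023-01-15", "amount": "1,200.50"},
    {"cust": "002", "invoice_date": "01/20/2023", "amount": "300"},
    {"cust": "004", "invoice_date": "2023-02-01", "amount": "n/a"},
]

# Assumption: the way dates were recorded changed over time.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Convert a string such as '1,200.50' to a float; None if unparseable."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def combine(crm, billing):
    """Join billing rows to CRM rows on a cleaned customer id and log issues."""
    regions = {
        row["customer_id"].strip(): (row["region"].strip() or None)
        for row in crm
    }
    combined, issues = [], []
    for row in billing:
        cid = row["cust"].strip()
        record = {
            "customer_id": cid,
            "region": regions.get(cid),
            "invoice_date": parse_date(row["invoice_date"]),
            "amount": parse_amount(row["amount"]),
        }
        # Record every missing or unparseable field for follow-up.
        for field, value in record.items():
            if value is None:
                issues.append((cid, field))
        combined.append(record)
    return combined, issues


combined, issues = combine(crm_rows, billing_rows)
for cid, field in issues:
    print(f"customer {cid}: missing or invalid {field}")
```

The sketch flags issues rather than silently dropping or filling records. That choice leaves the decision about how to correct each problem to someone who understands the business context.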
Validate and Document
Finally, validate that the data is ready to use by running simple tests. For example, are the sales totals in the data consistent with business realities? Are there outliers that are correct but might later affect your analysis and need explaining? Once the data is ready to analyze, the last step is to document any actions you took when preparing the data and anything important that users should know when analyzing it.
While collecting and preparing data, we like to use this general checklist to guide our data review process.
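The two tests mentioned above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the monthly sales figures, the reported total, the 1% tolerance, and the outlier threshold are all invented for the example, and the median-based outlier rule is just one reasonable choice among several.

```python
from statistics import median


def total_matches(amounts, reported_total, tolerance=0.01):
    """True if the data total is within a relative tolerance of the reported figure."""
    total = sum(amounts)
    return abs(total - reported_total) <= tolerance * abs(reported_total)


def flag_outliers(amounts, threshold=5.0):
    """Return values whose distance from the median exceeds threshold x MAD.

    MAD is the median absolute deviation from the median.
    """
    mid = median(amounts)
    mad = median([abs(a - mid) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against
    return [a for a in amounts if abs(a - mid) / mad > threshold]


# Hypothetical monthly sales from the prepared data.
monthly_sales = [100.0, 110.0, 95.0, 105.0, 102.0, 480.0]
# Hypothetical figure from an independent source, such as a finance report.
reported_total = 990.0

totals_ok = total_matches(monthly_sales, reported_total)
outliers = flag_outliers(monthly_sales)

print("Totals consistent with reported figure:", totals_ok)
print("Values to review and explain:", outliers)
```

A flagged value is not necessarily wrong. It may be a genuine spike that simply needs a note in the documentation so later users of the data are not surprised by it.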
Checklist for Data Review