How do I gauge the quality of a dataset?

Assessing the Quality of a Dataset

Big Data algorithms cannot be any more accurate than the data used to train them. If the data is sub-par in any way, then decisions based on the analysis will be flawed as well. It is therefore important to assess your data quality against several criteria, outlined below:

  • Validity: Make sure that the data set has all the appropriate and relevant input variables required for the analytic model to produce the best results.
  • Completeness: Determine the extent to which there are missing and/or incorrect entries, and estimate level of effort needed for corrections.
  • Consistency: Consider that business rules might have changed during the collection period, thus rendering earlier data inconsistent with later data.
  • Accuracy: Ensure that the data is from a sample that is large enough to realistically represent the subject being modeled by the analytics.
  • Timeliness: Confirm that any data from the distant past is not too outdated to be relevant, if it is to be used to make predictions about the future.
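
Some of these criteria can be checked roughly in code before any modeling begins. The following is a minimal sketch in Python, not a definitive method: it measures completeness as the fraction of missing values per field, and timeliness as the fraction of records older than a chosen cutoff date. The function names, the field names (age, city, recorded), the sample rows, and the cutoff are all hypothetical and chosen only for illustration.

```python
from datetime import date


def completeness_report(rows, required_fields):
    """Return the fraction of missing values for each required field.

    A value counts as missing if it is absent, None, or an empty string.
    """
    total = len(rows)
    report = {}
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        report[field] = missing / total if total else 0.0
    return report


def stale_fraction(rows, date_field, cutoff):
    """Return the fraction of dated rows recorded before the cutoff date."""
    dated = [row[date_field] for row in rows if row.get(date_field) is not None]
    if not dated:
        return 0.0
    return sum(1 for d in dated if d < cutoff) / len(dated)


# Hypothetical sample records, invented for this illustration.
rows = [
    {"age": 34, "city": "Toronto", "recorded": date(2015, 3, 1)},
    {"age": None, "city": "Toronto", "recorded": date(2009, 6, 15)},
    {"age": 27, "city": "", "recorded": date(2016, 1, 10)},
    {"age": 45, "city": "Toronto", "recorded": date(2016, 2, 2)},
]

print(completeness_report(rows, ["age", "city", "recorded"]))
print(stale_fraction(rows, "recorded", date(2012, 1, 1)))
```

Numbers like these only flag where to look. Deciding how much missing or outdated data is acceptable still depends on the model and the business question.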


In general, most data sets will be quite complex, so careful judgment will be needed when using these criteria. However, a few basic examples can be offered to illustrate situations in which these criteria would provide useful guidance:

  • Consistency Example 1: Consider a situation in which customer ages were recorded in 5-year bins (21-25, 26-30, etc.) during the first year of business, but then were recorded by exact age in the second year. In this case the data set lacks consistency, and the second year of data should be recoded into the 5-year bins before processing.
  • Consistency Example 2: Consider two data sets having exactly the same variables, but collected by different methods. Set A comes from voluntary online responses, whereas Set B comes from a phone survey of randomly selected people. Randomization guards the phone survey against selection bias, but the web survey might be biased because people holding strong negative opinions about the topic might be more likely to respond. Merging these two data sets would likely lead to misleading results.
  • Accuracy Example: Consider consumer information that has been collected across the city of Toronto. This would not be adequate for making predictions about the behavior of Canadians in other parts of the country.
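
The recoding described in Consistency Example 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name age_to_bin and the sample ages are hypothetical, and it assumes the bins start at age 21 and are 5 years wide, as in the example above.

```python
def age_to_bin(age, start=21, width=5):
    """Map an exact age to its bin label, e.g. 23 -> "21-25".

    Assumes bins begin at `start` and each spans `width` years.
    """
    if age < start:
        raise ValueError(f"age {age} is below the first bin ({start})")
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical second-year records, stored as exact ages.
year_two_ages = [23, 26, 30, 41]

# Recode them so they match the first-year 5-year bins.
recoded = [age_to_bin(age) for age in year_two_ages]
print(recoded)  # ['21-25', '26-30', '26-30', '41-45']
```

Recoding in this direction loses some detail from the second year, but it is the only direction available, since exact ages cannot be recovered from the first-year bins.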

