- The most straightforward definition of data quality is the quality of the content (values) in one's dataset. For example, if a dataset contains names and addresses of customers, all names and addresses have to be recorded (data is complete), they have to correspond to the actual names and addresses (data is accurate), and all records have to be up-to-date (data is current).
- The most common characteristics of data quality include completeness, validity, consistency, timeliness, and accuracy. Additionally, data has to be useful (fit for purpose), documented, and reproducible/verifiable.
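As a rough illustration (not part of the original text), several of these dimensions can be operationalized as simple per-dataset metrics. The customer records, field names, format rule, and freshness threshold below are all hypothetical; real assessments need disciplinary knowledge, as noted below.

```python
import re
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"name": "Ada Lovelace", "address": "12 Byron St", "updated": date(2017, 2, 10)},
    {"name": "",             "address": "5 Main St",   "updated": date(2017, 2, 1)},
    {"name": "Grace Hopper", "address": None,          "updated": date(2015, 6, 30)},
]

def completeness(records, fields):
    """Fraction of expected field values that are actually present."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f))
    return filled / total

def validity(records, field, pattern):
    """Fraction of present values matching a simple format rule."""
    present = [r[field] for r in records if r.get(field)]
    return sum(1 for v in present if re.match(pattern, v)) / len(present)

def currency(records, as_of, max_age_days=365):
    """Fraction of records updated within max_age_days of as_of (timeliness)."""
    fresh = sum(1 for r in records if (as_of - r["updated"]).days <= max_age_days)
    return fresh / len(records)

print(completeness(customers, ["name", "address"]))   # 4 of 6 values present
print(validity(customers, "name", r"^[A-Za-z .'-]+$"))
print(currency(customers, date(2017, 2, 14)))         # 2 of 3 records fresh
```

Accuracy is deliberately omitted: unlike the checks above, it cannot be computed from the dataset alone and requires comparison against an external source of truth.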
- At least four activities impact the quality of data: modeling the world (deciding what to collect and how), collecting or generating data, storage/access, and formatting/transformation.
- Assessing data quality requires disciplinary knowledge and is time-consuming.
- Open data quality issues include: how to measure quality, how to track the lineage of data (provenance), when data is "good enough," what happens when data is mixed and triangulated (especially high-quality and low-quality data), and crowdsourcing for quality.
- Data quality is the responsibility of both data providers and data curators: data providers ensure the quality of their individual datasets, while curators help the community with consistency, coverage, and metadata.
Read more about data quality or join the conversation on Twitter and Facebook using #LYD17 or #loveyourdata