
“Data is like garbage. You’d better know what you are going to do with it before you collect it.”
– Commonly attributed to Mark Twain, American writer and humorist
What is “Clean” Data?
Clean data has been at the forefront of my mind lately. Several of my clients use Excel spreadsheets to track data. I’ve been training one company on creating project plans that involve a multitude of tasks and variables, and another on adding metadata to their files in Microsoft. On top of that, I’m in the process of implementing a new CRM (client relationship management) system, which holds a ton of data.
The formal definition of clean data is a set of data that is accurate, complete, consistent, and correctly formatted. It is free from errors, duplicates, and structural inconsistencies. Because it has been cleaned to remove irrelevant, incorrect, or incomplete records, it provides high-quality information suitable for analysis, reporting, and machine learning (AI).
Key Characteristics of Clean Data:
- Accurate: Reflects true, valid values without errors.
- Complete: Contains all necessary information, with no missing or NULL values.
- Consistent: Follows uniform formatting, structures, and units across the dataset.
- Unique: Free from duplicate records.
- Valid: Adheres to defined rules or constraints for the data type.
Essentially, clean data is ready for analysis without requiring further repair or changes.
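To make these characteristics concrete, here is a minimal Python sketch of how a few of them could be checked on task records exported from a spreadsheet. The field names (task, owner, due_date, hours), the YYYY-MM-DD date format, and the rule that a duplicate is a repeated task/owner pair are all assumptions made for illustration, not rules from any particular tool. Accuracy is left out on purpose: whether a value reflects reality usually has to be confirmed against the source, not by a script.

```python
from datetime import datetime

# Hypothetical required fields for a task record (an assumption for this sketch).
REQUIRED_FIELDS = ("task", "owner", "due_date", "hours")


def check_record(record):
    """Return a list of problems found in one record (empty list means no issues found)."""
    problems = []

    # Complete: every required field is present and not blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")

    # Consistent: dates follow one format (YYYY-MM-DD is assumed here).
    due = record.get("due_date")
    if due:
        try:
            datetime.strptime(due, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append("due_date not in YYYY-MM-DD format")

    # Valid: hours must be a non-negative number.
    hours = record.get("hours")
    if hours is not None and str(hours).strip() != "":
        try:
            if float(hours) < 0:
                problems.append("hours is negative")
        except (TypeError, ValueError):
            problems.append("hours is not a number")

    return problems


def find_duplicates(records):
    """Unique: return the indexes of records that repeat an earlier task/owner pair."""
    seen = set()
    duplicates = []
    for index, record in enumerate(records):
        # Ignore case and stray spaces so "Draft plan" and "draft plan " match.
        key = (
            str(record.get("task") or "").strip().lower(),
            str(record.get("owner") or "").strip().lower(),
        )
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


# Made-up sample rows for illustration only.
records = [
    {"task": "Draft plan", "owner": "Ana", "due_date": "2024-03-01", "hours": 4},
    {"task": "Review plan", "owner": "", "due_date": "03/05/2024", "hours": "two"},
    {"task": "draft plan ", "owner": "ana", "due_date": "2024-03-01", "hours": 4},
]

for row in records:
    print(row["task"], check_record(row))
print("duplicate rows:", find_duplicates(records))
```

In this sample, the second row would be flagged for a blank owner, an inconsistent date format, and a non-numeric hours value, and the third row would be flagged as a duplicate of the first. The same checks could just as easily be set up as data validation rules in a spreadsheet; the point is that each characteristic can be turned into a rule someone can actually test.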
The Bottom Line
The most important thing is having clean, correct data. For the data to be accurate, you and your team have to be consistent and on the same page about what data is being entered and exactly where it comes from. For instance, I told the company with the project plans that they need to be clear from the beginning about which fields they want to use in each task. What data do they want to be able to analyze? Is it division of labor (who has more tasks)? Is it the time it takes to get things done? Is it the progress of the tasks? Knowing this in advance allows the company to tell their staff what data should be entered and why.
Clean Data for All!
Even if you don’t have a company, you can do this personally. What do you want to track and improve upon in life? Think in advance about what you’d like to analyze and what data would let you do that, whether it’s your food intake, your exercise, your budget, or something fun like art classes. Then create an Excel spreadsheet or find an app that will help you with the tracking.
“Garbage In, Garbage Out”
George Fuechsel, an IBM programmer and instructor, is generally credited with coining the term “Garbage In, Garbage Out” in the early 1960s.
Want to know more? Check out this article from TechTarget!