The extent to which a dataset follows commonly expected guidelines often determines how much time you have left to think about your analysis. When working on data analysis you might aim to spend 20% of your time cleaning data, and the remaining 80% carrying out the actual analysis. But often, a messy, non-standardised dataset turns that ratio on its head.
The best way to avoid this pattern of work is to spread the word on what ‘clean’ data actually means, and to encourage wider adoption of best practices.
To assess the condition of your data, start by looking at consistency. Simple rules such as using the same code consistently across the database to indicate missing or unknown values, and using the same unit of measurement throughout an entire column go a long way to prevent future confusion. When naming related variables, follow the same naming scheme and do not include spaces within variable names.
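The rules above can be enforced in code. Below is a minimal sketch in Python: it maps every ad-hoc missing-value code to a single sentinel and converts a measurement column to one unit. The column names (“weight”, “unit”) and the particular missing codes are assumptions for illustration, not part of any standard.

```python
# Codes for "missing" often seen in the wild (an illustrative, not exhaustive, set).
MISSING_CODES = {"", "NA", "N/A", "null", "-99"}

def standardise_missing(value):
    """Map every ad-hoc missing-value code to a single sentinel: None."""
    if isinstance(value, str) and value.strip() in MISSING_CODES:
        return None
    return value

def to_kg(value, unit):
    """Convert a weight to kilograms so the whole column shares one unit."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.453592}
    return None if value is None else value * factors[unit]

rows = [{"weight": 154, "unit": "lb"}, {"weight": "N/A", "unit": "kg"}]
for row in rows:
    row["weight"] = to_kg(standardise_missing(row["weight"]), row["unit"])
```

After this pass, every cell in the column means the same thing: a number is always kilograms, and missing is always `None`.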
Using consistent formatting also matters, even for small details like spelling, capitalisation and the way dates are written. This will help to minimise time spent tidying these up down the road. Other careful formatting, such as removing any blank spaces padding actual values within cells, can make a big difference to the dataset.
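A sketch of these formatting fixes, assuming messy inputs with padded whitespace, mixed capitalisation and two hypothetical date layouts; the specific formats handled are an assumption for illustration:

```python
from datetime import datetime

def clean_cell(value):
    """Strip padding whitespace and normalise capitalisation."""
    return value.strip().lower()

def normalise_date(text):
    """Parse the date spellings seen in this (hypothetical) dataset into ISO format."""
    for fmt in ("%d/%m/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")

clean_cell("  Edinburgh ")       # -> "edinburgh"
normalise_date("14/03/2019")     # -> "2019-03-14"
normalise_date("14 March 2019")  # -> "2019-03-14"
```

Raising an error on an unrecognised date, rather than guessing, is deliberate: it surfaces the inconsistency instead of silently propagating it.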
Typically, there should also only be one ‘atomic’ value per cell, so that a column represents only one specific characteristic. For example, instead of having a full name in one cell, it should be split so that the surname goes in one column and the forename in another.
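The name-splitting example can be sketched as follows; it assumes names take the simple ‘Forename Surname’ shape, with the surname last (real names are messier, so treat this as an illustration only):

```python
def split_name(full_name):
    """Split one non-atomic 'full name' cell into forename and surname columns."""
    forename, _, surname = full_name.strip().rpartition(" ")
    return {"forename": forename, "surname": surname}

split_name("Ada Lovelace")  # -> {"forename": "Ada", "surname": "Lovelace"}
```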
Of course, clean data still needs clear documentation. For any dataset, an accompanying data codebook should be provided as a record of all variables within a database, along with documentation of what each variable means, its range of possible values and its column type – whether it is numeric or categorical.
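A codebook need not be anything elaborate; a minimal sketch below records it as a plain dictionary (the variables “age” and “smoker” are hypothetical) and uses it to validate incoming rows:

```python
# A hypothetical codebook: one entry per variable, stored alongside the data.
codebook = {
    "age": {
        "description": "Patient age at referral, in whole years",
        "type": "numeric",
        "range": (0, 120),
    },
    "smoker": {
        "description": "Self-reported smoking status",
        "type": "categorical",
        "values": ["never", "former", "current"],
    },
}

def validate(row, codebook):
    """Return the names of variables whose values fall outside the codebook."""
    errors = []
    for var, spec in codebook.items():
        if spec["type"] == "categorical" and row.get(var) not in spec["values"]:
            errors.append(var)
    return errors
```

Because the codebook is machine-readable, it documents the data and checks it at the same time.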
If multiple datasets are provided to the data scientist, clear indications should be given for whether there is a linking key between these, and what that key may be. For instance, if two datasets are required for a project, and one contains GP referral data while the other contains specialist care medical data, then the two datasets should contain a common key for patient IDs. This key should share the same meaning between datasets: patient ‘1234’ would refer to the same person across both datasets.
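The GP-referral example can be sketched as a simple join on the shared key; the column name `patient_id` and the record contents are assumptions for illustration:

```python
# Two hypothetical datasets sharing a patient-ID key with the same meaning.
referrals  = [{"patient_id": "1234", "referred_to": "cardiology"}]
specialist = [{"patient_id": "1234", "treatment": "angiogram"}]

# Index the specialist records by key, then link each referral to its match.
by_id = {row["patient_id"]: row for row in specialist}
linked = [
    {**ref, **by_id[ref["patient_id"]]}
    for ref in referrals
    if ref["patient_id"] in by_id
]
# linked[0] now combines both records for patient "1234"
```

The join only works because ‘1234’ means the same person in both datasets; without that documented guarantee, the linkage would silently produce nonsense.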
Overall, the steps towards cleaner data are often simple, and tend to be easier to implement correctly from the start than to adjust later. Following these simple guidelines will save you time so that you can focus more on the analysis. The two essential points for clean data are consistency and clear documentation. If you keep these things in mind, you’ll be on the way to using data more efficiently.
Dr. Caterina Constantinescu is a data scientist at The Data Lab, and is Chair of the DataTech organising committee. DataTech is a new DataFest event, happening on the 14th March, 2019. For more information on DataFest and DataTech, visit https://www.datafest.global/data-tech.