Module 20 – Dirty Data

In our previous module:

We gave you a quick run-down of various CIOOS tools.

In this module we will use one of the datasets from the Data Explorer tool to learn about dirty data.

Dirty Data

Dirty data is something that you may have encountered in your research career before, just without a specific name. Dirty data is any value that doesn't make sense and muddies the rest of the dataset. Dirty data often arises while downloading and transforming data from one format to another, or during the recording process if there are issues with the device taking measurements.

For example, a missing value or 'blank' is dirty data, and that blank could be caused by the machine losing power for a few moments. If not caught, dirty data can skew results and produce inaccurate graphs, among other things. Being able to recognize dirty data before it becomes a problem is an important skill.

This is important because dirty data often looks like regular data to machines, and therefore is not filtered out. In some cases it is subtle enough that many humans wouldn't recognize it at first glance. Dirty data is enough of a problem that ignoring it can compromise an entire project because of the inaccuracies it introduces into the dataset.

When creating your data management plan, part of the process must include a provision for dealing with dirty data: whose job will it be to find and clean the data, and how much time and resources will you need to budget for the process? While dirty data isn't inevitable, it is likely that you will encounter at least some of it during your project, simply because machines are imperfect. This is where Quality Assurance/Quality Control (QA/QC) comes in. Sometimes there will be a team within an organization dedicated to QA/QC that goes through and handles dirty data. However, if you have a small team or a small institution, you will likely have to do QC on your own.

Dirty Data Example

The following data is a very small selection from the Bay of Exploits buoy. If you'd like to see the full dataset, please check it out here. The excerpt has been modified to show examples of dirty data; the complete dataset is clean.

Activity:

Can you see what is ‘dirty’ about this data?

station_name        | time                 | wind_spd_max | surface_temp_avg
                    | UTC                  | m s-1        | degree_C
smb_bay_of_exploits | 2017-12-26T00:23:01Z | 14.1         | 1.27
smb_bay_of_exploits | 2017-12-26T00:53:01Z |              | 1.27
smb_bay_of_exploits | 2017-12-26T01:23:01Z | 15.3         | 1.28
smb_bay_of_exploits | 2017-12-26T01:53:01Z | -5           | 1.29
smb_bay_of_exploits | 2017-12-26T02:23:01Z | 18.1         | 3.5
smb_bay_of_exploits | 2017-12-26T02:53:01Z | 16           | -274

There are several places in this dataset where the data is dirty! The first and easiest to see is the blank space in the wind speed column. This dirty data could've been caused by the instrument failing to record the number, or by an error in translating the data to the table. In some cases, it is possible to fill in the missing data: if there were other buoys in the area, for instance, it would be possible to approximate what the wind speed was like at that time of day from their readings.
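That kind of gap-filling can be sketched in code. This is a minimal Python sketch, not CIOOS tooling: the `fill_gaps` helper and the neighbour-averaging rule are illustrative assumptions, and the values come from the modified excerpt above.

```python
# Hypothetical sketch: fill a single missing reading with the mean
# of its two neighbours. The fill_gaps helper is illustrative.
def fill_gaps(values):
    """Replace lone None gaps with the mean of their neighbours."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None and 0 < i < len(filled) - 1:
            left, right = filled[i - 1], filled[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled

wind_spd_max = [14.1, None, 15.3]  # the second reading failed
print(fill_gaps(wind_spd_max))     # roughly [14.1, 14.7, 15.3]
```

Note that averaging neighbours is only reasonable for a short gap in a slowly changing series; a long outage would need data from nearby stations instead.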

The other dirty data in the table is data that is present but doesn't make sense! There are three places in the table where this has occurred. The first two are very obvious: a wind speed of -5 m s-1 and a surface temperature of -274 degree_C are both impossible numbers. The first because wind speed cannot be negative, and the second not only because it is an absurd difference compared to all the rest, but also because it is colder than absolute zero (-273.15 degree_C), a temperature that is not possible anywhere in the universe.
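Impossible values like these can be caught automatically with a simple range check. The limits below (0 to 100 m s-1 for wind, -273.15 to 50 degree_C for temperature) are illustrative assumptions for this sketch; real QA/QC uses much tighter, locally tuned limits.

```python
# Hypothetical range check; the limits are illustrative assumptions.
def in_range(value, lo, hi):
    """Return True if value exists and falls inside [lo, hi]."""
    return value is not None and lo <= value <= hi

# (wind_spd_max, surface_temp_avg) rows from the excerpt above
readings = [(-5.0, 1.29), (18.1, 3.5), (16.0, -274.0)]
for wind, temp in readings:
    flags = []
    if not in_range(wind, 0.0, 100.0):
        flags.append("impossible wind_spd_max")
    if not in_range(temp, -273.15, 50.0):
        flags.append("impossible surface_temp_avg")
    print(wind, temp, flags or "ok")
```

A check like this catches the -5 and -274 rows, but not the subtler problem discussed next.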

The third, however, is a temperature that could reasonably exist, but it is significantly higher than the others in the non-dirty sections of the data. 3.5 is much higher, especially if you take in more of the data source, where it becomes apparent that the average temperature fluctuates within a narrow range that does not include such a large leap.
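One common way to catch this subtler kind of dirty data is a spike test that compares each value against the rest of the series. The 0.5 degree_C threshold below is an illustrative assumption for this sketch, not an official QA/QC limit.

```python
# Hypothetical spike test: flag any temperature that sits far from
# the median of the series. The threshold is an assumption.
import statistics

surface_temp_avg = [1.27, 1.27, 1.28, 1.29, 3.5]
median = statistics.median(surface_temp_avg)
suspect = [t for t in surface_temp_avg if abs(t - median) > 0.5]
print(suspect)  # [3.5]
```

The median is used rather than the mean because the spike itself would drag the mean upward and partly hide itself.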

The last kind of dirty data to think about is null data versus 0 data. A recording of 0 when a measurement wasn't taken (or couldn't be taken because of an instrument malfunction) is also dirty data, as that zero might be read as a real measurement instead of the null value or malfunction it actually represents. For this kind of dirty data it is important to understand how your instruments record data and whether null values are written as blanks or as zeroes. Keep this in mind when translating the data to different formats as well: when downloading to CSV, for example, ask whether doing so will record null values as blanks or as zeroes.
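The distinction matters in code too. Below is a minimal sketch using Python's standard csv module; the raw text is made up to mimic the wind-speed column above, where the instrument wrote a blank for the failed reading.

```python
# Hypothetical sketch: read a CSV column, keeping blanks as None
# rather than silently turning them into 0.
import csv
import io

raw = "wind_spd_max\n14.1\n\n15.3\n"
reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row

# A blank becomes None so it stays recognizably null downstream.
values = [float(row[0]) if row and row[0] != "" else None
          for row in reader]
print(values)  # [14.1, None, 15.3]
```

Had the blank been converted to 0 instead, it would have passed the range check above while quietly dragging down any averages.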

This kind of dirty data is more subtle, which is why data dictionaries are necessary, along with things like expected averages. Data dictionaries will help you determine whether data is suspect because it falls outside of the normal ranges.

A data dictionary is a kind of metadata. These dictionaries outline and define the data an instrument measures and the kind of data it outputs. It lists the relationship between each source of data as well as standard deviations. Many also outline rules for interpreting data outputs, allowing a researcher to feel more confident in deciding what is dirty data and what might be actual data that is simply exceptional.
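In code, a data dictionary's expected ranges can be expressed directly as a lookup table for validation. The variable names match the excerpt above, but the limits and structure below are illustrative assumptions, not CIOOS or QARTOD metadata.

```python
# A toy "data dictionary": expected ranges per variable. The limits
# are illustrative assumptions.
data_dictionary = {
    "wind_spd_max": {"units": "m s-1", "min": 0.0, "max": 60.0},
    "surface_temp_avg": {"units": "degree_C", "min": -2.0, "max": 30.0},
}

def is_suspect(variable, value):
    """Flag a value that falls outside the documented range."""
    spec = data_dictionary[variable]
    return not (spec["min"] <= value <= spec["max"])

print(is_suspect("surface_temp_avg", -274.0))  # True
print(is_suspect("wind_spd_max", 14.1))        # False
```

Keeping the ranges in one structure, rather than scattered through the cleaning code, means the QA/QC rules can be reviewed and updated alongside the rest of the metadata.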

In the ocean sector, Quality Assurance/Quality Control of Real-Time Oceanographic Data (QARTOD) is a QA/QC program created and helmed by U.S. IOOS (the United States Integrated Ocean Observing System). They provide manuals for different instruments that outline best practices for evaluating the quality of ocean data across different parameters, such as salinity or pH.

If you’d like to see some of the QARTOD Data Dictionaries they can be seen here.

Final Question

Is it possible to create a system where dirty data doesn’t happen? How perfect of a system is possible?