This module is an introduction to data management. It builds on the previous two modules, briefly covering a few data principles and giving examples of common data management pitfalls and how they can be avoided.
In our previous module we asked:
Why is it important to think about other uses for your data, and what could you do with it?
Knowing the answers to these questions will improve not only your communication with other researchers, especially if you want to share your data, but also your ability to replicate and build on your previous research going forward. Expecting yourself to remember everything about a previous project is unrealistic and unfair to yourself, so thinking about communication between your past and future self when it comes to data is incredibly important!
Why is data management important?
In a way, data management is the language and process we use to ensure our research is properly contextualized, understood, and recorded. Good data management allows data to be shared, re-used, and replicated efficiently, helping to make sure your data projects stay relevant for years to come!
Video Activity
Please enjoy this short video about data misunderstandings that illustrates why data management is so important to think about.
Watching this video, it is easier to understand what we mean when we say that data management is almost a language. It was easy to see that the two researchers were not on the same page, and that one was talking in circles around the other. Creating a good research environment will help you feel more confident in your own research and make it easier for future researchers to understand, share, or build upon your work.
As mentioned before, thinking about how you communicate about data can help you with your own research. Avoiding the pitfalls in the video you just watched means you will be able to go back to your previous research more easily, use parts of the data in other ways, and more easily replicate or compare experiments.
Data Management Best Practices
Follow these data best practices, and you’ll be swimming past most pitfalls like a champ!
I. Multiple data storage solutions
It is important to have data stored in more than one place. While it may be easier and tempting to keep your data solely on your laptop during the project, it is best practice to have a data backup. Creating a plan at the outset of the project to manage and continue the backups once the project is done will ensure the data persists afterwards. Considering if and how the project data may be used in the future, and whether the data will be shared in a public repository, is important during the planning stage of the project or research.
Consider potential long-term data storage solutions. Where, how, and when will your data be stored? Some questions and considerations for your data storage: if you live in a rental and know you might move in less than five years, that could change the options you choose for long-term data storage, compared to if you own a house. Or if you are graduating soon and all the data will need to go back to the school, that will change how you decide to store your data for the future. What online or cloud storage options are available and approved by your institution? And what public repositories may be a good fit for your data if you are required to provide, or would like to provide, open access to the data?
II. Use standardized labels, or create descriptive labels
In a later module, we will cover controlled vocabularies and labelling schemas in detail. Using standard labels will help make your data interoperable, meaning it will be easier for your data to be used and re-used in additional future applications. This is especially helpful for datasets. If you end up wanting to compare your dataset to another dataset, the process will be faster and easier if you used a standardized name for your variables, in a standardized format, such as surface_temp_avg_degree_C to refer to the average temperature of the ocean surface in degrees Celsius. If both datasets have the same variable names, it is much easier to use common software tools to compare and graph them, as the sketch below shows.
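For example, here is a minimal sketch in Python using the pandas library of how matching variable names let two datasets be combined with almost no manual mapping. The file and column names are hypothetical, invented for illustration.

```python
import pandas as pd

# Two hypothetical survey files that both use the standardized
# variable name from the example above.
ours = pd.read_csv("our_survey.csv")      # columns: date, surface_temp_avg_degree_C
theirs = pd.read_csv("their_survey.csv")  # columns: date, surface_temp_avg_degree_C

# Because the variable names match, the datasets line up immediately.
combined = ours.merge(theirs, on="date", suffixes=("_ours", "_theirs"))
print(combined.head())
```

If the two files used different names for the same variable, you would first have to work out that the columns mean the same thing, and then rename one of them by hand.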
However, if you do not know the standardized labels, or none exist, it is better to be overly detailed but contextually clear than under-detailed. Ideally, someone with the relevant background should be able to understand the data at sight, without having to double-check with you for additional context. So it is far better to give a data column a long name such as ‘wind speed at 3 am in North End Halifax’ than to name it ‘wind 3’ just to keep it short. A detailed, contextually clear label leaves no room for misunderstanding what the data is, and that level of detail will make it easier to standardize later, as sketched below.
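As a small, hypothetical illustration of that last point, a descriptive label maps unambiguously onto a standardized one later, while a cryptic label leaves you guessing:

```python
import pandas as pd

# Hypothetical raw data with a descriptive (if long) column name.
df = pd.read_csv("halifax_weather.csv")

# The descriptive name tells us exactly which standardized label to use;
# a column called 'wind 3' would force us to dig for that context.
df = df.rename(columns={
    "wind speed at 3 am in North End Halifax": "wind_speed_3am_north_end_halifax",
})
```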
Standardized and detailed labels will also help you in the future if you come back to research you did earlier, especially if you want to show the research to someone else, or have to explain it to a potential employer or in a post-secondary application.
III. Keep multiple copies of the raw data
If you produce an excellent research report or paper, other researchers may want to reuse your original raw research data. It is therefore important to retain copies of your raw data, not only the processed data. You may need to publish the raw data when submitting to a scientific journal, or to make it available in an open data repository to satisfy the requirements of your funding agency.
IV. Use FAIR and CARE
Following the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles will help you implement the best practices that let others find and build on your data. The FAIR principles focus on increasing sharing between researchers by making sure your data can be found by machines on the web, while the CARE principles focus on increasing and affirming Indigenous data sovereignty and making sure that Indigenous data is used ethically.
V. Submit your data to a repository
Repositories store, manage, and share datasets from one or many databases. A repository is by far the best way to share and preserve your data long term. CIOOS, the ocean organization behind this tutorial, is in fact itself a repository and uses industry-standard conventions. We’re a great host for your data! Another repository to consider is OBIS, which hosts over 4500 datasets and one of the world’s largest knowledge bases of ocean biodiversity data.
VI. Implement metadata
Metadata will be covered in more detail in modules 16, 17, and 18 in Thing 6: Managing your metadata. For now, what is important to know is that metadata is data about data. Following a relevant metadata schema will help you organize your data and will aid in making it findable by machines once it is shared on the web. One ocean-relevant metadata schema is Darwin Core, which is organized around biological taxa. Another incredibly important ocean metadata standard is the Climate and Forecast (CF) conventions, which provide the ability to describe the temporal and spatial properties of data; a small sketch follows below.
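To make the CF idea concrete, here is a minimal sketch in Python using the xarray library. The values are invented for illustration; the point is that CF-style attributes such as standard_name and units, rather than the numbers themselves, tell a reader (or a machine) what the variable actually is.

```python
import numpy as np
import xarray as xr

# Hypothetical daily sea surface temperature readings.
temp = xr.DataArray(
    data=np.array([12.1, 12.4, 12.3]),
    dims="time",
    coords={"time": np.array(
        ["2024-07-01", "2024-07-02", "2024-07-03"], dtype="datetime64[ns]"
    )},
    attrs={
        "standard_name": "sea_surface_temperature",  # a real CF standard name
        "units": "degree_C",                         # CF-style units string
        "long_name": "Average daily sea surface temperature",
    },
)
print(temp)
```

Software that understands the CF conventions can use these attributes to label plots, convert units, and match this variable against the same quantity in other datasets.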
Summary
Implementing these best practices is a great start for managing any data you might work with, whether it’s data for a personal project or one you do for an institution.
The pitfalls we discussed can be mitigated with a plan, a data management plan to be precise. The practices above are just a few of the things to consider when planning how to use your data in a project.
In our next module we will learn more about data management plans (DMPs) and how each future module relates to the different parts of a DMP. If you had to guess, what do you suspect is the most important element of a data management plan?
Before you go! Please consider for the next module:
Can you recognize any other data pitfalls from the example video shown here? How would you combat them?