Module 9 – Data Discovery

This module will be a quick tutorial about database searching, Boolean operators and where you can find more information. 

In our last module…

We left you all with the activity of searching out an ocean data organization and reaching out to them about the work they do. How did this search go for you? Was it easy to find relevant organizations? Or did it take longer than you expected? Did you manage to find the relevant person to talk to and how did that talk go?

One thing that can be surprisingly complex is running good online searches that bring up accurate and relevant results. It’s not uncommon to hear about someone spending hours searching online and still not finding what they need.

Even just knowing a few simple search tricks can make searching for data in databases much more efficient! One of these simple tricks is to employ Boolean operators in searches.

For those of you with a coding or computer science background, you likely already know what a Boolean expression or operator is! For those who do not, here is the quick definition:

Boolean Operators are a set of standard words/phrases that can be used in search engines to help narrow down a search and make things easier to find.

The main Boolean Operators are:

AND – use this operator to let a search engine know that it should return only results that have the first term and the second term in them.

OR – use this operator for results that have either only the first term, only the second term or the first and second term together. OR is also useful for capturing results that might be referred to by a number of synonyms. One example is teenagers. Research talking about teens could use any of these terms: teens, teenagers, youth, young adults, adolescents and you’d likely want to capture all of them. This is where OR can be useful.

NOT – use this operator to exclude any results that have the term that follows the NOT.

Make sure to use all-caps for AND, OR and NOT, as this helps the search engine read them as operators rather than as ordinary words.

Note: Google tends to default to ‘AND’ between all terms in a search, even if it isn’t explicitly written out.
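For readers who like to see the logic spelled out, the three operators behave much like set operations. The sketch below is only an illustration in Python: the four-record catalogue and the helper name `matching` are made up for this example, and real search engines do far more (ranking, subject tags, spelling variants).

```python
# Hypothetical mini-catalogue: record title -> words it is indexed under.
# These records are invented for illustration only.
catalogue = {
    "Blue lobster genetics": {"blue", "lobster", "genetics"},
    "Lobster buyer's guide": {"lobster", "guide"},
    "Kind of Blue": {"blue", "music"},
    "Tuna fisheries report": {"tuna", "fisheries"},
}


def matching(term):
    """Return the set of titles indexed under the given term."""
    return {title for title, words in catalogue.items() if term in words}


blue = matching("blue")
lobster = matching("lobster")

both = blue & lobster      # blue AND lobster: records with both terms
either = blue | lobster    # blue OR lobster: records with at least one term
not_blue = lobster - blue  # lobster NOT blue: lobster records without blue

print(sorted(both))
print(sorted(either))
print(sorted(not_blue))
```

Notice that AND can only shrink the result set, while OR can only grow it.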

Here’s an example of AND vs OR

You’re doing research on a rare mutation in lobsters that makes them blue. If you search for Blue OR lobster, you would get the results listed below.

This example is taken from Novanet and shows the first five search results. Novanet is the primary search engine for library holdings from libraries across Nova Scotia and beyond. Dalhousie is just one of a consortium of libraries working together to connect Maritime post-secondary libraries and improve students’ access to resources.

  • Canadian Atlantic Lobster: Buyer’s guide
  • The blue moment : Miles Davis’s Kind of blue and the remaking of modern music
  • Blue Spirit the Blue Note all stars
  • Report of the commissioners appointed by His Excellency the governor general in council, of date 4th July, 1887, to enquire into and report upon the lobster and oyster fisheries of Canada with appendices
  • 40 years under the blue : a history of Blue Cross of Atlantic Canada

Since this search used ‘OR’, it brings up everything that has blue in it and everything that has lobster in it.

Now compare that to Blue AND lobster, where you would get:

Top five Novanet results:

  • Bunny [this is a novel]
  • Short people: stories [a collection of short stories]
  • Wicked good barbecue : fearless recipes from two Damn Yankees who have won the biggest, baddest bbq competitions in the world
  • Sea kayaking in Nova Scotia : a guide to paddling routes along the coast of Nova Scotia
  • Spatial processes and management of marine populations : proceedings of the Symposium on Spatial Processes and Management of Marine Populations, October 27-30, 1999, Anchorage, Alaska

You may be wondering why the first two results are fiction. After all, we asked for only results with both blue and lobster in them. What this search did not take into account is that it was a subject search. That is why the first result was a novel: its subjects were tagged with both blue and lobster.

There are a few ways to help combat this, and narrow your search to find what you are actually looking for.

The first is to use the NOT operator and add NOT fiction to the search. But if a novel isn’t tagged as fiction, it will slip past the operator. Filters can also help: if you filter out books, the third result is about blue lobsters.

Another thing that can help here is changing the search parameters to search by title. Most journal titles or dataset titles try to capture the most important pieces of information, even if it means a very long title. It is reasonable to expect to find blue lobster in the title.

Finally, the last trick is exact-phrase searching, which you specify with quotation marks. If you search for: blue ocean lobster, most search engines will be smart enough to look for places where that phrase shows up, but they will also look for results that just have blue, or ocean, or lobster in them, in any order! If you want results that have those exact three words in that exact order, use double quotation marks like this: “blue ocean lobster”. This tells the search engine that the exact phrase, in that specific order, is what you want to search for.
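The difference between a loose search and an exact-phrase search can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine works: the function names and the three titles are invented for the example.

```python
def any_word_match(query, text):
    """True if at least one query word appears anywhere in the text."""
    text_words = text.lower().split()
    return any(word in text_words for word in query.lower().split())


def exact_phrase_match(phrase, text):
    """True only if the words appear together, in the same order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# Invented titles, for illustration only.
titles = [
    "Blue ocean lobster survey",
    "Lobster fishing in the blue ocean",
    "Ocean temperature records",
]

loose = [t for t in titles if any_word_match("blue ocean lobster", t)]
exact = [t for t in titles if exact_phrase_match("blue ocean lobster", t)]

print(len(loose), "loose matches")
print(exact)
```

The loose search returns all three titles, because each contains at least one of the words. The exact-phrase search keeps only the title where the three words sit side by side in order.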

Combining the Operators Together

Like a math equation, it can sometimes be necessary to add brackets and make sure that things are grouped properly and read properly by the machine!

Here’s an example of combining the operators, using the keyword lobster and some EOVs:

lobster AND (“subsurface temperature” OR “surface temperature”)

What this means is that the searcher wants results that have lobster somewhere in them and that also have either subsurface temperature or surface temperature, each with the words in that exact order.

Using all the operators together in a search engine or database search can allow for a high degree of specificity!
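To see why the brackets matter, here is a small Python sketch that evaluates the lobster query two ways. The record titles and function names are hypothetical, and how a real search engine reads a query without brackets varies from one engine to another; the "ungrouped" reading below is just one plausible assumption.

```python
def has_phrase(text, phrase):
    """True if the phrase's words appear together, in order, in the text."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def grouped(text):
    # lobster AND ("subsurface temperature" OR "surface temperature")
    return has_phrase(text, "lobster") and (
        has_phrase(text, "subsurface temperature")
        or has_phrase(text, "surface temperature")
    )


def ungrouped(text):
    # Assumed reading without brackets:
    # (lobster AND "subsurface temperature") OR "surface temperature"
    return (
        has_phrase(text, "lobster")
        and has_phrase(text, "subsurface temperature")
    ) or has_phrase(text, "surface temperature")


# Invented record titles, for illustration only.
records = [
    "Lobster catch and surface temperature trends",
    "Surface temperature of inland lakes",
    "Lobster moulting and subsurface temperature",
]

grouped_hits = [r for r in records if grouped(r)]
ungrouped_hits = [r for r in records if ungrouped(r)]

print(grouped_hits)
print(ungrouped_hits)
```

With brackets, the lake record is excluded because it never mentions lobster. Without them, under the assumed reading, it slips into the results.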

However, some search engines don’t use brackets and instead require you to ‘build’ searches one step at a time, where each step is one part of the equation. This is especially necessary for synonym-heavy searches.

Examples

Say you want to search for the effects social media has on teenagers. You’ll want to capture a wide range of results, so the first thing to do is think up synonyms that other researchers might have used:

  • Teenagers: young adults, adolescents, high schoolers
  • Social media: social sites

Then you would put each search into the search bar, one at a time. First, search:

  • Teenagers OR “young adults” OR adolescents OR “high schoolers”

Then search:

  • “Social media” OR “social sites”

Finally, you would click the search builder on the website and combine the two searches. It will properly format them with an AND operator, so that you get results that mention both a term for teenagers and a term for social media.
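If you ever need to assemble a query like this yourself, for a search engine that does accept brackets, the steps above can be sketched in Python. The helper names `or_group` and `build_query` are made up for this example, and the exact syntax a given database accepts may differ.

```python
def or_group(terms):
    """Join synonyms with OR, quoting multi-word terms as exact phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*synonym_lists):
    """Combine each bracketed OR group with AND."""
    return " AND ".join(or_group(terms) for terms in synonym_lists)


teen_terms = ["teenagers", "young adults", "adolescents", "high schoolers"]
media_terms = ["social media", "social sites"]

query = build_query(teen_terms, media_terms)
print(query)
```

This prints one query with two bracketed groups joined by AND, which mirrors what a search builder does behind the scenes.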

Activity:

What kind of searches would you conduct for a question like this: What are the yearly gross profits of tuna fishing vessels on the east coast compared to the west coast?

Beyond knowing Boolean operators, another important trick is understanding the usefulness of repository guides and database guides.

Databases

Most of the previous examples have concerned searching in general, including searching for databases. Many of these tricks can also be used within the database searches themselves.

One trick to help you with searching within databases is understanding the data management system they are built on. Take CKAN, for example. This Data Management System (DMS) is open source and powers hundreds of different portals around the world. This means that although the portals look different, many will have similar functions.

However, some really useful databases can be difficult to find, even when looking on the main site that hosts them. CKAN powers the Statistics Canada database search. Using the tips and tricks here, you can first find the StatsCan website and then, once on it, use the Boolean operators we explored earlier to find datasets more easily.

CKAN follows best practice for data sharing and has a specific metadata schema and controlled vocabulary. This makes it easier for datasets across hundreds of organizations worldwide to interoperate! So you could, for example, search for something like ‘population density’ in both a Canadian and an American database, see if there are similar datasets and compare them efficiently.
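Because CKAN portals share the same software, they also share the same search interface for programs: the CKAN Action API, whose `package_search` action takes a `q` (query) and a `rows` (number of results) parameter. The Python sketch below only builds the search address and does not send it. The portal address `https://ckan.example.org` is a placeholder, not a real portal, so substitute the base address of the portal you want to search, and check that portal's own documentation for details.

```python
from urllib.parse import urlencode

# Placeholder address (assumption): replace with a real CKAN portal's base URL.
PORTAL = "https://ckan.example.org"


def package_search_url(portal, query, rows=5):
    """Build a CKAN Action API dataset-search URL without sending it."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal}/api/3/action/package_search?{params}"


# Quotation marks ask for the exact phrase, as in the search tips earlier.
url = package_search_url(PORTAL, '"population density"')
print(url)
```

Running the same function with two different portal addresses gives two comparable searches, which is the practical benefit of many organizations sharing one system.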

When searching for data, here are some good databases to start with:

Government of Canada Open Data: This is where you can find datasets that are open access from the Canadian government.

Google Dataset Search: Finds any dataset that Google has indexed across many websites and other databases. Keep in mind that not all of these datasets will be accessible. Google sometimes finds data that is behind a paywall or only accessible to members of a particular organization.

Doing all this research takes time, even with the best tricks in the book! When searching for datasets or other information for your research project, it is important to include in your data management plan (DMP) how much time you will allocate to searching and what resources you will need.

Understanding what goes into searching makes it more apparent how easily little things can cause data to be completely unfindable. If a researcher spells a term wrong when they are submitting their data, then a machine will not be able to find it, unless of course the researcher also submitted synonym terms and those were spelled correctly. This kind of redundancy makes searching easier!
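Here is a tiny Python sketch of that situation. The dataset record, its misspelled keyword and the function name `findable` are all invented for illustration, and the matching is deliberately strict (exact keyword matches only), which is stricter than many real search tools.

```python
# Hypothetical dataset record whose main keyword was misspelled on submission.
record = {
    "title": "Seasonal catch data",
    "keywords": ["lobstr", "crustacean", "Homarus americanus"],
}


def findable(record, search_terms):
    """True if any search term exactly matches one of the record's keywords."""
    keywords = {k.lower() for k in record["keywords"]}
    return any(term.lower() in keywords for term in search_terms)


found_by_main_term = findable(record, ["lobster"])
found_with_synonyms = findable(record, ["lobster", "crustacean"])

print(found_by_main_term)
print(found_with_synonyms)
```

The search on lobster alone misses the record because of the typo, while the search that also tries a synonym finds it through the correctly spelled keyword.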

This is another reason why it is important to know how to talk about data with researchers and organizations: when all else fails with searching, you may need someone to explain where to find the data you are looking for. A fellow researcher may be able to point you in the right direction, or provide a direct connection to the data you are looking for!

It is also why it is so important, and best practice, to have controlled vocabularies and well-organized metadata. No matter how good a searcher you are, if the metadata is poor, you are not going to find the dataset!

With these search tips you should be well on your way to being able to comfortably prepare for data discovery during your project and to put it in your DMP.

Before you go! Things to consider for the next module:

Reflect on whether Boolean operators make you feel more or less comfortable doing searches going forward. Is this the best way to find data? Are there other options, like filters, or dedicated sites that you feel could fulfill this role of searching better?