Quality Data Part 2: Coverage and Consistency
The only limits on our data are our strict quality standards.
This blog is part of a bigger series. Missed part one? Check it out here.
Quality data is never the result of a single metric or of executing perfectly on one dimension of quality. Rather, it is the sum of many crucial elements working together. In other words, none of our quality metrics can be treated as the be-all and end-all of high-quality data.
It’s also worth remembering that one user may define quality data quite differently from another, so we can’t rely too heavily on any single metric. For these reasons, we carefully consider many elements of quality.
For the rest of the year, we’ll be taking a deep dive into numerous dimensions of quality, how we define quality at DrugBank, and why we think you should care. This month we’ll be exploring what we mean when we say that DrugBank data has quality coverage and consistency.
If you haven’t read last month’s introduction to our philosophy of quality data, we suggest you do that now so you can get a picture of all the dimensions of quality we’ll be covering in this series. But if you’re a keener and you’ve already read it, let’s jump right in.
Quality through Coverage
What is it?
In its simplest form, quality coverage at DrugBank means that we ensure our data covers the biomedical knowledge necessary to solve the problems at hand.
How do we do it?
To achieve appropriate coverage, DrugBank doesn’t simply attempt to collect the most information. We aim to collect the most information while also checking every piece for accuracy and relevance. Theoretically, you could collect every single piece of data the world has to offer and have perfect coverage, but that data could be unusable or misleading if it isn’t carefully vetted. When we seek out data sources, we work hard to collect and maintain coverage of data that is both true and valuable.
Next, we continuously ensure all of our data is well maintained and up to date. Attaining quality coverage isn’t a one-time achievement but an ongoing process. We have designed proprietary AI that constantly analyzes and seeks out new information to add to and improve our coverage. We then bring human expertise into the loop to provide feedback and vital checks and balances, and to verify the accuracy of every piece of information that our AI brings in. Our team also spends their days authoring novel content and supplementing our knowledge base with information they find through their own reading and research. This multi-dimensional approach helps us achieve the greatest coverage possible.
Why should you care?
First, we understand that each of our users has unique needs, and the data that one researcher or clinician needs can differ greatly from what the next one needs. For this reason, we sweat the details and obsess about overall coverage and consistency to ensure that no matter what you’re looking for, we have it and you can trust it.
We also see how quickly biomedical information is growing and know how unmanageable a task it is for our users to source, analyze, and compile high-quality data on their own. This growing body of evidence is becoming increasingly difficult to use, and when faced with such an overwhelming amount of data, it can feel impossible to tell evidence-based, useful information apart from what is contradictory and distracting.
We focus on maintaining non-redundant information, which removes the burden of extracting meaning from the depths of all the data available in the world. DrugBank’s coverage goes beyond maintaining a vast scope of data: it also helps bridge gaps in knowledge. Each additional connection creates the possibility of stronger evidence-based decisions and more confidence in research outcomes.
Quality through Consistency
What is it?
At DrugBank, data consistency means that our customers can trust and expect that equivalent data will be presented the same way regardless of how, when, and where it is consumed. We ensure this by normalizing external sources and connecting the same data together rather than storing it redundantly. These connections are maintained through rigorous processes, regardless of who creates or integrates information on our team.
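To make "connecting the same data together rather than storing it redundantly" a little more concrete, here is a minimal Python sketch. It is an illustration only, not DrugBank's actual pipeline: the `canonical_key` and `link_records` helpers, the record fields, and the source names are all hypothetical.

```python
from collections import defaultdict


def canonical_key(name: str) -> str:
    """Normalize a name so case and whitespace variants compare equal.

    Hypothetical rule for illustration: lowercase and collapse whitespace.
    """
    return " ".join(name.strip().lower().split())


def link_records(records):
    """Group the sources that mention the same entity under one key.

    Each entity is stored once, with links back to every source,
    instead of keeping one redundant copy per source.
    """
    linked = defaultdict(list)
    for record in records:
        linked[canonical_key(record["name"])].append(record["source"])
    return dict(linked)


# Made-up records from two hypothetical external sources.
records = [
    {"name": "Acetaminophen", "source": "source_a"},
    {"name": "  acetaminophen ", "source": "source_b"},
    {"name": "Ibuprofen", "source": "source_a"},
]

linked = link_records(records)
print(linked)
```

In this sketch both spellings of the first name end up under a single key, so a correction made once applies everywhere that entity is used.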
How do we do it?
For us, consistency is about rigour. It’s about having relentless standards for our data’s completeness and structure, and then doing everything it takes to deliver on them every day.
In a lot of ways, consistency is a direct result of the fundamental activity of structuring data. At DrugBank we’ve established a strict process for standardizing our data, one that reliably translates third-party information into a similar and predictable format. We will cover data structure in more depth later in this blog series, but it’s worth mentioning here because without it our data cannot be consistent.
Basically, for our data to take the same shape across datasets and sources, it must all be structured the same way. This ensures that any piece of data you access will look and behave like any other. It also lets you manipulate our data and put it to work reliably across datasets.
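As a rough illustration of translating third-party information into one predictable format, the Python sketch below maps two differently shaped inputs onto a single record type. Everything in it is an assumption made for the example: the `DrugRecord` type, the two source layouts, and the field names do not describe DrugBank's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    """Hypothetical common shape that every source is translated into."""
    name: str
    strength_mg: float


def from_source_a(raw: dict) -> DrugRecord:
    # Hypothetical source A stores strength as text such as "500 mg".
    value, unit = raw["strength"].split()
    if unit != "mg":
        raise ValueError(f"unexpected unit: {unit}")
    return DrugRecord(name=raw["drug_name"].strip().title(),
                      strength_mg=float(value))


def from_source_b(raw: dict) -> DrugRecord:
    # Hypothetical source B stores strength as a number in grams.
    return DrugRecord(name=raw["label"].strip().title(),
                      strength_mg=raw["strength_g"] * 1000)


a = from_source_a({"drug_name": "acetaminophen", "strength": "500 mg"})
b = from_source_b({"label": "ACETAMINOPHEN ", "strength_g": 0.5})

# Once both are in the common shape, they can be compared directly.
print(a == b)
```

The design point is that each source gets its own small translator, while everything downstream only ever sees the one common shape.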
Another vital element of consistency is completeness. Because some of the data and evidence we collect comes from third parties, it is important that once we’ve verified and structured it, we then assess its completeness. With the help of automation, each new piece of data is reviewed by a minimum of two in-house experts who use multiple sources to identify inconsistencies, mistakes, and gaps in the information. Then our in-house experts, our Curation Team and Data Review Specialists, look for ways to fill those gaps and resolve the inconsistencies.
This process also helps us anticipate future problems, needs, or changes that must be addressed to maintain quality and consistency in our datasets.
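The automated part of a completeness check can be pictured with a short Python sketch like the one below. It is only a sketch under stated assumptions: the list of required fields, the record layout, and the `find_gaps` and `build_review_queue` helpers are hypothetical, and the step of experts reviewing and filling the gaps is not modelled.

```python
# Hypothetical list of fields a record must have to count as complete.
REQUIRED_FIELDS = ("name", "indication", "dosage_form")


def find_gaps(record: dict) -> list:
    """Return the required fields that are missing or empty."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def build_review_queue(records) -> dict:
    """Map record id to its gaps, for records that need expert attention."""
    queue = {}
    for record in records:
        gaps = find_gaps(record)
        if gaps:
            queue[record["id"]] = gaps
    return queue


# Made-up records for illustration.
records = [
    {"id": 1, "name": "Ibuprofen", "indication": "Pain",
     "dosage_form": "Tablet"},
    {"id": 2, "name": "Acetaminophen", "indication": "",
     "dosage_form": None},
]

review_queue = build_review_queue(records)
print(review_queue)
```

Here the second record would be flagged for its empty indication and missing dosage form, which is the kind of gap a reviewer would then work to fill from other sources.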
Why should you care?
Consistent data saves you from frustration and lets you focus your time and resources on what you do best. When data is consistent, it is easier to use and can be integrated more rapidly into research or applications. And when data is easier to use, you can extract meaning from it faster and arrive at reliable results or evidence-based decisions sooner.
By maintaining well-organized datasets, we are working to enable better decision-making by reducing errors and lowering the risks associated with unreliable information. But again, consistent data is only as useful as it is accurate and up to date with the most relevant and valuable evidence.
Data quality is about more than perfecting a single metric. It’s about a broader, more flexible set of interwoven dimensions of quality working together to solve problems. And depending on a specific user’s needs or the problems they are trying to solve, the metrics they prioritize can skew in a number of different directions.
At DrugBank we’re not aiming to unlock some absolute form of quality. Instead, we are intent on obsessing over our customers’ needs so that we can offer a multi-dimensional approach to quality that equips them with the best tools to solve their problems.
As we’ve discussed, quality comes from having the right level of coverage within your data and strong consistency built into it. These two elements ensure you have the range of information you need as well as reliable, usable data you can trust.
Check back next month for our next deep dive into quality data where we’ll be exploring the importance of cross-referenced data and the impact it has on reliability and usability.