The important role of data governance in high-quality models
Data is arguably the most important asset for any organization that is looking to implement an AI system. In order to get the most out of your AI system, it is important to have proper data governance throughout the entire lifecycle of your data. This will help you achieve better results and ensure that you avoid biased or inaccurate outcomes.
Data is the backbone of AI
For a long time, improvement in AI and machine learning outcomes has been centered around algorithms, with the simple idea that improving an algorithm is the best way to get improved results. The generally accepted elements of an AI solution look like this:
AI system = Data + Code (model/algorithm)
In the traditional model-centric approach, you gather a whole bunch of data and then iterate on the model to handle any noise in the data and get the best results possible. Now, the idea of a data-centric approach is gaining support, where you instead keep the model/algorithm the same but iteratively improve the data set you are working with.
Data is one of the most important assets a company can have and is the backbone of any machine learning or AI application. Having good data will help to produce great model results that you and your organization can truly benefit from. But an important component of good data is proper data governance.
Data governance means ensuring that you have procedures and policies in place throughout the data lifecycle. This can include rules that define how you input and maintain data, such as how it is labeled and stored. Data governance also covers how you enforce those rules and ensure organization-wide compliance.
Data governance is more than just policy. It ensures that data is collected in a secure and thoughtful manner, while also ensuring that this data is used for the right purposes and applied to the correct problems.
Bad Data Governance = Bad Results
Lax data governance continues to be the root of many problems, including biased results and inaccurate findings. Good data governance can benefit companies since it helps to ensure that the data used is consistent and the results can be trusted.
For example, public “Frankenstein” style data sets appear when a data set is composed of several other data sets and distributed under a new name. Public data sets can be a way to get more data for machine learning research, but they create issues when overlapping or even identical data taken from different sources is presumed to be distinct. In addition, if a research model is only tested on a subset of its training data, the results don’t show that it would be viable in a real-world setting.
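The overlap problem described above can be checked for before training. Below is a minimal sketch, assuming text records and exact matching after light normalization. The function names and the sample data are hypothetical, and real image or near-duplicate data would need a more tolerant comparison than a hash.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def fingerprint(text: str) -> str:
    """Hash a record after light normalization so trivial variants collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_cross_source_duplicates(sources: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each fingerprint seen in more than one source to those source names."""
    seen = defaultdict(set)
    for name, records in sources.items():
        for record in records:
            seen[fingerprint(record)].add(name)
    return {fp: names for fp, names in seen.items() if len(names) > 1}


def leaked_test_records(train: List[str], test: List[str]) -> List[str]:
    """Return test records whose fingerprint also appears in the training set."""
    train_fingerprints = {fingerprint(record) for record in train}
    return [record for record in test if fingerprint(record) in train_fingerprints]


# Hypothetical sample data, for illustration only.
sources = {
    "dataset_a": ["A cat on a mat", "A dog in a park"],
    "dataset_b": ["a cat  on a mat", "A bird on a wire"],
}
duplicates = find_cross_source_duplicates(sources)
leaks = leaked_test_records(train=sources["dataset_a"], test=sources["dataset_b"])
print(len(duplicates), leaks)
```

Running a check like this when data sets are merged, and again when the train/test split is made, gives you a record of which sources overlap instead of an assumption that they are distinct.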
Proper data governance also includes ensuring that the data you’re training your computer vision models on is properly labeled. OpenAI’s computer vision system can be fooled by a very simple trick: writing the name of one object and placing that text onto another object. The system then identifies the image as what is written in the text, not the actual object in the image. This “typographic exploit” succeeded because OpenAI’s model was trained on 400 million indiscriminate image-text pairs scraped from the internet (“OpenAI’s state-of-the-art machine vision AI is fooled by handwritten notes” from The Verge).
Figure: An example from OpenAI where the addition of text leads to the wrong labeling of an object.
Researchers found pervasive labeling errors in some of the most commonly used AI data sets, including ImageNet. Estimated error rates vary widely by data set, with roughly 6% of the ImageNet validation set mislabeled, which means that some of our “best” computer vision models can actually be identifying objects incorrectly. (“Error-riddled data sets are warping our sense of how good AI really is” from MIT Tech Review).
In both of these situations, the root cause is poor training data caused by poor labeling. Poor labeling introduces noise that is hard to control, and in practice that noise leads to misrecognition and inaccurate answers. This matters because computer vision goes beyond recognizing images on the internet: poor data can have negative real-world consequences when it is used in a field such as medical imaging for diagnostics.
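One inexpensive way to surface label noise is to look for items that appear more than once with different labels. The sketch below assumes each record is an (item ID, label) pair. The function names and sample records are hypothetical, and this is only one of many possible audits, not the method used in the studies cited above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def conflicting_labels(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return items that were given more than one distinct label."""
    labels_by_item = defaultdict(set)
    for item_id, label in records:
        labels_by_item[item_id].add(label)
    return {item: labels for item, labels in labels_by_item.items() if len(labels) > 1}


def conflict_rate(records: Iterable[Tuple[str, str]]) -> float:
    """Share of distinct items that carry conflicting labels."""
    records = list(records)
    items = {item_id for item_id, _ in records}
    if not items:
        return 0.0
    return len(conflicting_labels(records)) / len(items)


# Hypothetical sample records, for illustration only.
records = [
    ("img_001", "cat"),
    ("img_001", "dog"),
    ("img_002", "apple"),
    ("img_002", "apple"),
    ("img_003", "car"),
    ("img_004", "tree"),
]
conflicts = conflicting_labels(records)
rate = conflict_rate(records)
print(conflicts, rate)
```

Items flagged this way are good candidates for manual review, since at least one of their labels must be wrong.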
Ensuring proper governance
To start off on the right track, there are a few steps you can take to get the most out of your data:
- Know where your data is coming from
- Know what your data is made up of and recognize any limitations before using it
- Ensure that it is properly labeled and cleaned
- Maintain a data pipeline that is easily trackable and easy to follow
- Establish rules on how you collect data, store it, and use it in models
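The steps above can be sketched as a small record that travels with each data set, plus a check that reports what is still missing. This is a minimal illustration, not a complete governance framework. The `DatasetRecord` fields, the `governance_gaps` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Basic governance metadata kept alongside a data set."""
    name: str
    source: str                      # where the data came from
    collected_on: str                # ISO date of collection
    known_limitations: List[str] = field(default_factory=list)
    labels_reviewed: bool = False    # has labeling been checked and cleaned?
    allowed_uses: List[str] = field(default_factory=list)


def governance_gaps(record: DatasetRecord) -> List[str]:
    """List the checklist items this record does not yet satisfy."""
    gaps = []
    if not record.source.strip():
        gaps.append("source is not documented")
    if not record.known_limitations:
        gaps.append("no limitations recorded")
    if not record.labels_reviewed:
        gaps.append("labels have not been reviewed")
    if not record.allowed_uses:
        gaps.append("allowed uses are not defined")
    return gaps


# Hypothetical data set, for illustration only.
record = DatasetRecord(
    name="support_tickets_v1",
    source="internal CRM export",
    collected_on="2021-03-01",
    known_limitations=["English-language tickets only"],
)
gaps = governance_gaps(record)
print(gaps)
```

A check like this can run at the start of a training pipeline so that a data set with open gaps is flagged before it reaches a model.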
It’s important to remember that data governance is not a one-off action, but rather a continuous set of procedures that will keep developing over time. With proper data governance, the results of AI systems will be better and more consistent. You will also be able to comply with any applicable regulations more easily.
The usage of data within organizations will continue to grow. To take full advantage of it, ensure proper data governance from day one, which will provide long-term benefits.
This blog post comes from The Big Y Newsletter.
About the author
PRODUCT MANAGER, 2021.AI
Yina is a Product Manager at 2021.AI working to bring Responsible AI to every enterprise. She has experience working with AI platforms and investing in early-stage startups. Yina is also the author of the newsletter, The Big Y, where she focuses on interesting and relevant AI topics.