Getting started with AI: How much data do you need?
Thirteen years ago, the Economist claimed that oil was no longer the world’s most valuable resource; instead, it was data. Thirteen years later, with all the spectacular advances made in the AI field, companies and corporations are starting to realize the inherent value contained in their data. Data is the fuel that can be leveraged to create new products and services, improve existing ones, or even be sold as an intangible commodity. The question we often get asked when exploring new AI opportunities is: “But how much data do we need?”
When working with data, there is no single right amount. As mentioned in a previous post, most companies struggle to provide high-quality datasets that can be combined and used together. Many problems can arise when carrying out data due diligence. Usually, the datasets under scrutiny fall short in at least one of these categories:
Categories of the datasets
However, some of these aspects are more critical than others, and some are harder to fix. Missing records can in some cases be backfilled or inferred, and mistakes may be corrected based on rules or logic. But if the data is limited in quantity, collecting more may be costly or time-consuming.
This technical blog post intends to provide the reader with a comprehensive high-level view of state-of-the-art techniques dealing with limited or incomplete data and a broad understanding of the methods which can be used to address this challenge.
Dealing with limited or incomplete data
1. How much is enough?
The minimal size of the dataset can depend on many factors, such as the complexity of the model you’re trying to build, the performance you’re aiming for, or even the time frame at your disposal. Usually, machine learning practitioners will try to achieve the best results with the minimum amount of resources (data or computation) while building their AI predictive model. This means that they will first try simple models with few data points before trying more advanced methods potentially requiring vast amounts of data.
Let’s imagine that you are trying to work out a linear model between your target variable (i.e. what you are trying to predict) and a single feature (i.e. your explanatory variable). As you may remember from high school math, such a linear model has only two parameters (y = a*x+b). You may also remember that two data points are generally enough to fit a straight line.
If you consider a quadratic model with three parameters (y = a*x²+b*x+c), you’ll need at least three data points, and so on. Usually, even if there is no one-to-one relationship, the more complex your model becomes, the more data you will need to determine its parameters. For instance, Inception V3 from Google, one of the latest models for classifying images, contains a bit less than 24 million parameters and requires about 1.2 million data points (in that case labeled images) to be trained.
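To make the parameter-counting argument concrete, here is a minimal sketch in plain Python. The helper name `fit_polynomial` and the sample points are assumptions made for illustration; the function simply solves for the coefficients of the polynomial that passes exactly through the given points.

```python
from fractions import Fraction


def fit_polynomial(points):
    """Return coefficients [c0, c1, ..., cn] of the polynomial
    c0 + c1*x + ... + cn*x**n that passes through n+1 points with distinct x."""
    n = len(points)
    # One equation per data point: c0 + c1*x + ... = y (augmented matrix).
    rows = [[Fraction(x) ** k for k in range(n)] + [Fraction(y)] for x, y in points]
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# Two parameters (y = a*x + b): two points pin the line down.
line = fit_polynomial([(0, 1), (1, 3)])              # b = 1, a = 2
# Three parameters (y = a*x**2 + b*x + c): three points are needed.
parabola = fit_polynomial([(0, 1), (1, 2), (2, 5)])  # c = 1, b = 0, a = 1
```

These are the bare minimum for noise-free points. With noisy real-world data, considerably more points than parameters are normally needed to get a reliable fit.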
The amount of data needed may also have to do with the particular problem you have at hand. If you are trying to forecast a time series with a simple structure but with very long seasonality or cyclical patterns (e.g. 10 years), the bottleneck may not reside in the number of parameters in the model to be uncovered, but more in your ability to collect data points for the past 30 years. This represents “only” 360 data points for monthly data, but it may not be possible to collect them.
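The arithmetic behind this example is worth spelling out, because the number of observed cycles matters as much as the number of data points. The variable names below are illustrative only.

```python
years_of_history = 30     # history length from the example above
points_per_year = 12      # monthly data
cycle_length_years = 10   # length of one cyclical pattern

n_points = years_of_history * points_per_year
n_full_cycles = years_of_history // cycle_length_years

print(n_points)       # 360
print(n_full_cycles)  # 3
```

In other words, 30 years of monthly data gives 360 points, but the model only ever sees the 10-year pattern repeat three times.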
Finally, there are actual mathematical ways to figure out whether you have enough data. Let’s say that your team of data scientists has worked on a model and has reached the best possible performance with the data at hand, but it’s just not enough. What should you do now? Collect different data? Collect more of the same data? Or, should you collect both to optimize your time and efforts? This question can be answered by a diagnostic of the model and data by means of a learning curve, which shows how the model’s performance increases as you add more data points as depicted in Fig. 1.
Fig 1. Model’s performance as a function of the training dataset size. The figure is taken from Researchgate.
The idea is to see how much the model’s performance benefits from adding more data and whether or not the model has already saturated, in which case, adding more data will not help.
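The diagnostic described above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the synthetic data (y = 2x + 1 plus noise), the simple least-squares line as the model, and the helper names. In practice a library utility such as scikit-learn's `learning_curve` would typically do this job on your real model and data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(rng, n):
    """Synthetic data (assumed for the demo): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def learning_curve(sizes, n_repeats=30, seed=0):
    """Average validation error for each training set size."""
    rng = random.Random(seed)
    val_x, val_y = make_data(rng, 500)  # held-out validation set
    curve = []
    for size in sizes:
        errors = []
        for _ in range(n_repeats):
            train_x, train_y = make_data(rng, size)
            errors.append(mse(fit_line(train_x, train_y), val_x, val_y))
        curve.append((size, sum(errors) / n_repeats))
    return curve


curve = learning_curve([5, 10, 20, 40, 80, 160, 320])
for size, error in curve:
    print(f"{size:4d} training points -> validation MSE {error:.3f}")
```

When the error stops improving from one size to the next, the model has saturated and more of the same data is unlikely to help. If the error is still dropping at the largest size, collecting more data is probably worthwhile.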
2. What to do if you are running short on data?
If you find yourself in a situation where you need more data, there are different strategies to consider depending on the problem at hand and your situation:
If collecting more data is not possible
If you are unable to collect more of the same data, you can resort to either data augmentation or data synthesis, i.e. creating artificial data based on the data you already have.
Data Augmentation – consists of generating new data points based on the ones you already have. For an image dataset, this could mean creating new images with lower or higher resolutions, cropped, rotated, with linear transformations, or with added noise. This would help your algorithm become more robust to these types of perturbations. For further reading, have a look at unsupervised data augmentation.
Data synthesis – is sometimes used to remedy classification problems where the classes are imbalanced, i.e. one class is underrepresented. New data points can be created using sampling techniques such as SMOTE. More recent and advanced methods leverage the power of deep learning and aim at learning the distribution (or more generally a representation) of the data in order to artificially generate new data that mimics the real data. Among such methods, one can mention variational autoencoders and generative adversarial networks.
Discriminative methods – when data is limited, you want to make sure that the model focuses on the right signal. A common technique is called regularization, where you penalize model complexity (for example, large weights on “non-important” features) so that the model relies on the most relevant information instead of fitting noise. More recently, in the realm of deep learning, a method called multi-task learning is used to exploit the limited amount of data at hand and alleviate the overfitting seen in single-task model training. In essence, you train one model on several related tasks at once, with the tasks sharing part of the model, in order to better generalize to new, unseen data.
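As a minimal sketch of regularization, the snippet below fits a one-feature linear model with an L2 (ridge) penalty on the slope. The function name, the toy data, and the penalty values are assumptions chosen for illustration; the point is only to show how a larger penalty shrinks the coefficient and so reduces model complexity.

```python
def ridge_slope(xs, ys, penalty):
    """Slope a of y = a*x + b minimising squared error + penalty * a**2.

    The intercept is left unpenalised, so the data are centred first.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx + penalty)


# Toy data, made up for the demo.
xs = [0, 1, 2, 3]
ys = [0.1, 2.3, 3.8, 6.2]

# penalty = 0 is plain least squares; larger penalties shrink the slope.
slopes = [ridge_slope(xs, ys, penalty) for penalty in (0, 5, 45)]
print([round(s, 3) for s in slopes])  # [1.98, 0.99, 0.198]
```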
However, data augmentation and synthesis will most likely have marginal effects if your data is not well distributed, or too small in size to make use of the above-mentioned methods. In that case, you will have no other choice but to go out and collect new data points.
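The augmentation and synthesis ideas above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: `augment_image` and `smote_like` are hypothetical helper names, the tiny 2x2 "image" and the minority-class points are made up, and the second function only follows the core idea behind SMOTE (interpolating between a minority sample and one of its nearest minority neighbours).

```python
import math
import random


def augment_image(image, rng, noise_scale=0.05):
    """Return simple variants of a grayscale image (rows of floats in [0, 1])."""
    flipped = [list(reversed(row)) for row in image]     # horizontal flip
    rotated = [list(row) for row in zip(*image[::-1])]   # 90 degrees clockwise
    noisy = [[min(1.0, max(0.0, p + rng.gauss(0, noise_scale))) for p in row]
             for row in image]                            # clipped Gaussian noise
    return [flipped, rotated, noisy]


def smote_like(minority, n_new, rng, k=2):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


rng = random.Random(0)
image = [[0.0, 0.5], [1.0, 0.25]]
flipped, rotated, noisy = augment_image(image, rng)

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6, rng=rng)
```

Because every synthetic point lies between two existing minority samples, this kind of synthesis cannot reach regions the original data never covered, which is one reason it helps little when the data is poorly distributed.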
If collecting more data is an option
If collecting more data is the way to go, either because it is affordable to collect more of the same data you already have or because you have access to large amounts of data, even partially complete data such as unlabeled data, you basically have two options:
Data Collection – is always the first option to consider. If your resources are limited and you have access to domain experts (aka SMEs, Subject Matter Experts) who can help you qualify (label) your data, you may want to give active learning a try. With active learning, the process of learning is iterative: the algorithm is trained on a limited number of labeled data points, then the model identifies difficult unlabeled points and interactively asks an SME to label them, and the newly labeled points are in turn included in the training set.
Data Labeling – is about using the data points that you already own, but which are not part of your training or testing data (i.e. data used for modeling) because they are incomplete (e.g. missing label data). In that case, it might be interesting to see how you can leverage the latest advances in AI in order to make use of this untapped data potential.
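The active learning loop mentioned above can be sketched as follows. This is a deliberately tiny, one-dimensional illustration under stated assumptions: `oracle` is a hypothetical stand-in for the SME, the "model" is just a threshold between the two classes, and the query rule is uncertainty sampling (ask about the point closest to the current decision boundary).

```python
def oracle(x):
    """Stand-in for the SME: returns the true label (hypothetical rule)."""
    return int(x >= 63)


def fit_threshold(labeled):
    """'Train' a 1-D classifier: threshold halfway between the two classes."""
    largest_negative = max(x for x, y in labeled.items() if y == 0)
    smallest_positive = min(x for x, y in labeled.items() if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learning(pool, labeled, n_queries):
    """Iteratively train, pick the most uncertain point, and ask the oracle."""
    labeled = dict(labeled)
    for _ in range(n_queries):
        threshold = fit_threshold(labeled)
        unlabeled = [x for x in pool if x not in labeled]
        # Uncertainty sampling: the point nearest the decision boundary
        # is the one the current model is least sure about.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)  # only this point costs expert time
    return fit_threshold(labeled), labeled


pool = list(range(101))
threshold, labeled = active_learning(pool, {0: 0, 100: 1}, n_queries=8)
print(threshold, len(labeled))  # 62.5 10
```

With only ten labels out of 101 candidate points, the threshold already separates the two classes correctly, which is the appeal of letting the model choose what the expert should label.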
3. How to label your unlabeled data?
If you’ve already recorded a significant amount of data but missed some parts of the information such as the label, you could, of course, try to retrieve this information manually (data collection with traditional supervision) but it can turn out to be a very slow and painful process.
Fig 2. Different approaches to address the lack of data. The figure is taken from Stanford.
There are three major routes that you can try in order to get more usable data from unused and unlabeled data, summarized in Fig 2.
Let’s examine them one by one:
Semi-supervised learning – is particularly interesting if you find yourself in a situation with a small amount of labeled data and a large amount of unlabeled data. The idea is to use both the labeled and unlabeled data to achieve higher modeling performance, either by inferring the labels of the unlabeled data or by using the unlabeled data directly to shape the model. Semi-supervised learning makes specific assumptions about the topology of the data, e.g. that points close to each other belong to the same class. The interested reader can have a look at the MixMatch algorithm, a development from Google Research.
Fig 3. Illustration of a case for semi-supervised learning, showing how unlabeled data (gray points) can help determine the right model (decision boundary as a dotted line). The figure is taken from Wikipedia.
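The closeness assumption can be illustrated with a very small label propagation sketch. It is not MixMatch or any particular published algorithm, just an assumed toy procedure: at each step, the unlabeled point nearest to an already labeled point inherits that point's label. The function name and the coordinates are made up.

```python
import math


def propagate_labels(labeled, unlabeled):
    """Repeatedly give the unlabeled point closest to any labeled point
    the label of that nearest labeled neighbour."""
    labeled = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        point, neighbour = min(
            ((u, l) for u in remaining for l in labeled),
            key=lambda pair: math.dist(*pair),
        )
        labeled[point] = labeled[neighbour]
        remaining.remove(point)
    return labeled


# One labeled point per class, plus unlabeled points forming two clusters.
seed_labels = {(0.0, 0.0): "a", (5.0, 5.0): "b"}
unlabeled = [(0.5, 0.2), (1.0, 0.1), (1.5, 0.4), (2.0, 0.8),
             (4.5, 4.8), (4.0, 5.1), (3.6, 4.4)]

result = propagate_labels(seed_labels, unlabeled)
```

Here two labeled points are enough to label all nine, because the labels spread along each cluster. The approach breaks down when the classes overlap, since the closeness assumption no longer holds.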
Transfer learning – is about “recycling” models that have been trained for a similar task and rewiring them to perform another one. For instance, you can take the above-mentioned Inception V3 model from Google, which has been trained to distinguish between 1,000 different categories, and “fine-tune” it for your own application. This way, the model will become really effective at differentiating between a couple of new categories of interest. This approach, however, requires that you have access to a pre-trained model and that you can actually use transfer learning with it, which may not always be the case. Transfer learning can, in some ways, be considered a weak supervision method.
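The mechanics of fine-tuning can be shown with a toy sketch: a frozen feature extractor is reused as-is, and only a small new "head" is trained on a handful of labeled examples. Everything below is a stand-in chosen for illustration. In a real project the frozen part would be a pre-trained network such as Inception V3 loaded through a deep learning framework, not the hypothetical `pretrained_features` function used here, and the head would usually be trained by gradient descent rather than the simple perceptron rule.

```python
def pretrained_features(pixels):
    """Frozen 'pre-trained' extractor (hypothetical): mean brightness of
    the left and right halves of a 4-pixel strip."""
    half = len(pixels) // 2
    return [sum(pixels[:half]) / half, sum(pixels[half:]) / half]


def train_head(examples, epochs=100):
    """Train only a new linear head (perceptron rule) on frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for pixels, label in examples:
            features = pretrained_features(pixels)  # extractor is never updated
            score = sum(w * f for w, f in zip(weights, features)) + bias
            error = label - int(score > 0)  # -1, 0 or +1
            if error:
                mistakes += 1
                weights = [w + error * f for w, f in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


def predict(head, pixels):
    """Classify a strip using the frozen extractor plus the trained head."""
    weights, bias = head
    features = pretrained_features(pixels)
    return int(sum(w * f for w, f in zip(weights, features)) + bias > 0)


# A handful of labeled examples for the new task: 1 = "left half brighter".
new_task = [
    ((0.9, 0.8, 0.1, 0.2), 1),
    ((0.7, 0.9, 0.2, 0.1), 1),
    ((0.1, 0.2, 0.8, 0.9), 0),
    ((0.2, 0.1, 0.9, 0.7), 0),
]
head = train_head(new_task)
print(predict(head, (0.6, 0.7, 0.3, 0.2)))  # 1
```

Four examples are enough here only because the reused features already capture what matters for the new task, which is exactly the condition under which transfer learning pays off.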
Weak supervision – The rationale behind weak supervision is to use noisy, low-quality labels in order to fill in missing information in your existing data. Weak supervision is relevant when a very large amount of labeled data is needed and a certain level of domain expertise is available to be leveraged, so it is not applicable to every case of a limited dataset. A remarkable example of a weak supervision tool comes from Stanford research and goes by the name of Snorkel. It removes humans from the manual labeling process and instead makes use of “labeling functions,” which still incorporate human knowledge. Snorkel starts by asking humans to write a set of labeling functions that encode heuristics, and these functions are then used to label the data points. Of course, these labels are going to be imprecise and noisy, but Snorkel will automatically build a generative model from the outputs of the labeling functions to create a probabilistic label reflecting the confidence in each label.
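To give a feel for labeling functions, here is a plain-Python sketch. It does not use the Snorkel API, and it replaces Snorkel's generative model with a much cruder stand-in: the share of non-abstaining votes. The spam-filtering task, the three heuristics, and all names are assumptions made for illustration.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Heuristic: messages containing a link are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Heuristic: messages mentioning a prize are likely spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Heuristic: very short messages are likely legitimate."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def probabilistic_label(text):
    """Return P(SPAM) as the share of non-abstaining votes for SPAM,
    or None when every labeling function abstains."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return sum(v == SPAM for v in votes) / len(votes)


messages = [
    "Claim your prize now at http://example.com",
    "See you at lunch",
    "Prize inside",
    "The quarterly report is attached for your review",
]
labels = [probabilistic_label(m) for m in messages]
print(labels)  # [1.0, 0.0, 0.5, None]
```

The third message shows why a probabilistic label is useful: two heuristics disagree, so the label carries that uncertainty instead of hiding it. A tool like Snorkel goes further by estimating how much each labeling function should be trusted rather than weighting them equally.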
The brave reader will have by now understood that a data shortage is not a dead end and that many solutions already exist to address this commonly faced challenge. However, it can be somewhat difficult to identify which approach is best suited to your case. In particular, most of the recently developed approaches are designed for unstructured data (images, videos, text, audio, speech, etc.) and do not always translate directly to more traditional tabular data.
About the author
FORMER HEAD OF DATA SCIENCE, 2021.AI
Benjamin is an AI & ML expert with knowledge of development and implementation across industries and sectors. Before 2021.AI, Benjamin worked as the Lead Data Scientist at eBay.