Why you don't need big data to train ML
When somebody says artificial intelligence (AI), they most often mean machine learning (ML). Most people believe that to create an ML algorithm you must collect a labeled dataset, and that the dataset must be huge. That is all true if the goal is to describe the process in a single sentence. But if you understand the process a little better, big data turns out to be far less necessary than it first seems.
Why many people think nothing will work without big data
To begin with, let’s discuss what a dataset and training are. A dataset is a collection of objects that are typically labeled by a human so that the algorithm can understand what it should look for. For example, if we want to find cats in photos, we need a set of pictures with cats and, for each picture, the coordinates of the cat, if it exists.
During training, the algorithm is shown the labeled data in the expectation that it will learn to predict labels, pick up general dependencies, and solve the problem on data it has never seen.
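To make this concrete, here is what a tiny labeled dataset for the cat example might look like in Python. This is purely illustrative; the file names and box coordinates are made up.

```python
# A toy labeled dataset for the cat example: each record pairs an image
# with the bounding box of the cat, or None if no cat is present.
# File names and coordinates are illustrative only.
dataset = [
    {"image": "photo_001.jpg", "cat_box": (34, 50, 210, 280)},  # (x1, y1, x2, y2)
    {"image": "photo_002.jpg", "cat_box": None},                # no cat in this photo
    {"image": "photo_003.jpg", "cat_box": (12, 8, 150, 200)},
]
```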
One of the most common challenges in training such algorithms is called overfitting. Overfitting occurs when the algorithm remembers the training dataset but doesn’t learn how to work with data it has never seen.
Let’s take the same example. If our data contains only photos of black cats, the algorithm may learn a spurious rule: black with a tail = cat. The false dependency is not always so obvious, though. If there is little data and the algorithm is expressive enough, it can simply memorize the entire dataset, latching onto uninterpretable noise.
The easiest way to combat overfitting is to collect more data because this helps prevent the algorithm from creating false dependencies, such as only recognizing black cats.
The caveat here is that the dataset must be representative (e.g., using only photos from a British shorthair fan forum won’t yield good results, no matter how large the pool is). Because more data is the simplest solution, the opinion persists that a lot of data is needed.
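Here is a minimal sketch of what overfitting looks like in practice, assuming scikit-learn and a synthetic dataset: a high-capacity model scores perfectly on the data it has memorized but noticeably worse on data it has never seen.

```python
# Minimal overfitting sketch: a high-capacity model memorizing a tiny dataset.
# Assumes scikit-learn is installed; the data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=60, n_features=20, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# An unconstrained decision tree can memorize the small training set ...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```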
Ways to launch products without big data
However, let’s take a closer look. Why do we need data? So the algorithm can find dependencies in it. Why do we need a lot of data? So that it finds the correct dependencies. How can we reduce the amount of data? By pointing the algorithm toward the correct dependencies ourselves.
Skinny algorithms
One option is to use lightweight algorithms. Such algorithms cannot find complex dependencies and, accordingly, are less prone to overfitting. The difficulty with such algorithms is that they require the developer to preprocess the data and look for patterns on their own.
For example, assume you want to predict a store’s daily sales, and your data is the store’s address, the date, and the list of all purchases for that date. One feature that makes the task easier is a day-off indicator: on a weekend or holiday, customers will probably shop more often, and revenue will rise.
Manipulating the data in this way is called feature engineering. This approach works well in problems where such features are easy to create based on common sense.
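As a rough illustration (the column names and values below are assumptions, not a real dataset), the day-off feature for the sales example could be derived with pandas along these lines:

```python
# Minimal feature-engineering sketch with pandas: derive a "day off" flag
# from the date column. Column names and values are illustrative.
import pandas as pd

sales = pd.DataFrame({
    "store_address": ["12 Main St", "12 Main St", "34 Oak Ave"],
    "date": pd.to_datetime(["2023-03-03", "2023-03-04", "2023-03-04"]),
    "revenue": [1450.0, 2100.0, 1890.0],
})

# Hand-crafted feature: 1 if the date falls on a weekend, else 0.
# A real project could also merge in a public-holiday calendar here.
sales["is_day_off"] = (sales["date"].dt.dayofweek >= 5).astype(int)

# A simple model (e.g., linear regression) can now use this feature directly.
print(sales)
```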
However, in some tasks, such as working with images, everything is more difficult. This is where deep learning neural networks come in. Because they are high-capacity algorithms, they can find non-trivial dependencies in cases where a person simply couldn’t work out the structure of the data by hand. Almost all recent advances in computer vision are credited to neural networks. Such algorithms typically do require a lot of data, but they, too, can be guided.
Searching the public domain
The first way to do this is by fine-tuning pre-trained models. There are many already-trained neural networks in the public domain. While there may not be one trained for your specific task, there is likely one from a similar area.
These networks have already learned a basic understanding of the world; they just need to be nudged in the right direction, so only a small amount of data is needed. Here we can draw an analogy with people: a person who can skateboard will pick up longboarding with far less guidance than someone who has never stood on a skateboard before.
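A minimal fine-tuning sketch, assuming PyTorch and torchvision (0.13 or later for the weights argument); data loading and the training loop are omitted:

```python
# Minimal fine-tuning sketch: take a ResNet pre-trained on ImageNet,
# freeze its backbone, and replace only the final classification layer.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")  # weights learned on ImageNet

# Freeze the pre-trained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for our task, e.g. cat vs. no cat.
# The new layer's parameters are trainable by default.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only model.fc.parameters() need to be passed to the optimizer;
# training then proceeds as usual on the small labeled dataset.
```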
In some cases, the problem is not the number of objects but the number of labeled ones. Sometimes collecting data is easy, but labeling is very difficult, for example when the labeling requires expert knowledge, such as classifying body cells: the few people qualified to label this data are expensive to hire.
Even if there is no similar task available in the open-source world, it is still possible to come up with a pre-training task that does not require labeling. One example is training an autoencoder, a neural network that compresses objects (similar to a .zip archiver) and then decompresses them. To compress effectively, it has to find general patterns in the data, which means the pre-trained network can then be fine-tuned on the small labeled set.
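A minimal PyTorch sketch of such an autoencoder follows; the layer sizes and the random batch are placeholders, and the encoder trained this way would later serve as the backbone for fine-tuning.

```python
# Minimal autoencoder sketch: trained only on reconstruction error,
# so no labels are needed. Architecture sizes are illustrative.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Compress the input to a small code ...
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # ... and try to reconstruct the original from it.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
loss_fn = nn.MSELoss()            # reconstruction error needs no labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a batch of unlabeled examples:
batch = torch.rand(64, 784)       # placeholder for real, unlabeled data
loss = loss_fn(model(batch), batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```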
Active learning
Another approach to improving models when unlabeled data is plentiful is called active learning. The essence of this concept is that the model itself suggests which examples should be labeled next and which examples may be labeled incorrectly. The point is that, along with its answer, the algorithm often reports its confidence in the result. Accordingly, we can run an intermediate model on the unlabeled data, find the examples where the output is most uncertain, give them to people for labeling and then train again.
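A minimal uncertainty-sampling sketch, assuming scikit-learn and a synthetic pool in which only the first 50 examples are treated as labeled:

```python
# Minimal active-learning sketch: train on the small labeled set, then pick
# the unlabeled examples the model is least confident about for annotation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled_idx = np.arange(50)            # pretend only 50 examples are labeled
unlabeled_idx = np.arange(50, 1000)

# Train an intermediate model on the small labeled set.
model = LogisticRegression(max_iter=1000).fit(X[labeled_idx], y[labeled_idx])

# Score the unlabeled pool and pick the examples the model is least sure about.
proba = model.predict_proba(X[unlabeled_idx])
confidence = proba.max(axis=1)                          # confidence in the top class
query_idx = unlabeled_idx[np.argsort(confidence)[:20]]  # 20 least confident examples

# These are the examples to send to human annotators; after labeling them,
# add them to the labeled set and retrain.
print(query_idx)
```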
It is important to note that this is not an exhaustive list of possible options; these are just a few of the simplest approaches. And remember that none of these approaches is a panacea: for some tasks one works better, for others another will yield the best results. The more you try, the better the results you will find.