Training Data: What is it?

Andhika S Pratama
4 min read · Jan 11, 2022

It is widely said that a machine learning model is only as good as the data we feed it. The dataset used to teach a model, which is often extremely large, is called training data.

Before we go through training data, it is worth mentioning that machine learning uses three types of datasets: the training dataset, the validation dataset, and the test dataset. Each serves a different purpose: the training dataset teaches the model, the validation dataset is used to tune it during development, and the test dataset gives a final check on data the model has never seen. You can’t reuse the same data for all three if you want unbiased results.
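To make the three-way split concrete, here is a minimal sketch in Python. The 70/15/15 proportions, the `split_dataset` helper, and the file names are assumptions made for illustration, not something prescribed by this article.

```python
import random

def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the items and cut them into three non-overlapping subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left becomes the test set
    return train, val, test

# Hypothetical file names standing in for real images
images = [f"drink_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Because every image lands in exactly one of the three lists, the model is never tested on something it already saw during training.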

Training data itself comes in two types: labeled data and unlabeled data.

1. Labeled data

Labeled data is used for supervised machine learning models. It is tagged, labeled, or annotated by humans according to set criteria so that the model can learn to produce the desired output. A single item can even carry more than one label if the criteria call for it.

For example, an image of a drink could carry two labels: the type of container (a can, a plastic bottle, or a glass) and whether the drink is damaged or not. Both labels can be attached to the same image so that the model learns to tell the container types apart and to spot damaged products in the future.
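One way to picture such a multi-label record is sketched below. The `LabeledImage` class, its field names, and the file names are all hypothetical; real annotation tools each have their own export format.

```python
from dataclasses import dataclass

@dataclass
class LabeledImage:
    path: str        # where the image file lives (hypothetical names below)
    container: str   # "can", "plastic bottle", or "glass"
    damaged: bool    # True if the annotator marked the drink as damaged

annotations = [
    LabeledImage("img_001.jpg", "can", False),
    LabeledImage("img_002.jpg", "plastic bottle", True),
    LabeledImage("img_003.jpg", "glass", False),
]

# Both labels live on the same record, so either one can be queried
damaged_paths = [a.path for a in annotations if a.damaged]
print(damaged_paths)  # ['img_002.jpg']
```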

This process, called data annotation, is very time-consuming and expensive.

2. Unlabeled data

Unlabeled data is the opposite of labeled data: we feed the machine learning model raw data and let the model learn the patterns by itself, which is known as unsupervised learning. No human tagging is involved.

If we use the drink example, the model groups the images based on their characteristics, in this case their shape. After enough images are fed into the model, it should be able to recognize the differences between those drinks.
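As a rough illustration of grouping by shape without any labels, the sketch below runs a tiny two-cluster k-means on a single number per image. The height-to-width ratios are made up, and `two_clusters` is a hypothetical helper; a real model would work from the image pixels rather than one hand-picked measurement.

```python
def two_clusters(values, iterations=10):
    """Tiny 1-D k-means with two clusters: group numbers around two centers."""
    centers = [min(values), max(values)]  # start from the two extremes
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Put each value in the group whose center is closest
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the average of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Made-up height-to-width ratios: squat cans and tall bottles, no labels given
ratios = [1.8, 3.1, 1.9, 3.3, 2.0, 3.0, 1.85, 3.2]
centers, groups = two_clusters(ratios)
print(groups)
```

Nobody told the code which ratios belong to cans and which to bottles; the two groups fall out of the numbers themselves.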

There are also hybrid approaches, known as semi-supervised learning, which combine labeled and unlabeled data.

Now that we know the difference between labeled and unlabeled data, the next question is: how do we know that our training data is good?

There are at least three things that you need to be aware of in order for your training data to be good:

1. Relevancy

Relevancy means that the data must be related to what you want the model to learn. You wouldn’t use pictures of cars on a highway, for instance, to teach a model the differences between drink types from the example above.

Focus on data that’s related to what you want the model to learn, based on your criteria.

2. Consistency

Consistent data makes it more likely that your model will score well in the testing phase. For example, the label used for a specific characteristic is the same throughout the entire dataset, each bounding box is drawn tightly around its entity rather than loosely, and the images are sharp rather than blurry.

With both relevancy and consistency in place, you will likely have a high-quality dataset, which in turn tends to translate into high accuracy.
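The point about consistent labels can be checked with very little code. The sketch below assumes an agreed label set (`ALLOWED_LABELS`) and a hypothetical helper, `find_inconsistent_labels`; the sample labels are invented to show typical slips such as a stray capital letter or a typo.

```python
from collections import Counter

# Assumed label set that all annotators agreed on beforehand
ALLOWED_LABELS = {"can", "plastic bottle", "glass"}

def find_inconsistent_labels(labels):
    """Count every label that is not part of the agreed label set."""
    return Counter(label for label in labels if label not in ALLOWED_LABELS)

# Invented sample: "Can", "bottle", and "glas" do not match the agreed spelling
labels = ["can", "Can", "plastic bottle", "bottle", "glass", "can", "glas"]
problems = find_inconsistent_labels(labels)
print(problems)
```

Running a check like this before training catches labels that a human reads as the same thing but a model would treat as separate classes.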

3. Tools

The tools that you use matter for the output you get from your machine learning models. Some tools are better than others, and some offer more features than others. Right now, there are many data annotation tools available, some of them free. You can visit https://www.datasetlist.com/tools/ to see tools that are free or that require a paid subscription.

Garbage in, garbage out

Keep that phrase in mind: if you feed your model low-quality data, you will get low-quality results. Feeding it only high-quality data gives it the best chance of producing high-quality results.

Right now, there are lots of open-source datasets that you can find online. So if you want to train your model on a specific case, search online first before you start making your own dataset, to save yourself some time.

That’s it! That’s what you need to know, at least for a start. There is a lot more to know about training data, but covering it all would make this article very long and a lot to process. So hopefully this article gave you something to learn!

Andhika S Pratama

Hi there! Currently, I’m a Data Annotator at Tictag.io with an interest in writing, such as copywriting and UX writing.