For companies that use ML, labeled data is the key differentiator

AI is driving the software industry’s paradigm shift from writing logical statements to data-centric programming. Data is now oxygen: The more training data a company gathers, the brighter its AI-powered products will burn.

Why is Tesla so far ahead with advanced driver assistance systems (ADAS)? Because no one else has collected as much information — it has data on more than 10 billion driven miles, helping it pull ahead of competition like Waymo, which has only about 20 million miles. But any company that is considering using machine learning (ML) cannot overlook one technical choice: supervised or unsupervised learning.

There is a fundamental difference between the two. In unsupervised learning, the process is fairly straightforward: The acquired data is fed directly to the models, and if all goes well, they identify patterns on their own.
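The contrast can be sketched in a few lines. This is a toy illustration with made-up points, not any production pipeline: the same data, once fed raw to a clustering step and once paired with human-provided labels.

```python
# Four 2-D points standing in for raw training examples (hypothetical data).
points = [(1.0, 1.2), (0.8, 1.0), (5.0, 5.1), (5.2, 4.9)]

# Unsupervised: hand the raw data to an algorithm and let it find structure.
# Here, a single assignment step of 2-means clustering with fixed seed centroids.
def assign_clusters(pts, centroids):
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return [min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            for p in pts]

clusters = assign_clusters(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
# The algorithm groups the points, but the groups carry no meaning by themselves.

# Supervised: every point must first be labeled by a human.
labels = ["pedestrian", "pedestrian", "vehicle", "vehicle"]
training_set = list(zip(points, labels))
# Only now can a classifier learn to map new inputs to meaningful categories.
```

The extra `labels` list is exactly the "crucial additional step" discussed below: it is cheap at four points and enormously expensive at billions of images.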

Elon Musk compares unsupervised learning to the human brain, which takes in raw data from the senses and makes sense of it. He recently shared that making unsupervised learning work for ADAS is a major challenge that hasn’t been solved yet.

Supervised learning is currently the most practical approach for most ML challenges. O’Reilly’s 2021 report on AI Adoption in the Enterprise found that 82% of surveyed companies use supervised learning, while only 58% use unsupervised learning. Gartner predicts that through 2022, supervised learning will remain favored by enterprises, arguing that “most of the current economic value gained from ML is based on supervised learning use cases.”

Supervised learning requires the crucial additional step of making raw data smart by labeling it. If we take the example of Tesla’s ADAS, a human has looked at and labeled pretty much every object in every image in all that training data to identify people, traffic signs, other vehicles, etc.

“Raw data, while plentiful and, in theory, useful, cannot typically be used by an ML system without modification and preparation,” writes Peter Levine, a partner at venture firm Andreessen Horowitz. “Before being fed into an ML framework like PyTorch or TensorFlow, data has to be aggregated, transformed, cleaned, augmented, and — in most cases — labeled.”

It turns out that data labeling can take up to 80% of the resources in the average ML project. It’s also a big source of failure: 70% of companies report having problems labeling their data. To date, data labeling has been a brute force affair — the more Mechanical Turk workers or the larger the annotation farms a company throws at the problem, the faster it gets done. The cost and speed of iteration are linear in the number of workers the company can hire. In other words, it does not scale well.

However, AI itself has a solution to this problem: Leveraging ML to pre-label the raw data so workers only have to confirm what the computer has done. Human labelers can then focus on edge cases, making the process faster and cheaper.
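In practice, this is a triage loop: the model proposes a label with a confidence score, high-confidence items are queued for quick confirmation, and low-confidence ones are routed to a human for full annotation. A minimal sketch follows; the 0.9 threshold, the item fields, and the `predict` stand-in are all hypothetical choices, not any vendor’s API.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; tuned per project in practice

def predict(item):
    # Stand-in for a real pre-labeling model: returns (label, confidence).
    return item["model_label"], item["model_confidence"]

def triage(items, threshold=CONFIDENCE_THRESHOLD):
    """Split a batch into pre-labels to confirm and edge cases for humans."""
    auto_accepted, needs_review = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            auto_accepted.append((item["id"], label))  # worker just confirms
        else:
            needs_review.append(item["id"])            # human labels from scratch
    return auto_accepted, needs_review

batch = [
    {"id": 1, "model_label": "stop_sign", "model_confidence": 0.98},
    {"id": 2, "model_label": "pedestrian", "model_confidence": 0.55},
]
accepted, review = triage(batch)
```

The economics come from the split: confirming a pre-label takes a fraction of the time of drawing annotations from scratch, so human effort concentrates on the hard minority of items.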

It’s been more than five years since computers started to beat humans at image recognition, but the industry has only recently started booming. The data annotation market, which was only valued at $695.5 million in 2019, is expected to surpass $6 billion by 2027.

One of the main players in the space is Scale AI. Its recent $325 million round of funding brought the company to a whopping $7 billion valuation. In its fundraising announcement, the company said it was able to improve Toyota’s annotation throughput by 10 times in a matter of weeks. Toyota AI Ventures senior partner Chris Abshire defined the ability to “easily obtain data, and then extract value from that data with minimal human intervention” as the holy grail for many AI startups.

Data annotation also applies to more traditional industries. Blue River Technology, John Deere’s AI subsidiary, is also using supervised learning to improve John Deere smart sprayers’ ability to tell the difference between a weed and a crop. Using the Labelbox platform, one of Scale AI’s main competitors, Blue River Technology was able to cut its labeling time nearly in half, speeding up iteration while also saving money. “Over the course of 2020, we were able to lower our cost per label by 25%,” says Emma Bassein, Blue River’s director of data and machine learning.

Scale and Labelbox, the largest players in the field, represent different approaches to the labeling problem. Scale is an example of approaching it from a service perspective, where it takes data from its customers and returns it labeled, relieving companies of the task altogether. This approach is popular among enterprises that require large-scale training data sets — primarily self-driving car companies.

Labelbox is an example of a platform perspective that gives data owners the tools to annotate their data without giving up control. The platform approach is more popular among companies that depend primarily on quality rather than quantity in their training data.

Data quality ranks as the second-biggest challenge for companies doing AI, and labeling data is a way to assess it. Data quality encompasses a number of elements, including volume, diversity, accuracy and bias. For instance, with ADAS technology, if there aren’t enough images of rainy conditions, the model won’t work well in a storm.

A good training data platform can identify and fix this problem before the model goes into production and a car crashes in the rain. The labeling process can also identify biases in the data that would otherwise train a model to be racist or sexist — Amazon’s recruiting model discriminating against women is one such disastrous failure caused by poor data quality.
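A gap like the rainy-scene example can be surfaced with a simple coverage audit over the labeled set. This sketch assumes each annotation carries a hypothetical `condition` tag and flags any condition below an arbitrary 10% floor; real platforms run richer checks along the same lines.

```python
from collections import Counter

def coverage_report(annotations, min_share=0.10):
    """Return each condition's share of the dataset, plus underrepresented ones.

    min_share is an assumed floor: conditions below 10% of the set get flagged.
    """
    counts = Counter(a["condition"] for a in annotations)
    total = sum(counts.values())
    shares = {cond: n / total for cond, n in counts.items()}
    flagged = [cond for cond, share in shares.items() if share < min_share]
    return shares, flagged

# Synthetic example: 90 clear scenes, 10 rainy, 2 snowy.
annotations = ([{"condition": "clear"}] * 90
               + [{"condition": "rain"}] * 10
               + [{"condition": "snow"}] * 2)

shares, underrepresented = coverage_report(annotations)
# Rain and snow fall under the floor, signaling more such data is needed
# before the model ships.
```

An audit like this runs in seconds, while discovering the same gap from a fleet’s behavior in a storm is far more costly.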

When a company chooses supervised learning, it needs a strategy that allows it to label data as quickly as it acquires it. It has been essential for software companies to hire top software talent to write the best lines of code; in the new paradigm, the winners will be those that generate the smartest data to build the best AI models.