AI needs an open labeling platform


These days, it's hard to find a public company that doesn't talk about how artificial intelligence is transforming its business. From the obvious (Tesla using AI to improve Autopilot performance) to the less obvious (Levi's using AI to make better product decisions), everyone wants AI.

To achieve this, however, organizations will need to be much smarter about data. Getting to serious AI requires supervised learning, which in turn depends on labeled data. Raw data needs to be carefully labeled before it can be used to fuel supervised learning models. That budget line item is important enough to attract the attention of the C-suite. Executives who have spent the past 10 years storing data, and now need to turn that data into revenue, face three choices:

1. DIY and build your own custom data labeling system. Be prepared to budget for major investments in people, technology, and time to create a robust, large-scale, production-quality system that you will maintain in perpetuity. Sounds simple? After all, that's what Google and Facebook did. The same goes for Pinterest, Uber, and other unicorns. But these are not good comparisons for you. Unlike you, they had battalions of PhDs and IT budgets the size of a small country's GDP to build and maintain these complex labeling systems. Can your organization afford this continued investment, even if you have the talent and the time to build a production system from scratch at scale in the first place? If you are the CIO, it's a question worth answering before you commit.

2. Outsource. There is nothing wrong with professional service partners, but you will still need to develop your own in-house tooling, and this choice takes your business into risky territory. Many vendors of these solutions mix third-party data with your own proprietary data to increase the sample size N, which theoretically results in better models. Are you confident the audit trail for your own data can be maintained throughout the lifecycle of your ongoing labeling requirements? Are the processes you develop as competitive differentiators in your AI journey repeatable and reliable, even if your vendor goes out of business? Your decade of hoarded IP, your data, could potentially end up enriching a competitor that builds its systems with the same partners. Scale.ai is the largest of these service companies, primarily serving the autonomous vehicle industry.

3. Use a training data platform (TDP). Relatively new to the market, these solutions provide a unified platform to consolidate all the work of collecting, labeling, and feeding data into supervised learning models, or even help build the models themselves. This approach can help organizations of any size standardize workflows, much as Salesforce and HubSpot did for customer relationship management. Some of these platforms automate complex tasks using built-in machine learning algorithms, making the job even easier. Best of all, a TDP frees expensive staff, such as data scientists, to spend time building the models they were actually hired to build, not building and maintaining complex, fragile custom systems. Pure-play TDP vendors include Labelbox, Alegion, and Superb.ai.

Above: Labelbox is an example of a TDP that supports labeling of text and images, among other data types.

Why you need a training data platform

The first thing any organization on an AI journey needs to understand is that labeling data is one of the most expensive and time-consuming parts of developing a supervised machine learning system. Data labeling doesn't stop once a machine learning system has matured into production; it persists and usually grows. Whether organizations outsource their labeling or do it all in-house, they need a TDP to handle the job.

A TDP is designed to facilitate the entire process of labeling data. The idea is to produce better data, faster, allowing organizations to build successful AI models and applications as quickly as possible. A number of companies in the space use the term today, but few are true TDPs.

Two things should be table stakes: enterprise readiness and an intuitive interface. If it's not enterprise-ready, IT will reject it. If it's not intuitive, users will go around IT and find something easier to use. Any system that manages sensitive, business-critical information needs enterprise-grade security and scalability; otherwise it's a non-starter. But the same goes for anything that feels like an old-school enterprise product. We are at least a decade into the consumerization of IT, and anything that isn't as easy to use as Instagram simply won't get used. Remember Siebel's famous salesforce-automation shelfware? Salesforce stole that business out from under its nose with an easy user experience and cloud delivery.

Beyond these basics, there are three main requirements: annotate, manage, and iterate. If a system you are considering does not meet all three, you are not looking at a true TDP. Here are the must-haves for your list of considerations:

Annotate. A TDP must provide tools for automating annotation. As much labeling as possible should be done automatically. A good TDP should be able to start from a limited amount of professionally labeled data. For example, it might begin with tumors that radiologists have circled in X-rays, then pre-label tumors itself. The job of humans is then to correct anything that has been mislabeled. The machine attaches a confidence score to each output; it might be 80% sure that a given label is correct. The highest priority for humans should be checking and correcting the labels the machine is least confident about. Organizations should therefore automate annotation wherever possible and invest human review where it ensures the accuracy, consistency, and integrity of labeled data. Much of the work around annotation can be done without human assistance.
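To make that triage concrete, here is a minimal sketch. The model interface (a `predict_with_confidence` method returning a label and a confidence score) and the 80% threshold are illustrative assumptions, not the API of any particular TDP.

```python
# Sketch: model-assisted labeling with confidence-based triage.
# Assumes a hypothetical model interface, predict_with_confidence(item),
# returning (label, confidence); not any specific TDP's API.

def build_review_queue(model, unlabeled_items, threshold=0.80):
    """Pre-label items; queue low-confidence ones for human review."""
    auto_accepted, needs_review = [], []
    for item in unlabeled_items:
        label, confidence = model.predict_with_confidence(item)
        if confidence >= threshold:
            auto_accepted.append((item, label))  # trust the machine
        else:
            needs_review.append((item, label, confidence))
    # Humans see the least-confident predictions first.
    needs_review.sort(key=lambda entry: entry[2])
    return auto_accepted, needs_review
```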

Manage. A TDP should serve as the central system of record for training data projects. This is where data scientists and other team members collaborate. Workflows can be created and tasks assigned, either through integrations with traditional project management tools or within the platform itself.

This is also where datasets can be surfaced again for later projects. For example, every year in the United States, about 30% of all homes are listed for homeowners insurance. To predict and assess risk, insurers depend on data such as the age of a house's roof, the presence of a pool or trampoline, or how far a tree stands from the house. To facilitate this process, companies now apply computer vision to satellite imagery to give insurers continuous analysis. A business must be able to use a TDP to reuse existing datasets when classifying homes in a new market. If a company enters the UK market, for example, it should be able to reuse its existing US training data and simply update it to accommodate local differences such as building materials. These iteration cycles let businesses deliver highly accurate data while adapting quickly to the ongoing changes in homes in the United States and beyond.

This means your TDP should provide APIs for integration with other software, whether project management applications or data collection and processing tools, along with SDKs that let organizations customize their tooling and extend the TDP to meet their needs.
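As a hedged illustration of what such an integration might look like, the sketch below creates a labeling project over a generic REST API. The base URL, endpoint path, payload fields, dataset IDs, and `TDP_API_KEY` environment variable are all hypothetical placeholders; a real platform such as Labelbox ships its own SDK with its own schema.

```python
# Sketch: driving a TDP from your own pipeline via a hypothetical REST API.
import os

import requests

API_BASE = "https://tdp.example.com/api/v1"  # placeholder endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['TDP_API_KEY']}"}

def create_labeling_project(name, dataset_ids):
    """Create a project and attach existing datasets for reuse."""
    resp = requests.post(
        f"{API_BASE}/projects",
        json={"name": name, "datasets": dataset_ids},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["project_id"]

# Reuse US training data for a new UK market, per the insurance
# example above (dataset IDs are made up for illustration).
project_id = create_labeling_project(
    "uk-home-risk", dataset_ids=["us-roofs-2020", "us-pools-2020"]
)
```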

Iterate. A true TDP recognizes that annotated data is never static. It is constantly changing, always iterating, as more data joins the dataset and models provide feedback on how effective that data is. Indeed, the key to data accuracy is iteration. Test the model. Improve the model. Test again. And again and again. A tractor's smart sprayer might apply herbicide to one type of weed only 50% of the time, but as more images of that weed are added to the training data, future iterations of the sprayer's computer vision model can push that to 90% or more. As other weeds are added to the training data, the sprayer learns to recognize those unwanted plants as well. This can be time-consuming and usually requires humans in the loop, although much of the process is automated. You have to iterate, but the idea is to get your models as good as possible as quickly as possible. The goal of a TDP is to speed up these iterations and make each one better than the last, saving time and money.
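For readers who want to see the loop in code, here is a self-contained sketch using scikit-learn, with ground-truth labels standing in for the human review step. The seed size, batch size, and round count are arbitrary illustration choices, not recommendations.

```python
# Sketch: an iterate loop. Train, find the model's least-confident
# items, "send them to humans" (simulated with ground truth), retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for a real corpus of images or documents.
X, y = make_classification(n_samples=2000, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:100] = True  # small professionally labeled seed set

model = LogisticRegression(max_iter=1000)
for round_num in range(5):
    model.fit(X[labeled], y[labeled])
    # Confidence = highest predicted class probability per item.
    unlabeled_idx = np.flatnonzero(~labeled)
    confidence = model.predict_proba(X[unlabeled_idx]).max(axis=1)
    # Queue the 100 least-confident items for (simulated) human labels.
    ask = unlabeled_idx[np.argsort(confidence)[:100]]
    labeled[ask] = True
    print(f"round {round_num}: accuracy = {model.score(X, y):.3f}")
```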

The future

Just as the 18th century shift to standardization and interchangeable parts sparked the industrial revolution, a standard framework for defining TDPs will begin to take AI to new levels. It's still early days, but it's clear that labeled data - managed through a true TDP - can reliably turn raw data (your company's precious IP) into a competitive advantage in almost any industry.

But C-suite executives need to understand the investment required to harness the potential riches of AI. They have three choices today, and whichever decision they make (build, contract, or buy) is going to be costly. As is often the case with critical business infrastructure, building or outsourcing can carry huge hidden costs, especially when adopting a new way of doing business. A true TDP "de-risks" this costly decision while protecting your company's competitive moat: your intellectual property.

(Disclosure: I work for AWS, but the opinions expressed here are my own.)

Matt Asay is a principal at Amazon Web Services. He was previously head of the developer ecosystem for Adobe and held positions at MongoDB, Nodeable (acquired by Appcelerator), HTML5 mobile start-up Strobe (acquired by Facebook) and Canonical. He is a member emeritus of the board of directors of the Open Source Initiative (OSI).

