Learning with Less Labels

Description

The Learning with Less Labels (LwLL) program aims to make training machine learning models more efficient in two ways: by reducing the amount of labeled data required to build a model by six orders of magnitude, and by reducing the amount of data needed to adapt a model to a new environment to tens to hundreds of labeled examples. The program is funded by the Defense Advanced Research Projects Agency (DARPA).

Problem Context

Multiple teams from all over the world are tackling this challenge. Among them are leading players in the field: Cornell University, MIT, IBM, Georgia Tech, University of California, and TNO. Our team consists of five people from the Intelligent Imaging department.

The program consists of two phases. The first phase should lead to a reduction of three orders of magnitude in training data, and the second phase to a reduction of six orders of magnitude. After each phase, the performance of each team is evaluated on blind datasets. Whether a team may continue its research in the second phase depends on its performance in the phase-one evaluation.

Teams are evaluated at multiple so-called checkpoints. At each checkpoint a team can query a certain number of labels, starting from only one label per class. At the final checkpoint, all labels from the training set can be queried. A baseline is constructed against which all teams are compared. Ideally, a team's evaluation metric at the first checkpoint should already equal the baseline's value at the last checkpoint.

The program defines multiple tasks, of which TNO focuses on two: image classification and object detection. In both tasks the aim is to reduce the amount of labeled data required for training a model by six or more orders of magnitude.
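To make the phase targets concrete, the short sketch below converts a reduction in orders of magnitude into a label budget. The starting figure of 1,000,000 labels and the function name `reduced_label_budget` are illustrative assumptions, not numbers or names from the program.

```python
def reduced_label_budget(full_labels: int, orders_of_magnitude: int) -> int:
    """Return the label budget left after the given reduction."""
    # One order of magnitude is a factor of 10; never go below one label.
    return max(1, full_labels // 10 ** orders_of_magnitude)


# Hypothetical starting point: a model that normally needs 1,000,000 labels.
full = 1_000_000
phase_one = reduced_label_budget(full, 3)  # three orders of magnitude
phase_two = reduced_label_budget(full, 6)  # six orders of magnitude
print(phase_one, phase_two)
```

Under this assumption, phase one leaves 1,000 labels and phase two leaves a single label.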

Solution

We believe that tackling the LwLL challenge requires a combination of state-of-the-art techniques, so we combine several ideas. First, we use pretrained models to compute an initial embedding of the data. Second, we apply domain transfer using publicly available labeled data. Third, we choose which labels to query by clustering the initial embeddings. We also use clustering to obtain pseudo labels: a pseudo label is an initial guess of a sample's actual label. Finally, we use augmentation and semi-supervised learning to exploit the structure of the unlabeled data.
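The clustering-based querying and pseudo-labeling steps can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: it assumes k-means as the clustering method, queries the sample closest to each cluster center, and copies that label to the rest of the cluster. The toy 2-D points stand in for embeddings from a pretrained model, and `oracle` is a hypothetical stand-in for the label-query service.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then add farthest points."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=10):
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        assignment = assign(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign(points, centroids)


def select_queries(points, centroids, assignment):
    """Per cluster, pick the member closest to the centroid as the label query."""
    queries = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            queries[c] = min(
                members, key=lambda i: squared_distance(points[i], centroid)
            )
    return queries


def propagate_labels(assignment, queries, oracle):
    """Give every sample the queried label of its cluster (a pseudo label)."""
    cluster_label = {c: oracle(i) for c, i in queries.items()}
    return [cluster_label.get(a) for a in assignment]


# Toy stand-ins for pretrained-model embeddings and their hidden true labels.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
true_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

centroids, assignment = kmeans(embeddings, k=2)
queries = select_queries(embeddings, centroids, assignment)
labels = propagate_labels(assignment, queries, oracle=lambda i: true_labels[i])
print(f"queried {len(queries)} labels, pseudo labels: {labels}")
```

On this toy data two queried labels are enough to label all six samples. On real data clusters are rarely pure, so the propagated labels are noisy, which is why they are treated only as an initial guess and refined with semi-supervised learning.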

Resources

DARPA page

TNO roadmap page

Contact

Klamer Schutte, senior scientist, TNO, e-mail: klamer.schutte@tno.nl