
I am looking for a specific deep learning method that can train a neural network on a mix of clean and noisy labels.

More precisely, I would like the method to be able to leverage the noisy data as well, for instance by not fully "trusting" it, by weighting samples, or by deciding whether to use a given sample for learning at all. Primarily, though, I am looking for inspiration.

Details:

  • My task is sequence-to-sequence NLP.
  • I have both clean pairs of sequences (clean input, clean output) and noisy pairs (noisy input, noisy output).
  • I know for certain which samples in my data are noisy, and if possible, I would like the method to make use of this information (a toy sketch follows this list).
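
For concreteness, here is a minimal sketch of the simplest variant I can think of (in PyTorch; the tensor names and the fixed noisy_weight are purely illustrative, not an established method): down-weighting the loss of samples flagged as noisy.

import torch
import torch.nn.functional as F

def weighted_seq2seq_loss(logits, targets, is_noisy, noisy_weight=0.3):
    """Token-level cross entropy that down-weights samples flagged as noisy.

    logits:   (batch, seq_len, vocab_size) decoder outputs
    targets:  (batch, seq_len) reference token ids
    is_noisy: (batch,) boolean tensor, True for noisy pairs
    (Padding handling is omitted for brevity.)
    """
    # Per-token loss, kept unreduced so each sample can be weighted separately.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    per_sample = per_token.mean(dim=1)  # (batch,)

    # Trust clean samples fully; noisy samples contribute with reduced weight.
    weights = torch.where(
        is_noisy,
        torch.full_like(per_sample, noisy_weight),
        torch.ones_like(per_sample),
    )
    return (weights * per_sample).mean()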

I am happy to provide more information about my use case if needed.

Edit: Noisy vs. negative examples

One of the answers below states: "First, I wouldn't use the word 'noisy' here, because if you know which instances are 'wrong', then these are not noise; they are negative examples."

My view is that my data are noisy examples, but not "negative" ones. To illustrate, consider an example from German-to-English machine translation:

clean (equivalent meaning)

DE Wenn es um die Medien geht, lebt Amerika in einem Paralleluniversum.
EN Regarding the media, the US is living in a parallel universe.

noisy (meaning overlap)

DE Wenn es um die Medien geht, lebt Amerika in einem Paralleluniversum.
EN Regarding the media, the US is weird.

negative (unrelated)

DE Wenn es um die Medien geht, lebt Amerika in einem Paralleluniversum.
EN Is Math related to science?

2 Answers


First, I wouldn't use the word "noisy" here, because if you know which instances are "wrong", then these are not noise; they are negative examples. In my opinion, "noisy" describes data in which positive and negative cases are mixed together in a way that makes it difficult (or impossible) to distinguish between them. I think this matters because you are more likely to find similar use cases and relevant methods using this terminology.

I don't have a precise method to suggest, but I would check the state of the art in machine translation: it is also a sequence-to-sequence task in which there are potential positive/negative cases. In particular, there has been some work on MT quality estimation, where the goal is to predict the quality of a translation of a given sentence. This seems related because it is about labelling or quantifying how good a translation is, and I would assume there are works that re-use labelled/scored translations (including potentially wrong ones) in order to obtain a better model. Unfortunately, I don't have any pointers, since I haven't followed the field recently.
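
To make that last idea concrete, here is a minimal sketch (my own illustration, not an established quality-estimation method; the quality_scores input and the floor parameter are assumptions): scale each pair's training loss by a score in [0, 1] produced by a quality-estimation model.

import torch

def quality_weighted_loss(per_sample_loss, quality_scores, floor=0.1):
    """Scale each pair's loss by an estimated translation quality in [0, 1].

    per_sample_loss: (batch,) loss of each (source, translation) pair
    quality_scores:  (batch,) scores from a quality-estimation model
    floor:           minimum weight, so imperfect pairs still contribute a little
    """
    weights = quality_scores.clamp(min=floor)
    return (weights * per_sample_loss).mean()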

Erwan

There is a Python package created for exactly this purpose: finding label errors and training ML models robustly and reliably even when your data has issues or noisy labels: https://github.com/cleanlab/cleanlab. It works for any dataset you can train a classifier on, and for most data formats, ML and deep learning frameworks, and data modalities, e.g. image, text, tabular, and audio data. I am an author of this package.

Find label issues in 1 line of code

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues

Option 1 - works with sklearn-compatible models - just input the data and labels ツ

label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)

Option 2 - works with ANY ML model - just input the model's predicted probabilities

ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # out-of-sample predicted probabilities from any model
    return_indices_ranked_by='self_confidence',
)
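
If you don't already have out-of-sample predicted probabilities, one standard way to obtain them (a sketch using plain scikit-learn, not a cleanlab-specific API; sklearn_compatible_model is the same placeholder as above) is cross-validation:

from sklearn.model_selection import cross_val_predict

# Each sample's probabilities come from a model that never saw it in training.
pred_probs = cross_val_predict(
    sklearn_compatible_model, data, labels, cv=5, method="predict_proba"
)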

Train a model as if the dataset did not have errors -- 3 lines of code

from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(train_data, labels)

Estimate the predictions you would have gotten if you trained without mislabeled data.

predictions = cl.predict(test_data)
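
For completeness, a sketch of how one might compare this against a plain baseline (test_labels is a hypothetical held-out label array, not something defined above):

from sklearn.metrics import accuracy_score

# Baseline trained directly on the (possibly mislabeled) data.
baseline = LogisticRegression().fit(train_data, labels)

print("baseline accuracy:", accuracy_score(test_labels, baseline.predict(test_data)))
print("cleanlab accuracy:", accuracy_score(test_labels, predictions))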

cgnorthcutt