If I have a lot of text data that needs to be labeled (e.g. sentiment analysis), and given the high accuracy of GPT-4, could I use it to label data? Or would that introduce bias or some other issues?
5 Answers
I agree with Jonathan Oren - in general, GPT-4 works fairly well for straightforward sentiment analysis, e.g. product reviews. One caveat is that there are almost certainly biases inherent in the dataset used to train GPT-4. The GPT-4 tech report (https://cdn.openai.com/papers/gpt-4.pdf) has more detail, e.g.:
- GPT-4 has various biases in its outputs that we have taken efforts to correct but which will take some time to fully characterize and manage.
- It can represent various societal biases and worldviews that may not be representative of the user's intent or of widely shared values.
GPT-4 can understand textual context and respond accordingly. For basic sentiment labeling, I think it would work pretty well, given a well-constructed prompt.
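A minimal sketch of what "the correct prompt" might look like: constrain the model to a fixed label set, then map its free-text reply back onto that set. The actual API call is shown only as a comment (it needs an API key and network access); the prompt builder and parser are runnable. All names here are illustrative assumptions, not a specific library's API.

```python
# Sketch of a sentiment-labeling setup. The model call itself is commented
# out; the prompt construction and reply parsing are runnable as-is.

LABELS = ("positive", "negative", "neutral")

def build_prompt(text: str) -> str:
    """Constrain the model to a fixed label set so replies are easy to parse."""
    return (
        "Classify the sentiment of the following text as exactly one of: "
        f"{', '.join(LABELS)}.\n\nText: {text}\nSentiment:"
    )

def parse_label(raw: str) -> str:
    """Map a free-text model reply onto one of the allowed labels."""
    reply = raw.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "unknown"  # flag for manual review instead of guessing

# Hypothetical usage with a chat-completion client (not executed here):
# raw = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_prompt(review)}],
# ).choices[0].message.content
# label = parse_label(raw)

print(parse_label("Sentiment: Positive."))  # positive
```

Returning "unknown" rather than a default label keeps parsing failures visible, so they can be routed to a human annotator.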
Do you mean that you need to predict the labels for some data, or that you need to obtain the gold-standard labels for the data?
- Predicting means that the labels are inferred by the ML system. Since this is a statistical process, it is likely to produce some errors. For most applications, it is important to know the error rate and/or performance of the system used to predict.
- Obtaining gold-standard labels usually means that the labels have been manually annotated or manually verified. Data annotated this way is typically considered very high quality and can be used to train an ML system. An ML system cannot be used to produce gold-standard labels unless it is proven to perform 100% correctly (and proving that requires evaluating the system on some labeled data, of course).
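Following the distinction above, before trusting GPT-4's predicted labels you would typically measure them against a small manually annotated gold set. A minimal sketch of that check in plain Python (the label lists are made-up examples):

```python
def accuracy(predicted, gold):
    """Fraction of items where the predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must be the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Hypothetical GPT-4 predictions vs. a small manually verified gold set.
predicted = ["positive", "negative", "positive", "neutral"]
gold      = ["positive", "positive", "positive", "neutral"]
print(accuracy(predicted, gold))  # 0.75
```

The measured accuracy on the gold subset gives you the expected error level on the rest of the automatically labeled data, assuming the gold subset is a representative sample.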
Is using GPT-4 to label data advisable?
Yes. E.g., with GPT-2:
Veyseh, Amir Pouran Ben, Franck Dernoncourt, Bonan Min, and Thien Huu Nguyen. "Generating Complement Data for Aspect Term Extraction with GPT-2." In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pp. 203-213. 2022:
[Task: Aspect Term Extraction (ATE)] We fine-tune the generative language model GPT-2 to allow complement sentence generation at test data. The REINFORCE algorithm is employed to incorporate different expected properties into the reward function to perform the fine-tuning. We perform extensive experiments on the benchmark datasets to demonstrate the benefits of the proposed method that achieve the state-of-the-art performance on different datasets.
Veyseh, Amir Pouran Ben, Viet Lai, Franck Dernoncourt, and Thien Huu Nguyen. "Unleash GPT-2 power for event detection." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6271-6282. 2021:
We propose to exploit the powerful pre-trained language model GPT-2 to generate training samples for Event Detection (ED). [...] We evaluate the proposed model on multiple ED benchmark datasets, gaining consistent improvement and establishing state-of-the-art results for ED
The recent study ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks suggests that ChatGPT is more than suitable for data annotation:
Many NLP applications require manual data annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd-workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we demonstrate that ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection. Specifically, the zero-shot accuracy of ChatGPT exceeds that of crowd-workers for four out of five tasks, while ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003 -- about twenty times cheaper than MTurk. These results show the potential of large language models to drastically increase the efficiency of text classification.
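The study's intercoder-agreement claim is easy to reproduce on your own annotations: run the model twice (or compare it against a human coder) and compute an agreement statistic. A minimal sketch of Cohen's kappa in plain Python, with made-up example labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (po - pe) / (1 - pe)  # undefined when pe == 1 (degenerate case)

# Hypothetical labels from two annotation runs over the same four items.
run1 = ["pos", "pos", "neg", "neg"]
run2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(run1, run2))  # 0.5
```

A kappa near 1 indicates near-perfect agreement; values near 0 mean the annotators agree no more than chance would predict, which is a warning sign for the label quality regardless of who (or what) produced the labels.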