If I have a lot of text data that needs to be labeled (e.g. sentiment analysis), and given the high accuracy of GPT-4, could I use it to label data? Or would that introduce bias or some other issues?
5 Answers
I agree with Jonathan Oren - in general, GPT-4 works fairly well for straightforward sentiment analysis, e.g. product reviews. One caveat is that there are almost certainly biases inherent in the dataset used to train GPT-4. The GPT-4 tech report (https://cdn.openai.com/papers/gpt-4.pdf) has more detail, e.g.:
- GPT-4 has various biases in its outputs that we have taken efforts to correct but which will take some time to fully characterize and manage.
- It can represent various societal biases and worldviews that may not be representative of the user's intent or of widely shared values.
GPT-4 can understand textual context and respond accordingly. For basic sentiment labeling, I think it would work pretty well, given a well-constructed prompt.
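A minimal sketch of what "the correct prompt" might look like: constrain the model to a fixed label set, then map its free-text reply back onto that set. The actual API call is shown only as a comment (it needs an API key and network access); the prompt builder and parser are runnable. All names here are illustrative assumptions, not a specific library's API.

```python
# Sketch of a sentiment-labeling setup. The model call itself is commented
# out; the prompt construction and reply parsing are runnable as-is.

LABELS = ("positive", "negative", "neutral")

def build_prompt(text: str) -> str:
    """Constrain the model to a fixed label set so replies are easy to parse."""
    return (
        "Classify the sentiment of the following text as exactly one of: "
        f"{', '.join(LABELS)}.\n\nText: {text}\nSentiment:"
    )

def parse_label(raw: str) -> str:
    """Map a free-text model reply onto one of the allowed labels."""
    reply = raw.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "unknown"  # flag for manual review instead of guessing

# Hypothetical usage with a chat-completion client (not executed here):
# raw = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_prompt(review)}],
# ).choices[0].message.content
# label = parse_label(raw)

print(parse_label("Sentiment: Positive."))  # positive
```

Returning "unknown" rather than a default label keeps parsing failures visible, so they can be routed to a human annotator.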
Do you mean that you need to predict the labels for some data, or that you need to obtain the gold-standard labels for the data?
- Predicting means that the labels are inferred by the ML system. Since this is a statistical process, it is likely to produce some errors. For most applications, it is important to know the error rate and/or performance of the system used to predict.
- Obtaining gold-standard labels usually means that the labels have been manually annotated or manually verified. Data annotated this way is typically considered very high quality and can be used to train an ML system. An ML system cannot be used to produce gold-standard labels unless it is proven to perform 100% correctly (and proving that requires evaluating the system on some labeled data, of course).
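Following the distinction above, before trusting GPT-4's predicted labels you would typically measure them against a small manually annotated gold set. A minimal sketch of that check in plain Python (the label lists are made-up examples):

```python
def accuracy(predicted, gold):
    """Fraction of items where the predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must be the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Hypothetical GPT-4 predictions vs. a small manually verified gold set.
predicted = ["positive", "negative", "positive", "neutral"]
gold      = ["positive", "positive", "positive", "neutral"]
print(accuracy(predicted, gold))  # 0.75
```

The measured accuracy on the gold subset gives you the expected error level on the rest of the automatically labeled data, assuming the gold subset is a representative sample.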
Is using GPT-4 to label data advisable?
Yes. E.g., with GPT-2:
Veyseh, Amir Pouran Ben, Franck Dernoncourt, Bonan Min, and Thien Huu Nguyen. "Generating Complement Data for Aspect Term Extraction with GPT-2." In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pp. 203-213. 2022:
[Task: Aspect Term Extraction (ATE)] We fine-tune the generative language model GPT-2 to allow complement sentence generation at test data. The REINFORCE algorithm is employed to incorporate different expected properties into the reward function to perform the fine-tuning. We perform extensive experiments on the benchmark datasets to demonstrate the benefits of the proposed method that achieve the state-of-the-art performance on different datasets.
Veyseh, Amir Pouran Ben, Viet Lai, Franck Dernoncourt, and Thien Huu Nguyen. "Unleash GPT-2 power for event detection." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6271-6282. 2021:
We propose to exploit the powerful pre-trained language model GPT-2 to generate training samples for Event Detection (ED). [...] We evaluate the proposed model on multiple ED benchmark datasets, gaining consistent improvement and establishing state-of-the-art results for ED
The recent study ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks suggests that ChatGPT is more than suitable for data annotation:
Many NLP applications require manual data annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd-workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we demonstrate that ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection. Specifically, the zero-shot accuracy of ChatGPT exceeds that of crowd-workers for four out of five tasks, while ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003 -- about twenty times cheaper than MTurk. These results show the potential of large language models to drastically increase the efficiency of text classification.
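The study's intercoder-agreement claim is easy to reproduce on your own annotations: run the model twice (or compare it against a human coder) and compute an agreement statistic. A minimal sketch of Cohen's kappa in plain Python, with made-up example labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (po - pe) / (1 - pe)  # undefined when pe == 1 (degenerate case)

# Hypothetical labels from two annotation runs over the same four items.
run1 = ["pos", "pos", "neg", "neg"]
run2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(run1, run2))  # 0.5
```

A kappa near 1 indicates near-perfect agreement; values near 0 mean the annotators agree no more than chance would predict, which is a warning sign for the label quality regardless of who (or what) produced the labels.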