
I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use before I submit an API call. Currently I often submit prompts that yield a "too many tokens" error.

The closest I got to an answer was this post, which still doesn't say what tokenizer it uses.

If I knew what tokenizer the API used, then I could count how many tokens are in my prompt before I submit the API call.

I'm working in Python.

Herman Autore

1 Answer


The tokenizer for GPT-3 is the same as the one used for GPT-2:

https://huggingface.co/docs/transformers/model_doc/gpt2#gpt2tokenizerfast

linked via:

https://beta.openai.com/tokenizer
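
Because GPT-3 reuses the GPT-2 byte-pair encoding, you can count tokens locally with Hugging Face's GPT2TokenizerFast before making an API call. A minimal sketch in Python, assuming the transformers package is installed; the prompt string is just a placeholder:

from transformers import GPT2TokenizerFast

# Load the GPT-2 BPE tokenizer once and reuse it for every prompt.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Example prompt whose token count we want to check."  # placeholder text
num_tokens = len(tokenizer(prompt)["input_ids"])
print(num_tokens)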


UPDATE March 2023

For newer models, including GPT-3.5 Turbo, GPT-4, and the latest embedding models, use the tiktoken tokenizer with the cl100k_base encoding:

https://github.com/openai/tiktoken

A full model-to-encoding mapping can be found here.
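
A minimal sketch with tiktoken, assuming it is installed (pip install tiktoken); the model name and prompt below are only examples:

import tiktoken

# Either look the encoding up by model name...
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# ...or request the encoding directly:
# encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Example prompt whose token count we want to check."  # placeholder text
num_tokens = len(encoding.encode(prompt))
print(num_tokens)

Note that for chat models the API also adds a few formatting tokens per message, so the count above is a lower bound for the full request.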

pjama