
I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use before I submit an API call. Currently I often submit prompts that yield a "too many tokens" error.

The closest I got to an answer was this post, which still doesn't say what tokenizer it uses.

If I knew what tokenizer the API used, then I could count how many tokens are in my prompt before I submit the API call.

I'm working in Python.

Herman Autore

1 Answer


The tokenizer for GPT-3 is the same as the one used for GPT-2:

https://huggingface.co/docs/transformers/model_doc/gpt2#gpt2tokenizerfast

linked via:

https://beta.openai.com/tokenizer
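
Because GPT-3 reuses the GPT-2 byte-pair encoding, you can count tokens locally with Hugging Face's GPT2TokenizerFast before making an API call. A minimal sketch in Python, assuming the transformers package is installed; the prompt string is just a placeholder:

from transformers import GPT2TokenizerFast

# Load the GPT-2 BPE tokenizer once and reuse it for every prompt.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Example prompt whose token count we want to check."  # placeholder text
num_tokens = len(tokenizer(prompt)["input_ids"])
print(num_tokens)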


UPDATE March 2023

For newer models, including GPT-3.5 Turbo, GPT-4, and the latest embedding models, use the tiktoken tokenizer with the cl100k_base encoding:

https://github.com/openai/tiktoken

A full model-to-encoding mapping can be found here.
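
A minimal sketch with tiktoken, assuming it is installed (pip install tiktoken); the model name and prompt below are only examples:

import tiktoken

# Either look the encoding up by model name...
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# ...or request the encoding directly:
# encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Example prompt whose token count we want to check."  # placeholder text
num_tokens = len(encoding.encode(prompt))
print(num_tokens)

Note that for chat models the API also adds a few formatting tokens per message, so the count above is a lower bound for the full request.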

pjama