Questions tagged [tokenization]

70 questions
9
votes
1 answer

What tokenizer does OpenAI's GPT-3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…
Herman Autore
  • 93
  • 1
  • 3
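One quick way to estimate a prompt's token count before submitting a call is the rule of thumb (stated in OpenAI's own docs) that English text averages roughly 4 characters per token; the exact count comes from OpenAI's `tiktoken` library (e.g. `tiktoken.get_encoding("r50k_base")` for GPT-3-era models). A minimal heuristic sketch, dependency-free:

```python
def rough_token_count(text: str) -> int:
    # Heuristic only: GPT-style BPE averages roughly 4 characters per
    # token for English text. Use tiktoken for the real count.
    return max(1, round(len(text) / 4))

prompt = "Translate the following sentence into French."
print(rough_token_count(prompt))
```

This is good enough for a pre-flight sanity check against the model's context limit, but not for exact billing or hard cutoffs.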
9
votes
6 answers

NLP: What are some popular packages for multi-word tokenization?

I intend to tokenize a number of job description texts. I have tried the standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
  • 392
  • 1
  • 4
  • 15
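Popular answers to this question point at NLTK's `MWETokenizer` and gensim's `Phrases`, both of which merge known multi-word expressions back into single tokens after a whitespace split. A minimal sketch of the idea, with a made-up phrase list:

```python
def mwe_tokenize(text, phrases):
    # phrases: set of multi-word expressions, each a tuple of words.
    # Greedily merge the longest matching word run into one token.
    words = text.split()
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrases:
                out.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

phrases = {("machine", "learning"), ("new", "york", "city")}
print(mwe_tokenize("senior machine learning engineer in new york city", phrases))
```

NLTK's `MWETokenizer` works the same way from a user-supplied phrase list, while gensim's `Phrases` learns the list statistically from a corpus.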
8
votes
1 answer

Understanding the effect of num_words of Tokenizer in Keras

Consider the following code: from keras.preprocessing.text import Tokenizer tokenizer = Tokenizer(num_words = 5000) tokenizer.fit_on_texts(texts) print('Found %d unique words.' % len(tokenizer.word_index)) When I run this, it prints: Found 88582…
Mehran
  • 277
  • 1
  • 2
  • 13
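The behaviour that surprises people here is that `fit_on_texts` indexes every word it sees regardless of `num_words` (which is why `word_index` can report 88582 entries), while `num_words` only caps which indices `texts_to_sequences` actually emits. A toy re-implementation of that documented behaviour, with made-up data:

```python
from collections import Counter

class ToyTokenizer:
    # Minimal sketch of the Keras Tokenizer behaviour: fit_on_texts
    # indexes EVERY word, while num_words only limits the indices that
    # texts_to_sequences emits (it keeps the top num_words - 1 words).
    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # Index 1 is the most frequent word, 2 the next, and so on.
        for i, (w, _) in enumerate(counts.most_common(), start=1):
            self.word_index[w] = i

    def texts_to_sequences(self, texts):
        limit = self.num_words or len(self.word_index) + 1
        return [[self.word_index[w] for w in t.lower().split()
                 if self.word_index.get(w, limit) < limit] for t in texts]

texts = ["the cat sat", "the cat ran", "a dog ran"]
tok = ToyTokenizer(num_words=3)
tok.fit_on_texts(texts)
print(len(tok.word_index))            # all six unique words are indexed
print(tok.texts_to_sequences(texts))  # only indices below num_words survive
```

So `len(tokenizer.word_index)` is the full vocabulary size; `num_words` only takes effect when converting texts to sequences.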
5
votes
1 answer

Unigram tokenizer: how does it work?

I have been trying to understand how the unigram tokenizer works, since it is used in the SentencePiece tokenizer that I am planning on using, but I cannot wrap my head around it. I tried to read the original paper, which contains so few details…
Johncowk
  • 205
  • 3
  • 6
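At inference time, the unigram model (Kudo, 2018; one of the two algorithms SentencePiece implements) assigns each vocabulary piece a probability and picks the segmentation that maximises the product of piece probabilities, found by Viterbi search. A sketch of just that inference step, with a made-up vocabulary (training, which prunes the vocabulary via EM, is a separate story):

```python
import math

def unigram_tokenize(word, vocab):
    # vocab maps subword piece -> probability. Viterbi search for the
    # segmentation maximising the product of piece probabilities
    # (equivalently, the sum of log-probabilities).
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the word.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

vocab = {"un": 0.1, "happy": 0.05, "u": 0.02, "n": 0.02, "h": 0.01,
         "appy": 0.01, "unhappy": 0.001}
print(unigram_tokenize("unhappy", vocab))
```

Here "un" + "happy" wins because log(0.1) + log(0.05) beats the single rare piece "unhappy".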
5
votes
2 answers

Converting paragraphs into sentences

I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation. I used spaCy's Sentencizer to begin with. Sample input Python list abstracts: ["A total of 2337 articles were found, and, according…
Van Peer
  • 285
  • 1
  • 4
  • 12
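spaCy's Sentencizer is itself a rule-based splitter over punctuation. To show what such a component is doing under the hood, here is a minimal hand-rolled sketch (the abbreviation list and regex are illustrative only; real tools handle far more cases):

```python
import re

# Break after ., ! or ? that is followed by whitespace and an uppercase
# letter, while protecting a few common abbreviations.
ABBREV = re.compile(r"\b(e\.g|i\.e|et al|Dr|Mr|Ms)\.$")

def split_sentences(paragraph):
    parts, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s+[A-Z])", paragraph):
        candidate = paragraph[start:m.end()].strip()
        if ABBREV.search(candidate):
            continue  # the period belongs to an abbreviation
        parts.append(candidate)
        start = m.end()
    tail = paragraph[start:].strip()
    if tail:
        parts.append(tail)
    return parts

text = "A total of 2337 articles were found. Duplicates were removed. The rest were screened."
print(split_sentences(text))
```

For production use, spaCy's Sentencizer or NLTK's punkt model are better choices; this sketch only illustrates why sentence splitting is harder than splitting on periods.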
5
votes
1 answer

Accuracy of word and sent tokenize versus custom tokenizers in nltk

The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
  • 111
  • 8
4
votes
1 answer

NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?

I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
  • 399
  • 1
  • 5
  • 13
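The usual answer: a subword tokenizer keeps the vocabulary small and bounded while still covering unseen words (no UNK token needed), because any word can be decomposed into known pieces. A minimal sketch of BPE-style inference, with a hand-made merge list (real merge lists are learned from corpus statistics):

```python
def bpe_tokenize(word, merges):
    # merges: ordered list of symbol pairs learned at training time.
    # Start from characters and repeatedly apply the highest-priority
    # merge present in the current symbol sequence.
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe_tokenize("lower", merges))   # a word covered by the merges
print(bpe_tokenize("lowest", merges))  # an unseen word still gets covered
```

A word-level tokenizer would have to map "lowest" to UNK if it was absent from training; the subword tokenizer falls back to "low" plus character pieces instead, which is exactly the advantage the tutorial exploits.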
4
votes
2 answers

ChatGPT: How to use long texts in prompt?

I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT. I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
  • 63
  • 1
  • 1
  • 5
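The pattern behind chatpdf-style apps is usually retrieval-augmented generation: split the long document into overlapping chunks that each fit the token budget, embed the chunks, retrieve the few most relevant ones for a user question, and send only those in the prompt. A sketch of the chunking step (whitespace words stand in for real tokens; with the OpenAI API you would measure chunks with a tokenizer such as tiktoken):

```python
def chunk_text(text, max_tokens=500, overlap=50):
    # Overlapping windows so no sentence is cut off without context on
    # either side. Assumes overlap < max_tokens.
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, max_tokens=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

The embedding and retrieval steps are separate concerns; the point here is only that the model never needs to see the whole PDF at once.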
3
votes
1 answer

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using bio_clinical BERT, for a sentence of 8 words I am getting 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that is being split into em, bed, ding, s. I would…
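A common way to get one vector for a word the tokenizer split into pieces is to pool (typically average) the model's hidden states at those subtoken positions; taking only the first subtoken's vector is another frequent choice. A sketch of the pooling step with made-up two-dimensional vectors standing in for the model's hidden states:

```python
# Made-up piece vectors for illustration; in practice these would be the
# transformer's last-layer hidden states at the subtoken positions.
piece_vectors = {
    "em":   [1.0, 0.0],
    "bed":  [0.0, 1.0],
    "ding": [1.0, 1.0],
    "s":    [0.0, 0.0],
}

def word_vector(pieces):
    # Mean-pool the subtoken vectors into a single word vector.
    dims = len(next(iter(piece_vectors.values())))
    total = [0.0] * dims
    for p in pieces:
        for d, v in enumerate(piece_vectors[p]):
            total[d] += v
    return [t / len(pieces) for t in total]

print(word_vector(["em", "bed", "ding", "s"]))
```

With Hugging Face tokenizers, the mapping from words to subtoken positions is available via the fast tokenizers' offset/word-id information, so you know which positions to pool.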
3
votes
1 answer

What is the difference between TextVectorization and Tokenizer?

What is the difference between layers.TextVectorization() and from tensorflow.keras.preprocessing.text import Tokenizer together with from tensorflow.keras.preprocessing.sequence import pad_sequences? And when to use which?
Pritam Sinha
  • 193
  • 1
  • 9
2
votes
1 answer

From where does BERT get the tokens it predicts?

When BERT is used for masked language modeling, it masks a token and then tries to predict it. What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
  • 223
  • 1
  • 3
  • 10
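The answer is neither regression nor a search: BERT's masked-language-modeling head outputs one score (logit) per entry of its fixed WordPiece vocabulary (30,522 tokens for bert-base), a softmax turns those scores into a distribution, and the predicted token is the argmax. A sketch with a tiny made-up vocabulary and logits:

```python
import math

# One logit per vocabulary entry at the masked position; the candidate
# set is exactly the tokenizer's fixed vocabulary.
vocab = ["[MASK]", "the", "cat", "sat", "mat"]
logits = [0.1, 1.2, 3.5, 0.3, 2.0]

exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
best = max(range(len(vocab)), key=probs.__getitem__)
print(vocab[best])
```

So the model can only ever predict tokens that exist in its vocabulary; there is no regression to an integer ID and no lookup outside the vocabulary.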
2
votes
1 answer

Tokenization of data in a dataframe in Python

I am performing tokenization on each row in my dataframe, but the tokenization is being done for only the first row. Can someone please help me? Thank you. Below is my code: import pandas as pd import json import…
Nedisha
  • 45
  • 1
  • 2
  • 7
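The usual cause of "only the first row gets tokenized" is operating on a single cell instead of the whole column. Applying the tokenizer column-wise handles every row; a sketch (the column name "text" and whitespace tokenization are assumptions for illustration):

```python
import pandas as pd

# Apply the tokenizer to the whole column with .apply (or .str.split),
# not to df["text"][0], which only touches the first row.
df = pd.DataFrame({"text": ["first row here", "second row", "third"]})
df["tokens"] = df["text"].apply(str.split)
print(df["tokens"].tolist())
```

Any callable tokenizer (e.g. one from NLTK or spaCy) can replace `str.split` in the `.apply` call.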
2
votes
1 answer

NLP: What are some popular packages for phrase tokenization?

I'm trying to tokenize some sentences into phrases. For instance, given "I think you're cute and I want to know more about you", the tokens can be something like "I think you're cute" and "I want to know more about you". Similarly, given input Today…
John M.
  • 293
  • 2
  • 3
  • 8
2
votes
1 answer

How to customize word division in CountVectorizer?

>>> from sklearn.feature_extraction.text import CountVectorizer >>> import numpy >>> import pandas >>> vectorizer = CountVectorizer() >>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud'] >>> xtrain = vectorizer.fit_transform(corpus1) >>>…
helloworld
  • 23
  • 1
  • 3
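CountVectorizer's default `token_pattern`, `r"(?u)\b\w\w+\b"`, splits "abc-@@-123" into "abc" and "123" because `-` and `@` are not word characters. Supplying a custom pattern keeps each unit whole; the regex below is an illustrative sketch that can be passed as `CountVectorizer(token_pattern=...)`:

```python
import re

# Custom pattern: a word optionally followed by "-@@-word" segments,
# so "abc-@@-123" stays one token. Shown with re.findall, which is
# exactly how CountVectorizer applies token_pattern to each document.
pattern = r"(?u)\b\w+(?:-@@-\w+)*\b"
corpus = ["abc-@@-123", "cde-@@-true", "jhg-@@-hud"]
print([re.findall(pattern, doc) for doc in corpus])
```

Alternatively, `CountVectorizer(tokenizer=...)` accepts an arbitrary callable when regexes get unwieldy.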
2
votes
2 answers

How do we adapt LLM token embeddings with a custom vocab?

Hi, I'm just getting started with understanding transformer-based models and I am not able to find how the token embeddings are arrived at. There are multiple tokenization approaches and multiple vocabularies/documents LLMs are trained on, so my…
dasman
  • 121
  • 1
  • 3
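When new tokens are added to a model's tokenizer, the embedding matrix must grow by one row per new token; a common initialisation for the new rows is the mean of the existing ones, after which fine-tuning adjusts them. A dependency-free sketch with made-up numbers (with Hugging Face transformers, `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` performs this resize for you):

```python
# Toy embedding matrix: one row per token id, 2 dimensions per row.
old_embeddings = [
    [0.2, 0.4],   # token id 0
    [0.6, 0.0],   # token id 1
    [0.1, 0.8],   # token id 2
]

def add_tokens(embeddings, n_new):
    # Initialise each new row to the mean of the existing rows.
    dims = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings)
            for d in range(dims)]
    return embeddings + [list(mean) for _ in range(n_new)]

grown = add_tokens(old_embeddings, 2)
print(len(grown), grown[-1])
```

The original rows keep the representations learned during pretraining; only the appended rows start from scratch, which is why added-token embeddings typically need some fine-tuning before they are useful.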