Questions tagged [tokenization]
70 questions
9
votes
1 answer
What tokenizer does OpenAI's GPT-3 API use?
I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error.
The closest I got to an answer…
Herman Autore
- 93
- 1
- 3
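(GPT-3 uses a GPT-2-style byte-pair encoding; for exact counts OpenAI's tiktoken library is the usual answer. As a dependency-free sketch, the common "~4 characters per English token" rule of thumb can pre-screen prompts before an API call — the function names below are my own, and the estimate is deliberately rough.)

```python
def rough_token_estimate(prompt: str) -> int:
    """Very rough token-count estimate for English text.

    OpenAI's guidance is that one token is ~4 characters of English on
    average; for exact counts use tiktoken (e.g.
    tiktoken.encoding_for_model(...)). This heuristic only pre-screens
    prompts before submitting an API call.
    """
    return max(1, len(prompt) // 4)

def fits_in_context(prompt: str, limit: int = 4096, reserve: int = 256) -> bool:
    """Check whether a prompt likely fits, keeping `reserve` tokens for the reply."""
    return rough_token_estimate(prompt) + reserve <= limit

print(rough_token_estimate("Hello, world!"))  # 13 chars -> 3
```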
9
votes
6 answers
NLP: What are some popular packages for multi-word tokenization?
I intend to tokenize a number of job description texts. I have tried standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
- 392
- 1
- 4
- 15
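(NLTK ships `nltk.tokenize.MWETokenizer` for exactly this. A minimal re-implementation of the idea — greedy longest-match merging of known multi-word expressions after a whitespace split — in plain Python, as a sketch of what that tokenizer does:)

```python
def mwe_tokenize(text, mwes):
    """Whitespace-tokenize, then greedily merge known multi-word expressions.

    `mwes` is an iterable of tuples such as ("machine", "learning").
    This mirrors what nltk.tokenize.MWETokenizer does, joining matched
    spans with underscores.
    """
    mwe_set = {tuple(m) for m in mwes}
    longest = max((len(m) for m in mwe_set), default=1)
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_set:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(mwe_tokenize("senior machine learning engineer",
                   [("machine", "learning")]))
# ['senior', 'machine_learning', 'engineer']
```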
8
votes
1 answer
Understanding the effect of num_words of Tokenizer in Keras
Consider the following code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(texts)
print('Found %d unique words.' % len(tokenizer.word_index))
When I run this, it prints:
Found 88582…
Mehran
- 277
- 1
- 2
- 13
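(The short answer to this one: `fit_on_texts` builds `word_index` over the *entire* vocabulary regardless of `num_words`, which is why all 88582 words are printed; `num_words` only takes effect when converting, in `texts_to_sequences`/`texts_to_matrix`, where indices >= `num_words` are dropped. A pure-Python sketch of that behavior — not Keras itself, just an illustration of the semantics:)

```python
from collections import Counter

def fit_word_index(texts):
    """Mimic Tokenizer.fit_on_texts: rank ALL words by frequency, ids from 1.

    Keras builds word_index over the full vocabulary regardless of
    num_words, which is why len(word_index) reports every unique word.
    """
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index, num_words=None):
    """num_words is applied HERE: indices >= num_words are silently dropped,
    so effectively only the (num_words - 1) most frequent words survive."""
    seqs = []
    for t in texts:
        ids = [word_index[w] for w in t.lower().split()]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        seqs.append(ids)
    return seqs

texts = ["the cat sat", "the cat ran", "the dog"]
wi = fit_word_index(texts)
print(len(wi))                                 # 5 -- all words indexed
print(texts_to_sequences(texts, wi, num_words=3))
# [[1, 2], [1, 2], [1]] -- rare words filtered only at conversion time
```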
5
votes
1 answer
Unigram tokenizer: how does it work?
I have been trying to understand how the unigram tokenizer works, since it is used in the SentencePiece tokenizer that I am planning on using, but I cannot wrap my head around it.
I tried to read the original paper, which contains so little detail…
Johncowk
- 205
- 3
- 6
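(The core of the unigram model, as used by SentencePiece: a segmentation is scored by the sum of the log-probabilities of its pieces, and decoding picks the best-scoring one with a Viterbi search. A minimal sketch over a toy hand-assigned vocabulary — the real algorithm additionally *learns* the piece probabilities with EM and prunes the vocabulary, which is omitted here:)

```python
import math

def unigram_segment(word, piece_logprob):
    """Viterbi search for the most probable segmentation under a unigram LM.

    best[i] holds the best score of any segmentation of word[:i] plus a
    backpointer to where its last piece starts.
    """
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_logprob and best[start][0] > -math.inf:
                score = best[start][0] + piece_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Recover the pieces by walking the backpointers.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(word[start:pos])
        pos = start
    return pieces[::-1]

vocab = {"un": -2.0, "happy": -3.0, "unhappy": -6.0, "u": -4.0, "n": -4.0}
print(unigram_segment("unhappy", vocab))  # ['un', 'happy'] (-5.0 beats -6.0)
```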
5
votes
2 answers
Converting paragraphs into sentences
I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation. I used spaCy's Sentencizer to begin with.
Sample input python list abstracts:
["A total of 2337 articles were found, and, according…
Van Peer
- 285
- 1
- 4
- 12
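(In spaCy 3 the rule-based splitter is added with `nlp.add_pipe("sentencizer")`. As a dependency-free sketch of the same idea, the regex below splits after sentence-final punctuation while protecting a few abbreviations — the abbreviation list is my own and deliberately tiny; real tools use far richer rules or learned models:)

```python
import re

# Tiny, illustrative abbreviation guard; spaCy's Sentencizer and NLTK's
# punkt handle many more cases (quotes, ellipses, learned abbreviations).
_ABBREV = re.compile(r"\b(?:e\.g|i\.e|et al|Dr|Mr|Ms)\.$")

def split_sentences(paragraph: str):
    """Split on '.', '!' or '?' followed by whitespace and a capital letter,
    skipping boundaries that end in a known abbreviation."""
    parts, start = [], 0
    for m in re.finditer(r"[.!?](?=\s+[A-Z\"(])", paragraph):
        end = m.end()
        if _ABBREV.search(paragraph[start:end]):
            continue
        parts.append(paragraph[start:end].strip())
        start = end
    tail = paragraph[start:].strip()
    if tail:
        parts.append(tail)
    return parts

text = "A total of 2337 articles were found. Dr. Smith reviewed them. Results follow."
print(split_sentences(text))
# ['A total of 2337 articles were found.', 'Dr. Smith reviewed them.', 'Results follow.']
```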
5
votes
1 answer
Accuracy of word and sent tokenize versus custom tokenizers in nltk
The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
- 111
- 8
4
votes
1 answer
NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?
I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
- 399
- 1
- 5
- 13
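(The main advantage is a closed output vocabulary with no out-of-vocabulary tokens: an unseen word decomposes into known pieces instead of collapsing to a single `<unk>`. A toy greedy longest-match-first split in the style of BERT's WordPiece — the five-entry vocabulary is my own invention for illustration:)

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece.

    A word-level tokenizer maps an unseen word to one <unk>; a subword
    tokenizer backs off to smaller known pieces, so the model keeps some
    signal from the word's parts and the vocabulary stays small and fixed.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # continuation-of-word marker
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["<unk>"]                  # no piece matched at all
        start = end
    return pieces

vocab = {"token", "##ization", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unrelated", vocab))     # ['un', '##related']
```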
4
votes
2 answers
ChatGPT: How to use long texts in prompt?
I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT.
I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
- 63
- 1
- 1
- 5
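(Sites like chatpdf.com typically use retrieval-augmented generation: split the document into overlapping chunks, embed each chunk, and at question time paste only the most relevant chunks into the prompt. A sketch of the chunking step — the window and overlap sizes below are arbitrary choices of mine:)

```python
def chunk_text(text, chunk_chars=1000, overlap=200):
    """Split long text into overlapping character windows.

    In a retrieval-augmented setup each chunk is embedded (e.g. via an
    embeddings API), stored in a vector index, and at question time the
    top-k chunks most similar to the question go into the prompt -- which
    is how a long PDF fits a limited context window.
    """
    if overlap >= chunk_chars:
        raise ValueError("overlap must be smaller than chunk_chars")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

doc = "x" * 2500
pieces = chunk_text(doc)
print(len(pieces), [len(p) for p in pieces])  # 4 [1000, 1000, 900, 100]
```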
3
votes
1 answer
How do I get word embeddings for out-of-vocabulary words using a transformer model?
When I tried to get word embeddings of a sentence using bio_clinical BERT, for a sentence of 8 words I got 11 token ids (plus start and end), because "embeddings" is an out-of-vocabulary word/token that is split into em, bed, ding, s.
I would…
cerofrais
- 131
- 4
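(The usual fix: run the sentence through the model, find the output positions belonging to the word — for example via the fast tokenizer's `word_ids()` — and mean-pool those subword vectors into one word vector. A sketch with toy 3-d vectors standing in for hidden states, using the "embeddings" → em, ##bed, ##ding, ##s split from the question:)

```python
def mean_pool(vectors):
    """Average equal-length subword vectors into one word vector.

    With a real model the inputs would be the transformer's last hidden
    states at the word's subword positions; toy vectors are used here so
    the pooling itself is the only thing demonstrated.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# "embeddings" -> em, ##bed, ##ding, ##s (four subword vectors)
subword_vecs = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
]
print(mean_pool(subword_vecs))  # [1.0, 1.0, 1.0]
```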
3
votes
1 answer
What is the difference between TextVectorization and Tokenizer?
What is the difference between layers.TextVectorization() and
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
And when should each be used?
Pritam Sinha
- 193
- 1
- 9
2
votes
1 answer
From where does BERT get the tokens it predicts?
When BERT is used for masked language modeling, it masks a token and then tries to predict it.
What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
- 223
- 1
- 3
- 10
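(BERT does not regress to an integer: the MLM head outputs one logit per entry of its ~30k WordPiece vocabulary, a softmax turns those into a distribution, and the candidate set is exactly that vocabulary. A sketch of the final step with a four-word toy vocabulary and hand-picked logits:)

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and MLM-head logits for one [MASK] position; in real
# BERT the logits come from a linear layer over the final hidden state
# and cover the full WordPiece vocabulary.
vocab = ["cat", "dog", "mat", "the"]
logits = [2.0, 1.0, 3.5, 0.5]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
print(prediction)  # 'mat' -- the highest-probability vocabulary entry
```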
2
votes
1 answer
Tokenization of data in dataframe in python
I am performing tokenization on each row in my dataframe, but the tokenization is being done only for the first row. Can someone please help me? Thank you.
Below is my code:
import pandas as pd
import json
import…
Nedisha
- 45
- 1
- 2
- 7
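(The usual cause of "only the first row" is tokenizing a single cell such as `df["text"][0]` instead of the whole column. The row-wise idiom is `Series.apply`, which runs the tokenizer on every row — a minimal sketch with a whitespace split standing in for the real tokenizer:)

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello world",
                            "tokenize every row",
                            "not just the first"]})

# apply() runs the function on EVERY element of the column; a common bug
# is tokenizing df["text"][0] (one cell) inside a loop that never advances.
df["tokens"] = df["text"].apply(str.split)
print(df["tokens"].tolist())
# [['hello', 'world'], ['tokenize', 'every', 'row'], ['not', 'just', 'the', 'first']]
```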
2
votes
1 answer
NLP: What are some popular packages for phrase tokenization?
I'm trying to tokenize some sentences into phrases. For instance, given
I think you're cute and I want to know more about you
The tokens can be something like
I think you're cute
and
I want to know more about you
Similarly, given input
Today…
John M.
- 293
- 2
- 3
- 8
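(A lightweight baseline for this: split clauses on commas, semicolons, and coordinating conjunctions with a regex. Parser-based chunking — spaCy's `noun_chunks` or a constituency parser — handles nesting and ambiguity far better; this sketch only shows the crude version:)

```python
import re

def phrase_split(sentence):
    """Split a sentence into clause-like phrases on commas, semicolons,
    and the coordinating conjunctions 'and' / 'but' / 'or'.

    A crude regex baseline only; it will wrongly split lists like
    "fish and chips", which is where parser-based methods win.
    """
    parts = re.split(r"\s*(?:,|;|\band\b|\bbut\b|\bor\b)\s*", sentence)
    return [p for p in parts if p]

print(phrase_split("I think you're cute and I want to know more about you"))
# ["I think you're cute", 'I want to know more about you']
```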
2
votes
1 answer
How to customize word division in CountVectorizer?
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy
>>> import pandas
>>> vectorizer = CountVectorizer()
>>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
>>> xtrain = vectorizer.fit_transform(corpus1)
>>>…
helloworld
- 23
- 1
- 3
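(CountVectorizer's word division is controlled by its `token_pattern` parameter, whose default `r"(?u)\b\w\w+\b"` keeps only runs of 2+ word characters and therefore breaks `abc-@@-123` apart; passing e.g. `CountVectorizer(token_pattern=r"\S+")`, or a `tokenizer` callable, changes the division. A plain-regex demo of the two patterns, without sklearn:)

```python
import re

corpus1 = ["abc-@@-123", "cde-@@-true", "jhg-@@-hud"]

# The default CountVectorizer pattern keeps only runs of 2+ word
# characters, so 'abc-@@-123' splits into ['abc', '123'].
default_pattern = r"(?u)\b\w\w+\b"
print([re.findall(default_pattern, doc) for doc in corpus1])
# [['abc', '123'], ['cde', 'true'], ['jhg', 'hud']]

# token_pattern=r"\S+" (the equivalent regex here) keeps each
# whitespace-delimited string whole instead.
custom_pattern = r"\S+"
print([re.findall(custom_pattern, doc) for doc in corpus1])
# [['abc-@@-123'], ['cde-@@-true'], ['jhg-@@-hud']]
```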
2
votes
2 answers
How do we adapt LLM token embeddings with a custom vocab?
Hi, I'm just getting started with understanding transformer-based models, and I am not able to find how the token embeddings are arrived at. There are multiple tokenization approaches and multiple vocabularies/documents LLMs are trained on, so my…
dasman
- 121
- 1
- 3
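(The short answer: token embeddings are just rows of a learned matrix indexed by token id. The tokenizer only fixes the token-to-id mapping; training determines the row values. Adapting to a custom vocab means adding ids and resizing the matrix — in Hugging Face transformers, `tokenizer.add_tokens(...)` plus `model.resize_token_embeddings(...)`. A pure-Python sketch of the lookup, with random values standing in for learned parameters:)

```python
import random

random.seed(0)

# An embedding table is a (vocab_size x dim) matrix of learned
# parameters; token id i selects row i.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
dim = 4
table = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

def embed(tokens, vocab, table):
    """Map tokens to ids via the vocab, then look up the matrix rows;
    unknown tokens fall back to the <unk> row."""
    return [table[vocab.get(t, vocab["<unk>"])] for t in tokens]

vecs = embed(["hello", "world", "custom"], vocab, table)
print(len(vecs), len(vecs[0]))  # 3 4

# Adding a custom token = appending a new (initially random) row -- the
# analogue of add_tokens + resize_token_embeddings, after which the new
# row is trained like any other parameter.
vocab["custom"] = len(vocab)
table.append([random.uniform(-0.1, 0.1) for _ in range(dim)])
```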