Questions tagged [tokenization]

70 questions
9
votes
1 answer

What tokenizer does OpenAI's GPT-3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…
Herman Autore
  • 93
  • 1
  • 3
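One quick way to estimate a prompt's token count before submitting a call is the rule of thumb (stated in OpenAI's own docs) that English text averages roughly 4 characters per token; the exact count comes from OpenAI's `tiktoken` library (e.g. `tiktoken.get_encoding("r50k_base")` for GPT-3-era models). A minimal heuristic sketch, dependency-free:

```python
def rough_token_count(text: str) -> int:
    # Heuristic only: GPT-style BPE averages roughly 4 characters per
    # token for English text. Use tiktoken for the real count.
    return max(1, round(len(text) / 4))

prompt = "Translate the following sentence into French."
print(rough_token_count(prompt))
```

This is good enough for a pre-flight sanity check against the model's context limit, but not for exact billing or hard cutoffs.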
9
votes
6 answers

NLP: What are some popular packages for multi-word tokenization?

I intend to tokenize a number of job description texts. I have tried the standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
  • 392
  • 1
  • 4
  • 15
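Popular answers to this question point at NLTK's `MWETokenizer` and gensim's `Phrases`, both of which merge known multi-word expressions back into single tokens after a whitespace split. A minimal sketch of the idea, with a made-up phrase list:

```python
def mwe_tokenize(text, phrases):
    # phrases: set of multi-word expressions, each a tuple of words.
    # Greedily merge the longest matching word run into one token.
    words = text.split()
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrases:
                out.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

phrases = {("machine", "learning"), ("new", "york", "city")}
print(mwe_tokenize("senior machine learning engineer in new york city", phrases))
```

NLTK's `MWETokenizer` works the same way from a user-supplied phrase list, while gensim's `Phrases` learns the list statistically from a corpus.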
8
votes
1 answer

Understanding the effect of num_words of Tokenizer in Keras

Consider the following code: from keras.preprocessing.text import Tokenizer tokenizer = Tokenizer(num_words = 5000) tokenizer.fit_on_texts(texts) print('Found %d unique words.' % len(tokenizer.word_index)) When I run this, it prints: Found 88582…
Mehran
  • 277
  • 1
  • 2
  • 13
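The behaviour that surprises people here is that `fit_on_texts` indexes every word it sees regardless of `num_words` (which is why `word_index` can report 88582 entries), while `num_words` only caps which indices `texts_to_sequences` actually emits. A toy re-implementation of that documented behaviour, with made-up data:

```python
from collections import Counter

class ToyTokenizer:
    # Minimal sketch of the Keras Tokenizer behaviour: fit_on_texts
    # indexes EVERY word, while num_words only limits the indices that
    # texts_to_sequences emits (it keeps the top num_words - 1 words).
    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # Index 1 is the most frequent word, 2 the next, and so on.
        for i, (w, _) in enumerate(counts.most_common(), start=1):
            self.word_index[w] = i

    def texts_to_sequences(self, texts):
        limit = self.num_words or len(self.word_index) + 1
        return [[self.word_index[w] for w in t.lower().split()
                 if self.word_index.get(w, limit) < limit] for t in texts]

texts = ["the cat sat", "the cat ran", "a dog ran"]
tok = ToyTokenizer(num_words=3)
tok.fit_on_texts(texts)
print(len(tok.word_index))            # all six unique words are indexed
print(tok.texts_to_sequences(texts))  # only indices below num_words survive
```

So `len(tokenizer.word_index)` is the full vocabulary size; `num_words` only takes effect when converting texts to sequences.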
5
votes
1 answer

Unigram tokenizer: how does it work?

I have been trying to understand how the unigram tokenizer works, since it is used in the SentencePiece tokenizer that I am planning on using, but I cannot wrap my head around it. I tried to read the original paper, which contains so few details…
Johncowk
  • 205
  • 3
  • 6
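At inference time, the unigram model (Kudo, 2018; one of the two algorithms SentencePiece implements) assigns each vocabulary piece a probability and picks the segmentation that maximises the product of piece probabilities, found by Viterbi search. A sketch of just that inference step, with a made-up vocabulary (training, which prunes the vocabulary via EM, is a separate story):

```python
import math

def unigram_tokenize(word, vocab):
    # vocab maps subword piece -> probability. Viterbi search for the
    # segmentation maximising the product of piece probabilities
    # (equivalently, the sum of log-probabilities).
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the word.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

vocab = {"un": 0.1, "happy": 0.05, "u": 0.02, "n": 0.02, "h": 0.01,
         "appy": 0.01, "unhappy": 0.001}
print(unigram_tokenize("unhappy", vocab))
```

Here "un" + "happy" wins because log(0.1) + log(0.05) beats the single rare piece "unhappy".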
5
votes
2 answers

Converting paragraphs into sentences

I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation. I used spaCy's Sentencizer to begin with. Sample input Python list abstracts: ["A total of 2337 articles were found, and, according…
Van Peer
  • 285
  • 1
  • 4
  • 12
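spaCy's Sentencizer is itself a rule-based splitter over punctuation. To show what such a component is doing under the hood, here is a minimal hand-rolled sketch (the abbreviation list and regex are illustrative only; real tools handle far more cases):

```python
import re

# Break after ., ! or ? that is followed by whitespace and an uppercase
# letter, while protecting a few common abbreviations.
ABBREV = re.compile(r"\b(e\.g|i\.e|et al|Dr|Mr|Ms)\.$")

def split_sentences(paragraph):
    parts, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s+[A-Z])", paragraph):
        candidate = paragraph[start:m.end()].strip()
        if ABBREV.search(candidate):
            continue  # the period belongs to an abbreviation
        parts.append(candidate)
        start = m.end()
    tail = paragraph[start:].strip()
    if tail:
        parts.append(tail)
    return parts

text = "A total of 2337 articles were found. Duplicates were removed. The rest were screened."
print(split_sentences(text))
```

For production use, spaCy's Sentencizer or NLTK's punkt model are better choices; this sketch only illustrates why sentence splitting is harder than splitting on periods.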
5
votes
1 answer

Accuracy of word and sent tokenize versus custom tokenizers in nltk

The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
  • 111
  • 8
4
votes
1 answer

NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?

I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
  • 399
  • 1
  • 5
  • 13
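The usual answer: a subword tokenizer keeps the vocabulary small and bounded while still covering unseen words (no UNK token needed), because any word can be decomposed into known pieces. A minimal sketch of BPE-style inference, with a hand-made merge list (real merge lists are learned from corpus statistics):

```python
def bpe_tokenize(word, merges):
    # merges: ordered list of symbol pairs learned at training time.
    # Start from characters and repeatedly apply the highest-priority
    # merge present in the current symbol sequence.
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe_tokenize("lower", merges))   # a word covered by the merges
print(bpe_tokenize("lowest", merges))  # an unseen word still gets covered
```

A word-level tokenizer would have to map "lowest" to UNK if it was absent from training; the subword tokenizer falls back to "low" plus character pieces instead, which is exactly the advantage the tutorial exploits.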
4
votes
2 answers

ChatGPT: How to use long texts in prompt?

I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT. I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
  • 63
  • 1
  • 1
  • 5
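The pattern behind chatpdf-style apps is usually retrieval-augmented generation: split the long document into overlapping chunks that each fit the token budget, embed the chunks, retrieve the few most relevant ones for a user question, and send only those in the prompt. A sketch of the chunking step (whitespace words stand in for real tokens; with the OpenAI API you would measure chunks with a tokenizer such as tiktoken):

```python
def chunk_text(text, max_tokens=500, overlap=50):
    # Overlapping windows so no sentence is cut off without context on
    # either side. Assumes overlap < max_tokens.
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, max_tokens=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

The embedding and retrieval steps are separate concerns; the point here is only that the model never needs to see the whole PDF at once.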
3
votes
1 answer

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using bio_clinical BERT, for a sentence of 8 words I am getting 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that is being split into em, bed, ding, s. I would…
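A common way to get one vector for a word the tokenizer split into pieces is to pool (typically average) the model's hidden states at those subtoken positions; taking only the first subtoken's vector is another frequent choice. A sketch of the pooling step with made-up two-dimensional vectors standing in for the model's hidden states:

```python
# Made-up piece vectors for illustration; in practice these would be the
# transformer's last-layer hidden states at the subtoken positions.
piece_vectors = {
    "em":   [1.0, 0.0],
    "bed":  [0.0, 1.0],
    "ding": [1.0, 1.0],
    "s":    [0.0, 0.0],
}

def word_vector(pieces):
    # Mean-pool the subtoken vectors into a single word vector.
    dims = len(next(iter(piece_vectors.values())))
    total = [0.0] * dims
    for p in pieces:
        for d, v in enumerate(piece_vectors[p]):
            total[d] += v
    return [t / len(pieces) for t in total]

print(word_vector(["em", "bed", "ding", "s"]))
```

With Hugging Face tokenizers, the mapping from words to subtoken positions is available via the fast tokenizers' offset/word-id information, so you know which positions to pool.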
3
votes
1 answer

What is the difference between TextVectorization and Tokenizer?

What is the difference between layers.TextVectorization() and from tensorflow.keras.preprocessing.text import Tokenizer together with from tensorflow.keras.preprocessing.sequence import pad_sequences? And when to use which?
Pritam Sinha
  • 193
  • 1
  • 9
2
votes
1 answer

From where does BERT get the tokens it predicts?

When BERT is used for masked language modeling, it masks a token and then tries to predict it. What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
  • 223
  • 1
  • 3
  • 10
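The answer is neither regression nor a search: BERT's masked-language-modeling head outputs one score (logit) per entry of its fixed WordPiece vocabulary (30,522 tokens for bert-base), a softmax turns those scores into a distribution, and the predicted token is the argmax. A sketch with a tiny made-up vocabulary and logits:

```python
import math

# One logit per vocabulary entry at the masked position; the candidate
# set is exactly the tokenizer's fixed vocabulary.
vocab = ["[MASK]", "the", "cat", "sat", "mat"]
logits = [0.1, 1.2, 3.5, 0.3, 2.0]

exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
best = max(range(len(vocab)), key=probs.__getitem__)
print(vocab[best])
```

So the model can only ever predict tokens that exist in its vocabulary; there is no regression to an integer ID and no lookup outside the vocabulary.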
2
votes
1 answer

Tokenization of data in a dataframe in Python

I am performing tokenization on each row in my dataframe, but the tokenization is being done for only the first row. Can someone please help me? Thank you. Below is my code: import pandas as pd import json import…
Nedisha
  • 45
  • 1
  • 2
  • 7
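The usual cause of "only the first row gets tokenized" is operating on a single cell instead of the whole column. Applying the tokenizer column-wise handles every row; a sketch (the column name "text" and whitespace tokenization are assumptions for illustration):

```python
import pandas as pd

# Apply the tokenizer to the whole column with .apply (or .str.split),
# not to df["text"][0], which only touches the first row.
df = pd.DataFrame({"text": ["first row here", "second row", "third"]})
df["tokens"] = df["text"].apply(str.split)
print(df["tokens"].tolist())
```

Any callable tokenizer (e.g. one from NLTK or spaCy) can replace `str.split` in the `.apply` call.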
2
votes
1 answer

NLP: What are some popular packages for phrase tokenization?

I'm trying to tokenize some sentences into phrases. For instance, given "I think you're cute and I want to know more about you", the tokens can be something like "I think you're cute" and "I want to know more about you". Similarly, given input Today…
John M.
  • 293
  • 2
  • 3
  • 8
2
votes
1 answer

How to customize word division in CountVectorizer?

>>> from sklearn.feature_extraction.text import CountVectorizer >>> import numpy >>> import pandas >>> vectorizer = CountVectorizer() >>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud'] >>> xtrain = vectorizer.fit_transform(corpus1) >>>…
helloworld
  • 23
  • 1
  • 3
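CountVectorizer's default `token_pattern`, `r"(?u)\b\w\w+\b"`, splits "abc-@@-123" into "abc" and "123" because `-` and `@` are not word characters. Supplying a custom pattern keeps each unit whole; the regex below is an illustrative sketch that can be passed as `CountVectorizer(token_pattern=...)`:

```python
import re

# Custom pattern: a word optionally followed by "-@@-word" segments,
# so "abc-@@-123" stays one token. Shown with re.findall, which is
# exactly how CountVectorizer applies token_pattern to each document.
pattern = r"(?u)\b\w+(?:-@@-\w+)*\b"
corpus = ["abc-@@-123", "cde-@@-true", "jhg-@@-hud"]
print([re.findall(pattern, doc) for doc in corpus])
```

Alternatively, `CountVectorizer(tokenizer=...)` accepts an arbitrary callable when regexes get unwieldy.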
2
votes
2 answers

How do we adapt LLM token embeddings with a custom vocab?

Hi, I'm just getting started with understanding transformer-based models and I am not able to find how the token embeddings are arrived at. There are multiple tokenization approaches and multiple vocabularies/documents LLMs are trained on, so my…
dasman
  • 121
  • 1
  • 3
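When new tokens are added to a model's tokenizer, the embedding matrix must grow by one row per new token; a common initialisation for the new rows is the mean of the existing ones, after which fine-tuning adjusts them. A dependency-free sketch with made-up numbers (with Hugging Face transformers, `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` performs this resize for you):

```python
# Toy embedding matrix: one row per token id, 2 dimensions per row.
old_embeddings = [
    [0.2, 0.4],   # token id 0
    [0.6, 0.0],   # token id 1
    [0.1, 0.8],   # token id 2
]

def add_tokens(embeddings, n_new):
    # Initialise each new row to the mean of the existing rows.
    dims = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings)
            for d in range(dims)]
    return embeddings + [list(mean) for _ in range(n_new)]

grown = add_tokens(old_embeddings, 2)
print(len(grown), grown[-1])
```

The original rows keep the representations learned during pretraining; only the appended rows start from scratch, which is why added-token embeddings typically need some fine-tuning before they are useful.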