Highest Voted 'similar-documents' Questions - Data Science Stack Exchange

10

votes

1 answer

What is the difference between fasttext and DANs in document classification?

I came across two interesting papers that describe promising approaches for document classification using word embeddings. 1. The fasttext algorithm, described in the paper Bag of Tricks for Efficient Text Classification here. (With further…

asked Apr 01 '17 at 19:48

user1043144

201
1
3

9

votes

2 answers

Text similarity with sentence embeddings

I'm trying to calculate similarity between texts of various lengths. My current approach is as follows: using Universal Sentence Encoder, I convert text to a set of vectors. I average these vectors to create the final feature vector. I compare…

word-embeddings similarity similar-documents

asked Sep 19 '19 at 20:04

Kertis van Kertis

143
1
6
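
The approach described in the excerpt above (average the per-sentence vectors, then compare the averages) can be sketched roughly as follows. This is only an illustration: the 3-dimensional vectors are made-up stand-ins for real Universal Sentence Encoder output, and `mean_pool` and `cosine_similarity` are hypothetical helper names, not part of any library.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sentence vectors (assumption): text_a has 2 sentences, text_b has 3.
text_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_b = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]]

# Texts of different lengths are reduced to one vector each before comparing.
score = cosine_similarity(mean_pool(text_a), mean_pool(text_b))
print(score)
```

Because cosine similarity ignores vector length, the two averaged vectors here compare as near-identical even though the texts contain different numbers of sentences.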

6

votes

1 answer

How to compute document similarities for source code?

I am trying to estimate the probability of common authorship (person, company) for different kinds of source code texts (web pages, program code). My first idea is to apply the usual NLP tools, such as any token-based document representation (TF-IDF or…

machine-learning nlp text-mining similar-documents

asked Feb 21 '18 at 09:09

Hendrik

8,767
17
43
55

6

votes

1 answer

Can I use Euclidean distance for Latent Dirichlet Allocation document similarity?

I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus with $M$ documents. Due to my hyperparameter configuration, the output topic distribution for each document is heavily concentrated on only 3-6 topics and all the…

nlp lda distance similar-documents

asked Nov 17 '17 at 12:04

PyRsquared

1,666
1
12
18
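
For the question above, the following sketch compares Euclidean distance with Hellinger distance on sparse topic distributions. Hellinger distance is not mentioned in the excerpt; it is included here as an assumption, as one distance commonly applied to probability distributions. The 4-topic distributions are invented for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    The result lies in [0, 1]: 0 for identical distributions and 1 when
    the distributions share no topic with non-zero probability.
    """
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Invented document-topic distributions (assumption), each summing to 1.
doc_a = [0.5, 0.5, 0.0, 0.0]
doc_b = [0.0, 0.0, 0.5, 0.5]  # no topics in common with doc_a
doc_c = [0.5, 0.5, 0.0, 0.0]  # same distribution as doc_a

print(euclidean(doc_a, doc_b), hellinger(doc_a, doc_b))
print(euclidean(doc_a, doc_c), hellinger(doc_a, doc_c))
```

Both functions are valid distances on these vectors; the difference is that Hellinger distance has a fixed upper bound of 1 for fully disjoint topic sets, while the Euclidean value for disjoint documents depends on how the probability mass is spread.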

5

votes

1 answer

TS-SS and Cosine similarity among text documents using TF-IDF in Python

A common way of calculating the cosine similarity between text-based documents is to compute TF-IDF and then compute the linear kernel of the TF-IDF matrix. The TF-IDF matrix is calculated using TfidfVectorizer(). from…

scikit-learn recommender-system information-retrieval tfidf similar-documents

asked Oct 23 '19 at 23:30

kgkmeekg

153
6
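
The reason a linear kernel (plain dot products) yields cosine similarity in the question above is that the TF-IDF rows are L2-normalized, which is the default in scikit-learn's TfidfVectorizer. The sketch below shows the idea with the standard library only. The weighting `count * (log(N / df) + 1)` is a simplified assumption and will not reproduce scikit-learn's numbers exactly, and `tfidf_rows` and `dot_product_kernel` are hypothetical names.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one L2-normalized TF-IDF row per document (simplified weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return rows


def dot_product_kernel(rows):
    """Pairwise dot products; equals cosine similarity for unit-length rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
sims = dot_product_kernel(tfidf_rows(docs))
for row in sims:
    print([round(v, 3) for v in row])
```

The diagonal comes out as 1 (each document against itself), and documents that share no tokens score 0.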

5

votes

1 answer

Using Spark to find users similar to a given user?

I read about https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html but couldn't find a Spark library that implements it. I have a columnar string dataset with data for around 15-20 million users with their…

apache-spark apache-mahout similar-documents

asked Jul 04 '17 at 12:35

Nikhil Verma

191
1
1
9

4

votes

2 answers

Advice on data wrangling for a big set of docx files

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a solid week taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…

python similar-documents data-wrangling

asked Jun 29 '19 at 11:16

mess1n

41
1

4

votes

3 answers

How to measure the similarity between two text documents?

Assume I have 100 text documents and I want to cluster them. The first step is to construct a 100×100 pairwise similarity matrix for the documents. My question is: what are common ways to measure similarity between two documents?

machine-learning deep-learning text-mining similarity similar-documents

asked Apr 14 '19 at 17:20

jason

329
2
4
9

4

votes

2 answers

Automatic code checking

I have some experience in machine learning, mainly clustering and classifiers. However, I am somewhat of a newbie when it comes to NLP. That said, I am aware of the various issues and difficulties involved in processing natural language, e.g.…

machine-learning r nlp word2vec similar-documents

asked Apr 27 '18 at 11:16

user2948208

41
1

3

votes

1 answer

Training Doc2Vec and Word2Vec at the same time

As far as I can tell, the typical Doc2Vec implementation (e.g. Gensim) first trains the word vectors and afterwards the document vectors, where the word vectors are fixed. If my goal is that conceptually similar vectors (regardless of whether they…

machine-learning deep-learning word2vec word-embeddings similar-documents

asked Feb 08 '18 at 17:19

Markus RH

31
1
4

3

votes

3 answers

Which algorithm does Doc2Vec use?

Word2vec is not a single algorithm but a combination of two, namely the CBOW and Skip-Gram models. Is Doc2Vec also a combination of such algorithms, or is it an algorithm in itself?

python nlp word2vec gensim similar-documents

asked Jul 10 '17 at 07:27

Kshitiz

289
1
2
12

3

votes

2 answers

Gensim doc2vec error: KeyError: "word 'senseless' not in vocabulary"

I am new to machine learning and tried doc2vec on the Quora duplicate dataset. new_dfx has columns 'question1' and 'question2', which hold preprocessed questions in each row. The following is the tagged document sample: input: q_arr =…

nlp word-embeddings gensim similar-documents doc2vec

asked Jan 13 '23 at 12:02

Ankit Rohilla

31
2

2

votes

0 answers

Preprocessing for Document Similarity Using Doc2Vec

I'm trying to determine document similarity using Doc2Vec on a large series of legal opinions, which can contain some highly jargonistic language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has any thoughts about the criteria…

similar-documents doc2vec

asked Jun 01 '21 at 19:18

user118648

21
1

2

votes

0 answers

Unsupervised document similarity state of the art

I have a set of N documents with lengths ranging from 0 to more than 20000 characters. I want to calculate a similarity score between 0 and 1 between all pairs of documents where a higher number indicates higher similarity. Assume below that…

unsupervised-learning similar-documents

asked Apr 06 '21 at 09:38

user7017793

21
2

2

votes

3 answers

Fastest way for 1 vs all lookup on embeddings

I have a dataset with about 1 000 000 texts where I have computed their sentence embeddings with a language model and stored them in a numpy array. I wish to compare a new unseen text to all the 1 000 000 pre-computed embeddings and perform cosine…

machine-learning bert embeddings cosine-distance similar-documents

asked Mar 15 '20 at 15:29

Isbister

193
1
10
