Questions tagged [similar-documents]
58 questions
10
votes
1 answer
What is the difference between fasttext and DANs in document classification?
I came across two interesting papers that describe promising approaches to document classification using word embeddings.
1. The fasttext algorithm
Described in the paper Bag of Tricks for Efficient Text Classification here.
(With further…
user1043144
- 201
- 1
- 3
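Both approaches the question compares start from the same representation: the unweighted average of the document's word embeddings (fastText then feeds it straight to a linear classifier, while a DAN adds hidden layers on top). A minimal sketch of that shared first step, with made-up 3-dimensional embeddings:

```python
# Shared first step of fastText and DANs: represent a document as the
# average of its word embeddings. The tiny 3-d vectors below are
# invented purely for illustration.

embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.7, 0.0, 0.2],
}

def doc_vector(tokens, emb, dim=3):
    """Average the embeddings of the tokens we know; zeros if none are known."""
    known = [emb[t] for t in tokens if t in emb]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

vec = doc_vector(["good", "great", "unknown"], embeddings)
```

From here the two methods diverge: fastText applies a single softmax layer to `vec`, a DAN passes it through one or more feed-forward layers first.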
9
votes
2 answers
Text similarity with sentence embeddings
I'm trying to calculate similarity between texts of various lengths. My current approach is the following:
Using Universal Sentence Encoder, I convert text to a set of vectors.
I average these vectors to create the final feature vector.
I compare…
Kertis van Kertis
- 143
- 1
- 6
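The pipeline this question describes (encode sentences, mean-pool into one feature vector, compare) can be sketched without the Universal Sentence Encoder itself; the sentence vectors below are stand-ins for USE outputs:

```python
import math

def average(vectors):
    """Mean-pool a list of sentence vectors into one document vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

# Stand-ins for sentence embeddings from an encoder such as USE.
doc_a = average([[1.0, 0.0], [0.0, 1.0]])   # mean-pooled to [0.5, 0.5]
doc_b = average([[2.0, 2.0]])

sim = cosine(doc_a, doc_b)   # parallel directions, so close to 1
```

One known caveat with mean-pooling is that long documents wash out: the average of many sentence vectors drifts toward the corpus mean, which is worth checking when the texts vary a lot in length, as they do in this question.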
6
votes
1 answer
How to compute document similarities in the case of source code?
I'm trying to detect the probability of common authorship (person or company) across different kinds of source code texts (webpages, program code). My first idea is to apply the usual NLP tools, like any token-based document representation (TF-IDF or…
Hendrik
- 8,767
- 17
- 43
- 55
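Besides token-based TF-IDF, a commonly suggested alternative for authorship signals in source code is character n-gram profiles, which survive identifier renaming. A small illustrative sketch (the two code snippets are made up):

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram counts; robust to renamed identifiers in code."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def profile_cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

# Same coding habits, different variable names:
a = ngram_profile("for (int i = 0; i < n; i++)")
b = ngram_profile("for (int j = 0; j < m; j++)")
sim = profile_cosine(a, b)   # stays high despite the renaming
```

A token-level TF-IDF representation would treat `i` and `j` as unrelated tokens; character trigrams keep most of the surrounding structure in common, which is one reason n-gram profiles appear often in code-authorship work.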
6
votes
1 answer
Can I use euclidean distance for Latent Dirichlet Allocation document similarity?
I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus with $M$ documents. Due to my hyperparameter configuration, the output topic distribution for each document is heavily concentrated on only 3-6 topics and all the…
PyRsquared
- 1,666
- 1
- 12
- 18
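Euclidean distance is computable here, but since LDA outputs points on the probability simplex, metrics designed for distributions, such as Hellinger distance or Jensen-Shannon divergence, are more commonly recommended. A sketch of Hellinger distance on made-up sparse topic mixtures like those the question describes:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.
    0 means identical, 1 means disjoint support; a proper metric
    on the probability simplex, unlike raw Euclidean intuition."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Invented sparse topic mixtures, concentrated on few topics
# as in the question.
doc1 = [0.70, 0.20, 0.05, 0.05]
doc2 = [0.65, 0.25, 0.05, 0.05]
doc3 = [0.05, 0.05, 0.20, 0.70]

near = hellinger(doc1, doc2)   # small: almost the same mixture
far  = hellinger(doc1, doc3)   # large: mass on different topics
```

With distributions this sparse, Euclidean distance will often rank pairs the same way, but Hellinger (or Jensen-Shannon) is bounded and better behaved when probabilities approach zero.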
5
votes
1 answer
TS-SS and Cosine similarity among text documents using TF-IDF in Python
A common way of calculating the cosine similarity between text-based documents is to compute TF-IDF and then calculate the linear kernel of the TF-IDF matrix.
The TF-IDF matrix is calculated using TfidfVectorizer().
from…
kgkmeekg
- 153
- 6
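The relationship the question relies on: TfidfVectorizer L2-normalizes its rows by default, so the linear kernel (plain dot product) of the TF-IDF matrix is exactly cosine similarity. A dependency-free sketch of that idea, using a simplified TF-IDF weighting (close in spirit to, but not byte-for-byte identical with, scikit-learn's):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "stocks fell on monday",
]

def tfidf_vectors(docs):
    """Simplified TF-IDF (raw tf * smoothed idf), L2-normalized per document."""
    tokenized = [d.split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def linear_kernel(u, v):
    # Rows are unit-length, so the dot product IS the cosine similarity --
    # which is why linear_kernel on a TF-IDF matrix equals cosine.
    return sum(a * b for a, b in zip(u, v))

vecs = tfidf_vectors(docs)
```

The two cat sentences score far higher against each other than against the stock headline, and each document scores 1.0 against itself.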
5
votes
1 answer
Using Spark for finding similar users to a user?
I read about
https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
but couldn't find a Spark library for this implementation.
I have a columnar string dataset of around 15-20 million users with their…
Nikhil Verma
- 191
- 1
- 1
- 9
4
votes
2 answers
Advice on data wrangling for a big set of docx files!
I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…
mess1n
- 41
- 1
4
votes
3 answers
How to measure the similarity between two text documents?
Assume I have 100 text documents, and I want to cluster them.
The first step is to construct a 100x100 pairwise similarity matrix for the documents.
My question is:
what are common ways to measure the similarity between two documents?
Thanks,
jason
- 329
- 2
- 4
- 9
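One common, easily interpretable answer is Jaccard similarity on token sets; cosine over TF-IDF vectors is the other standard choice. A sketch of the question's first step, building the pairwise matrix (3x3 here instead of 100x100) with Jaccard:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets: |A & B| / |A | B|."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

docs = ["apple banana cherry", "banana cherry date", "xylophone zebra"]

# Pairwise similarity matrix, as in the question's first clustering step.
matrix = [[jaccard(a, b) for b in docs] for a in docs]
```

The matrix is symmetric with ones on the diagonal, which is exactly what most clustering libraries expect (often as a distance matrix, i.e. `1 - similarity`).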
4
votes
2 answers
Automatic code checking
I have some experience in machine learning, mainly clustering and classifiers. However, I am somewhat of a newbie when it comes to NLP.
That said, I am aware of the various issues and difficulties involved in processing natural language, e.g.…
user2948208
- 41
- 1
3
votes
1 answer
Training Doc2Vec and Word2Vec at the same time
As far as I can tell, the typical Doc2Vec implementation (e.g. Gensim) first trains the word vectors and afterwards trains the document vectors while the word vectors are held fixed.
If my goal is that conceptually similar vectors (regardless of whether they…
Markus RH
- 31
- 1
- 4
3
votes
3 answers
Which algorithm Doc2Vec uses?
Word2vec is not a single algorithm but a combination of two, namely the CBOW and Skip-Gram models. Is Doc2Vec also a combination of such algorithms, or is it an algorithm in itself?
Kshitiz
- 289
- 1
- 2
- 12
3
votes
2 answers
Gensim doc2vec error: KeyError: "word 'senseless' not in vocabulary"
I am new to machine learning and tried Doc2Vec on the Quora duplicate-questions dataset. new_dfx has columns 'question1' and 'question2', which contain the preprocessed questions in each row. The following is a tagged document sample:
input:
q_arr =…
Ankit Rohilla
- 31
- 2
2
votes
0 answers
Preprocessing for Document Similarity Using Doc2Vec
I'm trying to determine document similarity using Doc2Vec on a large series of legal opinions, which can contain some highly jargonistic language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has any thoughts about the criteria…
user118648
- 21
- 1
2
votes
0 answers
Unsupervised document similarity state of the art
I have a set of N documents with lengths ranging from 0 to more than 20000 characters. I want to calculate a similarity score between 0 and 1 between all pairs of documents where a higher number indicates higher similarity. Assume below that…
user7017793
- 21
- 2
2
votes
3 answers
Fastest way for 1 vs all lookup on embeddings
I have a dataset of about 1,000,000 texts for which I have computed sentence embeddings with a language model and stored them in a NumPy array.
I wish to compare a new, unseen text to all 1,000,000 pre-computed embeddings and perform cosine…
Isbister
- 193
- 1
- 10
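For exact search, the standard trick is to L2-normalize the stored embeddings once, after which the 1-vs-all cosine scores are a single matrix-vector product; approximate-nearest-neighbour libraries such as FAISS or Annoy trade exactness for sub-linear lookups. A sketch with a small random stand-in corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 1,000,000 pre-computed embeddings (1000 x 64 here).
corpus = rng.normal(size=(1000, 64)).astype(np.float32)

# Normalize ONCE at index-build time; then cosine against every row
# is just one matrix-vector product.
corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

def top_k(query, corpus_unit, k=5):
    """Exact 1-vs-all cosine lookup: one dot product, one argpartition."""
    q = query / np.linalg.norm(query)
    sims = corpus_unit @ q                # (N,) cosine scores
    idx = np.argpartition(-sims, k)[:k]   # unsorted top-k in O(N)
    return idx[np.argsort(-sims[idx])]    # sort only the k winners

best = top_k(corpus[42], corpus_unit, k=5)
```

Using `argpartition` instead of a full `argsort` avoids an O(N log N) sort over all one million scores; only the k winners get sorted. Since the query here is itself row 42 of the corpus, that row comes back as the top hit.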