Questions tagged [similar-documents]

58 questions
10
votes
1 answer

What is the difference between fasttext and DANs in document classification?

I came across two interesting papers that describe promising approaches for document classification using word embeddings. 1. The fasttext algorithm, described in the paper Bag of Tricks for Efficient Text Classification here. (With further…
9
votes
2 answers

Text similarity with sentence embeddings

I'm trying to calculate similarity between texts of various lengths. My current approach is as follows: using Universal Sentence Encoder, I convert the text to a set of vectors. I average these vectors to create the final feature vector. I compare…
6
votes
1 answer

How to compute document similarities in case of source codes?

I am trying to estimate the probability of common authorship (person, company) for different kinds of source code texts (web pages, program code). My first idea is to apply the usual NLP tools, such as any token-based document representation (TF-IDF or…
Hendrik
6
votes
1 answer

Can I use euclidean distance for Latent Dirichlet Allocation document similarity?

I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus with $M$ documents. Due to my hyperparameter configuration, the output topic distribution for each document is heavily concentrated on only 3-6 topics and all the…
PyRsquared
5
votes
1 answer

TS-SS and Cosine similarity among text documents using TF-IDF in Python

A common way to calculate the cosine similarity between text-based documents is to compute TF-IDF and then take the linear kernel of the TF-IDF matrix. The TF-IDF matrix is calculated using TfidfVectorizer(). from…
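The approach described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the asker's code: it assumes scikit-learn is installed, and the toy `docs` list is hypothetical. It relies on `TfidfVectorizer` applying L2 normalisation to each row by default, which is why a plain linear kernel (dot product) gives cosine similarity here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "stock prices fell sharply today",
]

# TfidfVectorizer L2-normalises each row by default (norm="l2"),
# so the dot product between two rows equals their cosine similarity.
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise similarity matrix of shape (n_docs, n_docs).
similarity = linear_kernel(tfidf, tfidf)
```

If the vectorizer is created with `norm=None`, the linear kernel no longer equals cosine similarity, and `cosine_similarity` from the same module would be the safer call.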
5
votes
1 answer

Using Spark for finding similar users to a user?

I read about https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html but couldn't find a Spark library that implements it. I have a columnar string dataset with data on around 15-20 million users with their…
Nikhil Verma
4
votes
2 answers

Data wrangling for a big set of docx files advice!

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…
mess1n
4
votes
3 answers

How to measure the similarity between two text documents?

Assume I have 100 text documents and I want to cluster them. The first step is to construct a 100x100 pairwise similarity matrix for the documents. My question is: what are common ways to measure similarity between two documents?
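As a rough sketch of what building such a pairwise matrix could look like, the snippet below uses Jaccard similarity over whitespace tokens. Jaccard is only one common choice and is not named in the question; the helper names `jaccard` and `pairwise_similarity` and the toy documents are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pairwise_similarity(docs: list[str]) -> list[list[float]]:
    """Build the symmetric n x n similarity matrix for a list of documents."""
    n = len(docs)
    sim = [[1.0] * n for _ in range(n)]  # a document is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = jaccard(docs[i], docs[j])
    return sim


# Hypothetical toy documents.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell today",
]
matrix = pairwise_similarity(docs)
```

For 100 documents this computes 4,950 pairs, which is cheap; the same loop structure works with any other pairwise measure swapped in for `jaccard`.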
4
votes
2 answers

Automatic code checking

I have some experience in machine learning, mainly clustering and classifiers. However, I am somewhat of a newbie when it comes to NLP. That said, I am aware of the various issues and difficulties involved in processing natural language, e.g.…
3
votes
1 answer

Training Doc2Vec and Word2Vec at the same time

As far as I can tell, the typical Doc2Vec implementation (e.g. Gensim) first trains the word vectors and afterwards the document vectors, with the word vectors held fixed. If my goal is that conceptually similar vectors (regardless of whether they…
3
votes
3 answers

Which algorithm Doc2Vec uses?

Just as Word2vec is not a single algorithm but a combination of two, namely the CBOW and Skip-Gram models, is Doc2Vec also a combination of such algorithms? Or is it an algorithm in itself?
Kshitiz
3
votes
2 answers

Gensim doc2vec error: KeyError: "word 'senseless' not in vocabulary"

I am new to machine learning and tried doc2vec on the Quora duplicate dataset. new_dfx has columns 'question1' and 'question2', which hold preprocessed questions in each row. The following is the tagged document sample: input: q_arr =…
2
votes
0 answers

Preprocessing for Document Similarity Using Doc2Vec

I'm trying to determine document similarity using Doc2Vec on a large series of legal opinions, which can contain some highly jargonistic language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has any thoughts about the criteria…
user118648
2
votes
0 answers

Unsupervised document similarity state of the art

I have a set of N documents with lengths ranging from 0 to more than 20000 characters. I want to calculate a similarity score between 0 and 1 between all pairs of documents where a higher number indicates higher similarity. Assume below that…
2
votes
3 answers

Fastest way for 1 vs all lookup on embeddings

I have a dataset with about 1 000 000 texts where I have computed their sentence embeddings with a language model and stored them in a numpy array. I wish to compare a new unseen text to all the 1 000 000 pre-computed embeddings and perform cosine…
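One straightforward way to do this kind of one-vs-all lookup is sketched below, assuming NumPy and embeddings that fit in memory. It normalises the stored embeddings once, so each query costs a single matrix-vector product. The function name `top_k_cosine` and the random stand-in data are hypothetical, and this is a brute-force baseline rather than necessarily the fastest option.

```python
import numpy as np


def top_k_cosine(query, embeddings_normed, k=5):
    """Return indices and cosine scores of the k rows most similar to `query`.

    `embeddings_normed` must already have unit-length rows.
    """
    q = query / np.linalg.norm(query)
    scores = embeddings_normed @ q           # cosine similarity to every row
    k = min(k, scores.shape[0])
    # argpartition finds the k best candidates without sorting all scores.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]      # order only those k, best first
    return top, scores[top]


# Random stand-in for the precomputed sentence embeddings (hypothetical data).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))

# Normalise once up front, not on every query.
embeddings_normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of row 42.
query = embeddings[42] + 0.01 * rng.normal(size=64)
idx, scores = top_k_cosine(query, embeddings_normed, k=3)
```

Using `argpartition` before sorting keeps the selection step linear in the number of embeddings, which matters more as the collection grows toward a million rows.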