
I am new to topic modeling and text clustering and I am trying to learn more. I would like to use DBSCAN to cluster text data. There are many posts and sources on how to implement DBSCAN in Python, such as 1, 2, 3, but they are either too difficult for me to understand or not in Python.
I have a CSV file that contains a userID and the message each user wrote, as follows:

user.csv (400 rows, one message per row)

userID         messages
112   The car was broken and Kevin fixed it
.
.
.

I know some of the steps needed to apply DBSCAN, such as:

  1. Remove stop words
  2. Compute a distance between messages (I have code that computes cosine similarity)
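The two steps above can be sketched in plain Python as follows. This is only a minimal illustration: the stop-word list is a tiny made-up sample, the helper names `tokenize` and `cosine_distance` are hypothetical, and the vectors are simple word counts rather than TF-IDF weights.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "was", "it", "a", "an", "is", "of", "to", "in"}

def tokenize(message):
    """Lowercase, keep alphabetic words only, and drop stop words."""
    words = re.findall(r"[a-z]+", message.lower())
    return [w for w in words if w not in STOP_WORDS]

def cosine_distance(tokens_a, tokens_b):
    """Return 1 - cosine similarity of two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty message as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

messages = [
    "The car was broken and Kevin fixed it",
    "Kevin fixed the broken car",
    "It is raining in the city",
]
docs = [tokenize(m) for m in messages]
print(docs[0])
print(cosine_distance(docs[0], docs[1]))  # same words, so distance is 0
print(cosine_distance(docs[0], docs[2]))  # no shared words, so distance is 1
```

Computing this distance for every pair of messages gives the square distance matrix that DBSCAN needs as input.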

I am also aware that scikit-learn has a demo at 4, but I prefer a manual implementation so that I can see what is going on in the code.

It would be great if you could help with code that I can run on my side to learn from.

Bilgin

1 Answer


Bilgin!

Anony-Mousse asks the right questions and gives good suggestions. Before you use self-implemented DBSCAN code, work through the algorithm on paper. DBSCAN may not be the best algorithm for your data at all, so try the scikit-learn implementation first to see the results.
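As a rough sketch of what trying scikit-learn first could look like, the snippet below vectorizes a few messages with TF-IDF and clusters them with DBSCAN using cosine distance. The sample messages and the values of `eps` and `min_samples` are assumptions chosen for illustration; on the real 400 messages they would need tuning.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample messages standing in for the rows of user.csv.
messages = [
    "The car was broken and Kevin fixed it",
    "Kevin fixed the broken car",
    "I love pizza with extra cheese",
    "Stock markets fell sharply today",
]

# TF-IDF vectors with scikit-learn's built-in English stop-word list removed.
vectors = TfidfVectorizer(stop_words="english").fit_transform(messages)

# eps and min_samples are illustrative guesses, not recommended values.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)

for label, message in zip(labels, messages):
    print(label, message)  # label -1 means the message was treated as noise
```

If most messages come back labelled -1, that is a hint that `eps` is too small for the data or that a density-based method is a poor fit.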

Here is a Python implementation: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py and here is the theory: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py
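For reading alongside the theory, here is a minimal sketch of the standard DBSCAN procedure written over a precomputed distance matrix. It is not the code from the linked repository; the function names `region_query` and `dbscan`, the distance values, and the parameters `eps=0.3` and `min_pts=2` are all assumptions made for illustration.

```python
NOISE = -1
UNVISITED = None

def region_query(dist, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(dist)) if dist[i][j] <= eps]

def dbscan(dist, eps, min_pts):
    """Cluster points given a square distance matrix; returns one label per point."""
    labels = [UNVISITED] * len(dist)
    cluster_id = -1
    for i in range(len(dist)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(dist, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(dist, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core, so keep expanding
    return labels

# Made-up cosine distances between six messages (symmetric, zero diagonal).
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 1.0],
    [0.1, 0.0, 0.15, 0.95, 0.9, 1.0],
    [0.2, 0.15, 0.0, 0.9, 0.85, 1.0],
    [0.9, 0.95, 0.9, 0.0, 0.25, 1.0],
    [0.9, 0.9, 0.85, 0.25, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
]
labels = dbscan(dist, eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Working with a distance matrix keeps the clustering logic separate from the text processing, so any distance (cosine or otherwise) can be plugged in. The neighbor search here scans every point, which is fine for a few hundred messages but slow for large data sets.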

Good luck!

zina