
Here I'm trying to compute the similarity between 1000 × 1000 pairs of strings (using the Levenshtein ratio). I'm using a DataFrame approach where you only need n(n-1)/2 comparisons instead of n*n (for n = 1000 that's 499,500 pairs instead of 1,000,000). Even so it took a lot of time; is there a better way to optimise this further?

import time
import random, string
import Levenshtein
import pandas as pd

# random alphanumeric strings of length 10

rand_ls = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for i in range(1000)]

# a DataFrame filled with 0's, with shape n x n where n = len(rand_ls)

df = pd.DataFrame(0, index = rand_ls, columns = rand_ls)

s = time.time()
for i in range(len(df)):
    for j in range(len(df)):
        if i > j:
            dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])
            df.iloc[i, j] = dist
            df.iloc[j, i] = dist

e = time.time()
print(e - s)

This took about 130 seconds for the 1000 × 1000 comparisons.
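One direction I've been looking at, as a rough sketch only: the per-cell df.iloc writes inside the loop seem to account for a lot of the cost, so the same n(n-1)/2 ratio calls could fill a plain NumPy array first, with the DataFrame built once at the end. The names mat and df_fast below are just for illustration, and I haven't verified this is the fastest option.

import numpy as np

# reuse rand_ls from above; fill a NumPy array instead of writing into
# the DataFrame cell by cell, then build the DataFrame once at the end
n = len(rand_ls)
mat = np.zeros((n, n))
for i in range(n):
    for j in range(i):  # lower triangle only: n*(n-1)/2 pairs
        dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])
        mat[i, j] = dist
        mat[j, i] = dist

df_fast = pd.DataFrame(mat, index=rand_ls, columns=rand_ls)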
