
Here I'm trying to compute the similarity between 1000 × 1000 pairs of strings (using the Levenshtein ratio). I'm using a DataFrame approach where you only need n(n-1)/2 comparisons instead of n*n (for n = 1000 that's 499,500 pairs instead of 1,000,000). Even so it took a lot of time; is there a better way to optimise this further?

import time
import random, string
import Levenshtein
import pandas as pd

# random alphanumeric strings of length 10

rand_ls = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for i in range(1000)]

# a DataFrame filled with 0's, with shape n x n where n = len(rand_ls)

df = pd.DataFrame(0, index = rand_ls, columns = rand_ls)

s = time.time()
for i in range(len(df)):
    for j in range(len(df)):
        if i > j:
            dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])
            df.iloc[i, j] = dist
            df.iloc[j, i] = dist

e = time.time()
print(e - s)

This took about 130 seconds for the 1000 × 1000 comparisons.
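One direction I've been looking at, as a rough sketch only: the per-cell df.iloc writes inside the loop seem to account for a lot of the cost, so the same n(n-1)/2 ratio calls could fill a plain NumPy array first, with the DataFrame built once at the end. The names mat and df_fast below are just for illustration, and I haven't verified this is the fastest option.

import numpy as np

# reuse rand_ls from above; fill a NumPy array instead of writing into
# the DataFrame cell by cell, then build the DataFrame once at the end
n = len(rand_ls)
mat = np.zeros((n, n))
for i in range(n):
    for j in range(i):  # lower triangle only: n*(n-1)/2 pairs
        dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])
        mat[i, j] = dist
        mat[j, i] = dist

df_fast = pd.DataFrame(mat, index=rand_ls, columns=rand_ls)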
