Questions tagged [edit-distance]

The edit distance (also: Levenshtein distance) between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to convert one string into the other.
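
As a rough illustration of the definition, here is a minimal sketch of the standard dynamic-programming (Wagner–Fischer) computation, assuming unit cost for each insertion, deletion, and substitution. The function name `levenshtein` is only illustrative and is not taken from any question under this tag.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the current prefix of a and b[:j];
    # initially the prefix of a is empty, so the distance is j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so the sketch uses $O(\min$-free$)$ bookkeeping of $O(|b|)$ extra space while still taking $O(|a| \cdot |b|)$ time.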

68 questions
12
votes
1 answer

Edit distance of list with unique elements

Levenshtein edit distance between lists is a well-studied problem, but I can't find much on possible improvements when it is known that no element occurs more than once in each list. Let's also assume that the elements are…
user362178
12
votes
5 answers

How does editing software (like Microsoft Word or Gmail) pick the 2nd string to compare in Levenshtein distance?

I understand the textbook explanation of how to use dynamic programming to find the minimum edit distance between 2 strings, but how do we get to pick the 2nd string? I don't think the entire dictionary is compared, as sometimes the difference is…
heretoinfinity
9
votes
2 answers

Alternative to Hamming distance for permutations

I have two strings, where one is a permutation of the other. I was wondering if there is an alternative to Hamming distance where instead of finding the minimum number of substitutions required, it would find the minimum number of translocations…
9
votes
2 answers

What are some efficient ways to find the differences between two large corpuses of text that have similar, but differently ordered content?

I have two large files containing paragraphs of English text: The first text is about 200 pages long and has about 10 paragraphs per page (each paragraph is 5 sentences long). The second text contains almost precisely the same paragraphs and text…
8
votes
1 answer

Find all pairs of strings in a set with Levenshtein distance < d

I have a set of $n = 100$ million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the…
1''
7
votes
2 answers

Efficient algorithm for edit distance for short sequences

I have an application that needs to compute billions of Levenshtein distances between pairs of strings. The strings are short DNA sequences (length 70), consisting of only 4 characters. Also, it can be assumed that one of the strings is fixed,…
Ameer Jewdaki
7
votes
1 answer

Levenshtein distance and dynamic time warping

I am not sure how to draw a parallel between the Wagner–Fischer algorithm and the DTW algorithm. In both cases we want to find the distance for each index combination (i,j). In Wagner–Fischer, we initialize the distance with the number of insertions we'd have to do from…
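
To make the comparison in this excerpt concrete, here is a hedged side-by-side sketch of the two table fills. It assumes unit edit costs for Wagner–Fischer and an absolute-difference local cost for DTW over numeric sequences; both choices, and the function names, are illustrative assumptions rather than anything stated in the question.

```python
import math

def wagner_fischer(a, b):
    """Edit distance: borders are i and j, each step adds a fixed edit cost."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # i deletions to reach the empty sequence
    for j in range(n + 1):
        D[0][j] = j  # j insertions starting from the empty sequence
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # match or substitution
    return D[m][n]

def dtw(x, y):
    """DTW: borders are infinite, each cell adds the local cost of pairing x[i-1] with y[j-1]."""
    m, n = len(x), len(y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # only the empty/empty alignment is free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            D[i][j] = local + min(D[i - 1][j],      # x[i-1] also pairs with y[j-1] after x[i-2] did
                                  D[i][j - 1],      # y[j-1] also pairs with x[i-1] after y[j-2] did
                                  D[i - 1][j - 1])  # both sequences advance
    return D[m][n]
```

Both fill an $(m+1) \times (n+1)$ table from the same three neighbours. In this sketch the differences are the border initialisation and where the cost is charged: per edit operation in Wagner–Fischer, per matched pair in DTW.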
7
votes
0 answers

Number of strings at given edit distance

I would like to know the number of strings at edit distance $n$ of a string $s$. I guess this is textbook knowledge... but I cannot find the textbook in question. More formally, I have an alphabet $\Sigma$ (in my case, $|\Sigma| = 4$), and I…
6
votes
1 answer

Extending ordered tree edit distance to DAGs

Computing edit distance (shortest sequence of edit operations) on ordered trees is a well-studied problem with many known algorithms (e.g. Zhang & Shasha, RTED). There is also considerable literature on edit distance for general graphs (e.g., this…
Martin Modrák
6
votes
1 answer

Why is the running time of edit distance with memoization $O(mn)$?

I understand that without memoization it is going to be $O(3^{\max\,\{m,n\}})$ because every call results in three more calls: thus we end up with a call tree with three children per node and height $\max\,\{m,n\}$, $m$ and $n$ being the lengths of the two…
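
One way to see the $O(mn)$ bound is to count how many distinct subproblems a memoized recursion actually evaluates. The sketch below is an illustration under unit edit costs; the name `edit_distance_memo` and the call counter are hypothetical additions for the demonstration.

```python
from functools import lru_cache

def edit_distance_memo(a: str, b: str):
    """Return (distance, number of distinct (i, j) subproblems evaluated)."""
    calls = 0

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # The body runs once per distinct (i, j); repeated requests hit the cache.
        nonlocal calls
        calls += 1
        if i == 0:
            return j  # insert the remaining j characters of b
        if j == 0:
            return i  # delete the remaining i characters of a
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,         # deletion
                   d(i, j - 1) + 1,         # insertion
                   d(i - 1, j - 1) + cost)  # match or substitution

    return d(len(a), len(b)), calls
```

Since $0 \le i \le m$ and $0 \le j \le n$, the body can run at most $(m+1)(n+1)$ times, and each run does constant work apart from its (cached) recursive calls, which gives $O(mn)$ overall. Python's recursion limit makes this sketch suitable only for short inputs.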
5
votes
3 answers

Find member of CFL that is Levenshtein-closest to non-member string

Is there an (efficient?) algorithm which given a context-free language $L$ (given as a grammar) and a string $x$ with $x \not \in L$ computes a $y$ with $y \in L$ and $\forall y': y' \in L \implies d(x, y) \le d(x, y')$, where $d$ is the Levenshtein…
5
votes
1 answer

How to speed up process of finding duplicates/similar items in a large amount of strings?

Our software receives documents (on the order of tens of thousands) from various providers. Each document flows through a number of steps; one of those steps finds duplicates and similar documents (within an 80% threshold) for this document. We…
chester89
5
votes
1 answer

Understanding the heuristic used for approximate string searching through an FSA

The paper I'm looking at: Fast approximate string matching with finite automata (2009). Explanation of the algorithm (from my understanding anyway): A word is fed into the automaton and from each state, a number of possible actions can be taken…
5
votes
2 answers

How fast can we identify almost-duplicates in a list of strings?

I'm having trouble figuring out the upper bound on the running time for this scenario. Input: $N$, the number of strings; $M$, an upper bound on string length; $T$, a threshold for edit distance (2 strings with a Damerau-Levenshtein edit distance lower than $T$ are…
5
votes
1 answer

Semi-local Levenshtein distance

If you have a long string of length $n$ and a shorter string of length $m$, what is a suitable recurrence to let you compute all $n-m+1$ Levenshtein distances between the shorter string and all substrings of the longer string of length $m$? Can it…