The edit distance (also: Levenshtein distance) between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to convert one string into the other.
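As a rough illustration of the definition, here is a minimal sketch of the standard dynamic-programming (Wagner–Fischer) computation, keeping only two rows of the table. The function name `levenshtein` is just an illustrative choice, and unit cost for every operation is assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions that turn a into b (unit costs assumed)."""
    # prev[j] = distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

This runs in $O(mn)$ time and $O(n)$ extra space for strings of lengths $m$ and $n$.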
Questions tagged [edit-distance]
68 questions
12
votes
1 answer
Edit distance of list with unique elements
Levenshtein-Distance edit distance between lists
is a well studied problem.
But I can't find much on possible improvements if
it is known that no element occurs more than once in each list.
Let's also assume that the elements are…
user362178
12
votes
5 answers
How does editing software (like Microsoft Word or Gmail) pick the 2nd string to compare in Levenshtein distance?
I understand the textbook explanation of how to use dynamic programming to find the minimum edit distance between 2 strings but how do we get to pick the 2nd string?
I don't think the entire dictionary is compared as sometimes the difference is…
heretoinfinity
9
votes
2 answers
Alternative to Hamming distance for permutations
I have two strings, where one is a permutation of the other. I was wondering if there is an alternative to Hamming distance where instead of finding the minimum number of substitutions required, it would find the minimum number of translocations…
user1357015
9
votes
2 answers
What are some efficient ways to find the differences between two large corpuses of text that have similar, but differently ordered content?
I have two large files containing paragraphs of English text:
The first text is about 200 pages long and has about 10 paragraphs per page (each paragraph is 5 sentences long).
The second text contains almost precisely the same paragraphs and text…
vikram7
8
votes
1 answer
Find all pairs of strings in a set with Levenshtein distance < d
I have a set of $n = $ 100 million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the…
1''
7
votes
2 answers
Efficient algorithm for edit distance for short sequences
I have an application that needs to compute billions of Levenshtein distances between pairs of strings. The strings are short (70 in length) DNA sequences, consisting only of 4 characters. Also it can be assumed that one of the strings is fixed,…
Ameer Jewdaki
7
votes
1 answer
Levenshtein distance and dynamic time warping
I am not sure how to draw a parallel between the Wagner–Fischer algorithm and the DTW algorithm.
In both cases we want to find the distance for each index combination (i,j).
In Wagner–Fischer, we initialize the distance with the number of insertions we'd have to do from…
nicolas
7
votes
0 answers
Number of strings at given edit distance
I would like to know the number of strings at edit distance $n$ of a string $s$.
I guess this is textbook knowledge... but I cannot find the textbook in question.
More formally, I have an alphabet $\Sigma$ (in my case, $|\Sigma| = 4$), and I…
unamourdeswann
6
votes
1 answer
Extending ordered tree edit distance to DAGs
Computing edit distance (shortest sequence of edit operations) on ordered trees is a well studied problem with many known algorithms (e.g. Zhang & Shasha, RTED). There is also considerable literature on edit distance for general graph (e.g., this…
Martin Modrák
6
votes
1 answer
Why is the running time of edit distance with memoization $O(mn)$?
I understand without memoization it is going to be $O(3^{\max\,\{m,n\}})$ because every call results in extra three calls: thus we end up having a call tree with three children for each node, with height $\max\,\{m,n\}$, m and n being lengths of two…
Sandesh Kobal
5
votes
3 answers
Find member of CFL that is Levenshtein-closest to non-member string
Is there an (efficient?) algorithm which given a context-free language $L$ (given as a grammar) and a string $x$ with $x \not \in L$ computes a $y$ with $y \in L$ and $\forall y': y' \in L \implies d(x, y) \le d(x, y')$, where $d$ is the Levenshtein…
Jonas Kölker
5
votes
1 answer
How to speed up process of finding duplicates/similar items in a large amount of strings?
Our software receives documents (in the order of tens of thousands) from various providers, each document flows through a number of steps, one of those steps finds duplicates and similar documents (within 80% threshold) to this document.
We…
chester89
5
votes
1 answer
Understanding the heuristic used for approximate string searching through an FSA
The paper I'm looking at: Fast approximate string matching with finite automata (2009)
Explanation of the algorithm (from my understanding anyway):
A word is inputted into the automaton and from each state, a number of possible actions can be taken…
user2908849
5
votes
2 answers
How fast can we identify almost-duplicates in a list of strings?
I'm having trouble figuring out the upper bound running time for this scenario:
Input:
$N$ number of strings
$M$ upper bound of string length
$T$ threshold for edit distance (2 strings with a Damerau-Levenshtein edit distance lower than $T$ are…
Eran Medan
5
votes
1 answer
Semi-local Levenshtein distance
If you have a long string of length $n$ and a shorter string of length $m$, what is a suitable recurrence to let you compute all $n-m+1$ Levenshtein distances between the shorter string and all substrings of the longer string of length $m$?
Can it…
Simd