Levenshtein distances #
We define the Levenshtein edit distance levenshtein C xy ys
between two List α
,
with a customizable cost structure C
for the delete
, insert
, and substitute
operations.
As an auxiliary function, we define suffixLevenshtein C xs ys
, which gives the list of distances
from each suffix of xs
to ys
.
This is defined by recursion on ys
, using the internal function Levenshtein.impl
,
which computes suffixLevenshtein C xs (y :: ys)
using xs
, y
, and suffixLevenshtein C xs ys
.
(This corresponds to the usual algorithm
using the last two rows of the matrix of distances between suffixes.)
After setting up these definitions, we prove lemmas specifying their behaviour, particularly
theorem suffixLevenshtein_eq_tails_map :
(suffixLevenshtein C xs ys).1 = xs.tails.map fun xs' => levenshtein C xs' ys := ...
and
theorem levenshtein_cons_cons :
levenshtein C (x :: xs) (y :: ys) =
min (C.delete x + levenshtein C xs (y :: ys))
(min (C.insert y + levenshtein C (x :: xs) ys)
(C.substitute x y + levenshtein C xs ys)) := ...
A cost structure for Levenshtein edit distance.
- delete : α → δ
Cost to delete an element from a list.
- insert : β → δ
Cost in insert an element into a list.
- substitute : α → β → δ
Cost to substitute one element for another in a list.
Instances For
The default cost structure, for which all operations cost 1
.
Equations
- Levenshtein.defaultCost = { delete := fun (x : α) => 1, insert := fun (x : α) => 1, substitute := fun (a b : α) => if a = b then 0 else 1 }
Instances For
Equations
Cost structure given by a function. Delete and insert cost the same, and substitution costs the greater value.
Equations
- Levenshtein.weightCost f = { delete := fun (a : α) => f a, insert := fun (b : α) => f b, substitute := fun (a b : α) => max (f a) (f b) }
Instances For
Cost structure for strings, where cost is the length of the token.
Instances For
Cost structure for strings, where cost is the log base 2 length of the token.
Equations
- Levenshtein.stringLogLengthCost = Levenshtein.weightCost fun (s : String) => (s.length + 1).log2
Instances For
(Implementation detail for levenshtein
)
Given a list xs
and the Levenshtein distances from each suffix of xs
to some other list ys
,
compute the Levenshtein distances from each suffix of xs
to y :: ys
.
(Note that we don't actually need to know ys
itself here, so it is not an argument.)
The return value is a list of length x.length + 1
,
and it is convenient for the recursive calls that we bundle this list
with a proof that it is non-empty.
Equations
- One or more equations did not get rendered due to their size.
Instances For
suffixLevenshtein C xs ys
computes the Levenshtein distance
(using the cost functions provided by a C : Cost α β δ
)
from each suffix of the list xs
to the list ys
.
The first element of this list is the Levenshtein distance from xs
to ys
.
Note that if the cost functions do not satisfy the inequalities
C.delete a + C.insert b ≥ C.substitute a b
C.substitute a b + C.substitute b c ≥ C.substitute a c
(or if any values are negative) then the edit distances calculated here may not agree with the general geodesic distance on the edit graph.
Equations
- One or more equations did not get rendered due to their size.
Instances For
levenshtein C xs ys
computes the Levenshtein distance
(using the cost functions provided by a C : Cost α β δ
)
from the list xs
to the list ys
.
Note that if the cost functions do not satisfy the inequalities
C.delete a + C.insert b ≥ C.substitute a b
C.substitute a b + C.substitute b c ≥ C.substitute a c
(or if any values are negative) then the edit distance calculated here may not agree with the general geodesic distance on the edit graph.
Equations
- levenshtein C xs ys = match suffixLevenshtein C xs ys with | ⟨r, w⟩ => r[0]