Find edit distance between two strings or documents
specifies additional options using one or more name-value pair arguments.d
= editDistance(___,Name,Value
)
Find the edit distance between the strings "Text analytics"
and "Text analysis"
. The edit distance, by default, is the total number of grapheme insertions, deletions, and substitutions required to change one string to another.
str1 = "Text analytics"; str2 = "Text analysis";
Find the edit distance.
d = editDistance(str1,str2)
d = 2
This means changing the first string to the second requires two edits. For example:
Substitution – Substitute the character "t"
to an "s"
: "Text analytics"
to "Text analysics"
.
Deletion – Delete the character "c"
: "Text analysics"
to "Text analysis"
.
Find the edit distance between two tokenized documents. For tokenized document input, the edit distance, by default, is the total number of token insertions, deletions, and substitutions required to change one document to another.
str1 = "It's time for breakfast."; document1 = tokenizedDocument(str1); str2 = "It's now time to sleep."; document2 = tokenizedDocument(str2);
Find the edit distance.
d = editDistance(document1,document2)
d = 3
This means changing the first document to the second requires three edits. For example:
Insertion – Insert the word "now"
.
Substitution – Substitute the word "for"
with "to"
.
Substitution – Substitute the word "breakfast"
with "sleep"
.
The editDistance
function, by default, returns the lowest number of grapheme insertions, deletions, and substitutions required to change one string to another. To also include the swap action in the calculation, use the 'SwapCost'
option.
First, find the edit distance between the strings "MATALB"
and "MATLAB"
.
str1 = "MATALB"; str2 = "MATLAB"; d = editDistance(str1,str2)
d = 2
One possible edit is:
Substitute the second "A"
with "L"
: ("MATALB"
to "MATLLB"
).
Substitute the second "L"
with "A"
: ("MATLLB"
to "MATLAB"
).
The default value for the swap cost (the cost of swapping two adjacent graphemes) is Inf
. This means that swaps do not count towards the edit distance. To include swaps, set the 'SwapCost'
option to 1.
d = editDistance(str1,str2,'SwapCost',1)
d = 1
This means there is one action. For example, swap the adjacent characters "A"
and "L"
.
To compute the edit distance between two words and specify that the edits are case-insensitive, specify a custom substitute cost function.
First, compute the edit distance between the strings "MATLAB"
and "MathWorks"
.
d = editDistance("MATLAB","MathWorks")
d = 8
This means changing the first string to the second requires eight edits. For example:
Substitution – Substitute the character "A"
with "a"
. ("MATLAB"
to "MaTLAB"
)
Substitution – Substitute the character "T"
with "t"
. ("MaTLAB"
to "MatLAB"
)
Substitution – Substitute the character "L"
with "h"
. ("MatLAB"
to "MathAB"
)
Substitution – Substitute the character "A"
with "W"
. ("MathAB"
to "MathWB"
)
Substitution – Substitute the character "B"
with "o"
. ("MathWB"
to "MathWo"
)
Insertion – Insert the character "r"
. ("MathWo"
to "MathWor"
)
Insertion – Insert the character "k"
. ("MathWor"
to "MathWork"
)
Insertion – Insert the character "s"
. ("MathWork"
to "MathWorks"
)
Compute the edit distance and specify the custom substitution cost function caseInsensitiveSubstituteCost
, listed at the end of the example. The custom function caseInsensitiveSubstituteCost
returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.
d = editDistance("MATLAB","MathWorks",'SubstituteCost',@caseInsensitiveSubstituteCost)
d = 6
This means the total cost for changing the first string to the second is 6. For example:
Substitution (cost 0) – Substitute the character "A"
with "a"
. ("MATLAB"
to "MaTLAB"
)
Substitution (cost 0) – Substitute the character "T"
with "t"
. ("MaTLAB"
to "MatLAB"
)
Substitution (cost 1) – Substitute the character "L"
with "h"
. ("MatLAB"
to "MathAB"
)
Substitution (cost 1) – Substitute the character "A"
with "W"
. ("MathAB"
to "MathWB"
)
Substitution (cost 1) – Substitute the character "B"
with "o"
. ("MathWB"
to "MathWo"
)
Insert (cost 1) – Insert the character "r"
. ("MathWo"
to "MathWor"
)
Insert (cost 1) – Insert the character "k"
. ("MathWor"
to "MathWork"
)
Insert (cost 1) – Insert the character "s"
. ("MathWork"
to "MathWorks"
)
Custom Cost Function
The custom function caseInsensitiveSubstituteCost
returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.
function cost = caseInsensitiveSubstituteCost(grapheme1,grapheme2) if lower(grapheme1) == lower(grapheme2) cost = 0; else cost = 1; end end
str1
— Source stringSource string, specified as a string array, character vector, or a cell array of character vectors.
If str1
contains multiple strings, then
str2
must be the same size as str1
or
scalar.
Data Types: char
| string
| cell
str2
— Target stringTarget string, specified as a string array, character vector, or a cell array of character vectors.
If str2
contains multiple strings, then
str1
must be the same size as str2
or
scalar.
Data Types: char
| string
| cell
document1
— Source documenttokenizedDocument
Source document, specified as a tokenizedDocument
array.
If document1
contains multiple documents, then
document2
must be the same size as document1
or scalar.
document2
— Target documenttokenizedDocument
Target document, specified as a tokenizedDocument
array.
If document2
contains multiple documents, then
document1
must be the same size as document2
or scalar.
Specify optional
comma-separated pairs of Name,Value
arguments. Name
is
the argument name and Value
is the corresponding value.
Name
must appear inside quotes. You can specify several name and value
pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
editDistance("MATALB","MATLAB",'SwapCost',1)
returns the edit
distance between the strings "MATALB"
and "MATLAB"
and
sets the cost to swap two adjacent graphemes to 1.'InsertCost'
— Cost to insert grapheme or tokenCost to insert a grapheme or token, specified as the comma-separated pair
consisting of 'InsertCost'
and a nonnegative scalar or a function
handle.
If 'InsertCost'
is a function handle, then the function must
accept a single input and return the cost of inserting the input to the source. For example:
For string input to editDistance
, the cost function
must have the form cost = func(grapheme)
, where the function
returns the cost of inserting grapheme
into
str1
.
For document input to editDistance
, the cost function
must have the form cost = func(token)
, where the function
returns the cost of inserting token
into
document1
.
Example: 'InsertCost',2
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
'DeleteCost'
— Cost to delete grapheme or tokenCost to delete grapheme or token, specified as the comma-separated pair consisting
of 'DeleteCost'
and a nonnegative scalar or a function
handle.
If 'DeleteCost'
is a function handle, then the function must
accept a single input and return the cost of deleting the input from the source. For example:
For string input to editDistance
, the cost function
must have the form cost = func(grapheme)
, where the function
returns the cost of deleting grapheme
from
str1
.
For document input to editDistance
, the cost function
must have the form cost = func(token)
, where the function
returns the cost of deleting token
from
document1
.
Example: 'DeleteCost',2
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
'SubstituteCost'
— Cost to substitute grapheme or tokenCost to substitute a grapheme or token, specified as the comma-separated pair consisting
of 'SubstituteCost'
and a nonnegative scalar or a function
handle.
If 'SubstituteCost'
is a function handle, then the function must
accept exactly two inputs and return the cost of substituting the first input with the
second in the source. For example:
For string input to editDistance
, the cost function
must have the form cost = func(grapheme1,grapheme2)
, where
the function returns the cost of substituting grapheme1
with
grapheme2
in str1
.
For document input to editDistance
, the cost function
must have the form cost = func(token1,token2)
, where the
function returns the cost of substituting token1
with
token2
in document1
.
Example: 'SubstituteCost',2
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
'SwapCost'
— Cost to swap two adjacent graphemes or tokensInf
(default) | nonnegative scalar | function handleCost to swap two adjacent graphemes or tokens, specified as the comma-separated
pair consisting of 'SwapCost'
and a nonnegative scalar or a
function handle.
If 'SwapCost'
is a function handle, then the function must
accept exactly two inputs and return the cost of swapping the first input with the
second in the source. For example:
For string input to editDistance
, the cost function
must have the form cost = func(grapheme1,grapheme2)
, where
the function returns the cost of swapping the adjacent graphemes
grapheme1
and grapheme2
in
str1
.
For document input to editDistance
, the cost function
must have the form cost = func(token1,token2)
, where the
function returns the cost of swapping the adjacent tokens
token1
and token2
in
document1
.
Example: 'SwapCost',2
Data Types: single
| double
| int8
| int16
| int32
| int64
| uint8
| uint16
| uint32
| uint64
| function_handle
d
— Edit distanceEdit distance, returned as a nonnegative scalar.
The function, by default, uses the Levenshtein distance: the lowest number of insertions, deletions, and substitutions required to convert one string to another.
For other commonly used edit distances, use these options:
Distance | Description | Options |
---|---|---|
Levenshtein (default) | lowest number of insertions, deletions, and substitutions | Default |
Damerau-Levenshtein | lowest number of insertions, deletions, substitutions, and swaps | 'SwapCost',1 |
Hamming | lowest number of substitutions only | 'InsertCost',Inf,'DeleteCost',Inf |
correctSpelling
| editDistanceSearcher
| knnsearch
| rangesearch
| splitGraphemes
| tokenizedDocument
You have a modified version of this example. Do you want to open this example with your edits?