Correct spelling of words
Use correctSpelling to correct the spelling of words in string arrays or documents.
The function supports English, German, and Korean text.
updatedDocuments = correctSpelling(documents) corrects the spelling of the words in the tokenizedDocument array documents.
updatedWords = correctSpelling(words) corrects the spelling of the words in the string vector words.
updatedWords = correctSpelling(words,'Language',language) also specifies the language of the words in the string vector words.
[___,unknownWords] = correctSpelling(___) also returns a vector of words in the input that were not found in the dictionary and for which no suggestion was found.
___ = correctSpelling(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Create a tokenized document array.
str = [
    "A documnent containing some misspelled worrds."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);
Correct the spelling of the words in the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  2x1 tokenizedDocument:

    7 tokens: A document containing some misspelled words .
    5 tokens: Another document containing typos .
Create a string array of words.
words = ["A" "strng" "array" "containing" "misspelled" "worrds" "."];
Correct the spelling of the words in the string array using the correctSpelling function.
updatedWords = correctSpelling(words)
updatedWords = 1x7 string
Columns 1 through 6
"A" "string" "array" "containing" "misspelled" "words"
Column 7
"."
Create a tokenized document array.
str = [
    "Analyze text data using MATLAB."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);
Correct the spelling of the words in the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  2x1 tokenizedDocument:

    7 tokens: Analyze text data using MAT LAB .
    5 tokens: Another document containing typos .
Notice that the word "MATLAB" gets split into the two words "MAT" and "LAB".
Correct the spelling of the documents and specify "MATLAB" as a known word using the 'KnownWords' option.
updatedDocuments = correctSpelling(documents,'KnownWords',"MATLAB")
updatedDocuments = 
  2x1 tokenizedDocument:

    6 tokens: Analyze text data using MATLAB .
    5 tokens: Another document containing typos .
documents — Input documents
tokenizedDocument array

Input documents, specified as a tokenizedDocument array.
words — Input words

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell
language — Word language
'en' | 'de' | 'ko'

Word language, specified as one of the following:

- 'en' – English language
- 'de' – German language
- 'ko' – Korean language

If you do not specify language, then the software detects the language automatically.

Data Types: char | string
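
As an illustration of setting the language explicitly instead of relying on automatic detection, here is a minimal sketch. The German word list is made up for this example, and no particular corrections are claimed.

```matlab
% Illustrative German words; "Dokumnt" is a deliberate typo.
words = ["Ein" "Dokumnt" "mit" "Tippfehlern"];

% Specify the language explicitly rather than relying on detection.
updatedWords = correctSpelling(words,'Language','de');
```

Specifying the language can be useful for short inputs, where automatic detection has little text to work with.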
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: correctSpelling(documents,'KnownWords',["MathWorks" "MATLAB"]) corrects the spelling of the words in documents and treats the words "MathWorks" and "MATLAB" as correctly spelled words.

'KnownWords' — Words to be treated as correct
[] (default) | string array | cell array of character vectors

Words to be treated as correct, specified as the comma-separated pair consisting of 'KnownWords' and a string array or a cell array of character vectors.

If you specify a list of known words, then these words remain unchanged when the function corrects spelling. The software may also substitute misspelled words with words from the list of known words.

Example: ["MathWorks" "MATLAB"]

Data Types: char | string | cell
'ExtensionDictionary' — Hunspell extension dictionary file
'' (default) | file path

Hunspell extension dictionary file (also known as personal dictionary file), specified as the comma-separated pair consisting of 'ExtensionDictionary' and a file path of a Hunspell extension dictionary file.

A Hunspell extension dictionary file is a .dic file containing a list of words in the following format:

word1/affixWord1
word2/affixWord2
...
wordN/affixWordN
*forbiddenWord1
*forbiddenWord2
...
*forbiddenWordM
- word1, word2, …, wordN are the words to extend the Hunspell dictionary with.
- affixWord1, affixWord2, …, affixWordN (optional) indicate words in the Hunspell dictionary that share affixes. Indicate affixes by concatenating them to the corresponding word with a forward slash (/). For example, the entry exxxtreme/extreme indicates that affixes that apply to the word "extreme" also apply to the custom word "exxxtreme".
- forbiddenWord1, forbiddenWord2, …, forbiddenWordM are the words forbidden from use in spelling correction. Indicate forbidden words using an asterisk (*).
The entries in the Hunspell extension dictionary file can appear in any order.
For example, to create a Hunspell extension dictionary file specifying that:
- "MathWorks", "MATLAB", and "exxxtreme" are words.
- The affixes that apply to the word "extreme" also apply to the word "exxxtreme".
- The word "MATLOB" is a forbidden word.
use:
MathWorks
MATLAB
exxxtreme/extreme
*MATLOB
For an example showing how to create Hunspell extension dictionary files, see Create Extension Dictionary for Spelling Correction. For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
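
The extension dictionary described above can be written to disk and passed to correctSpelling from MATLAB. The following is a minimal sketch: the file name myExtensionDictionary.dic and the input sentence are hypothetical, and the resulting corrections are not shown because they depend on the installed dictionaries.

```matlab
% Hypothetical file name in the temporary folder.
filename = fullfile(tempdir,'myExtensionDictionary.dic');

% One entry per line: plain words, a word sharing affixes with "extreme",
% and a forbidden word marked with an asterisk.
entries = ["MathWorks" "MATLAB" "exxxtreme/extreme" "*MATLOB"];

% Write the entries to the file, separated by newlines.
fid = fopen(filename,'w');
fprintf(fid,'%s\n',join(entries,newline));
fclose(fid);

% Correct spelling using the extension dictionary.
documents = tokenizedDocument("Analyze exxxtreme text data using MATLAB.");
updatedDocuments = correctSpelling(documents,'ExtensionDictionary',filename);
```

Writing the file from a string array keeps the word list in one place and avoids maintaining the .dic file by hand.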
'Dictionary' — Hunspell dictionary file
'' (default) | file path

Hunspell dictionary file, specified as the comma-separated pair consisting of 'Dictionary' and a file path of a Hunspell dictionary file.

A Hunspell dictionary file is a .dic file containing the number of words in the dictionary followed by a list of the words in the following format:

N
word1/flags1
word2/flags2
...
wordN/flagsN
where N is the number of words in the dictionary file, word1, word2, …, wordN are the N words in the dictionary, and flags1, …, flagsN specify optional flags corresponding to the words word1, word2, …, wordN, respectively. Use flags to specify word attributes, for example affixes. To specify a Hunspell affix file, use the 'Affixes' option.
For example, to create a Hunspell dictionary file containing the 4 words "MathWorks", "MATLAB", "correctSpelling", and "tokenizedDocument", use:
4
MathWorks
MATLAB
correctSpelling
tokenizedDocument
For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
'Affixes' — Hunspell affix file
'' (default) | file path

Hunspell affix file, specified as the comma-separated pair consisting of 'Affixes' and a file path of a Hunspell affix file.

A Hunspell affix file is a .aff file containing a list of options in the following format:

option1 values1
option2 values2
...
optionM valuesM
where M is the number of options in the affix file, option1, option2, …, optionM are the M options, and values1, …, valuesM specify the values corresponding to the options option1, option2, …, optionM, respectively. Use these options to specify affixes.
To define a prefix rule, use the PFX option with the format:

PFX flag crossProduct K
PFX flag stripping1 prefix1 condition1
...
PFX flag strippingK prefixK conditionK
- flag corresponds to the flags used in the Hunspell dictionary file.
- crossProduct indicates whether prefixes and suffixes can be mixed, specified as Y or N.
- K is the number of prefixes defined for the specified flag.
- stripping1, stripping2, …, strippingK indicate characters to be stripped from the word when applying the prefix. If the stripping value is 0, then no stripping takes place.
- prefix1, prefix2, …, prefixK specify the prefixes to use.
- condition1, condition2, …, conditionK specify the optional conditions for which to apply the prefixes prefix1, prefix2, …, prefixK, respectively. For the trivial condition, specify ".".
To define a suffix rule, use the SFX option with the format:

SFX flag crossProduct K
SFX flag stripping1 suffix1 condition1
...
SFX flag strippingK suffixK conditionK
suffix1, suffix2, …, suffixK specify the suffixes to use, and the flag, cross product, K, stripping, and condition values have the same meaning as in the prefix format.
For example, to define the following affix rules:
- Flag A: prefix words with "re".
- Flag B: suffix words not ending with "y" with "ed", and suffix words ending with "y" with "ied", removing "y".
use the Hunspell affix file:
PFX A Y 1
PFX A 0 re .
SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y
To use these flags in a Hunspell dictionary file, append the appropriate flags to the words using a forward slash (/). For each word, you can specify multiple flags. For example, to specify a dictionary file containing:
- The words "ptest" and "ptry".
- For the word "ptest" only, the prefix "re" using flag A.
- For both words, the suffixes "ed" or "ied" where appropriate using flag B.
use:
2
ptest/AB
ptry/B
For more information about the options of Hunspell affix files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
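
To show how an affix file and a dictionary file fit together, the following sketch writes both files from MATLAB and passes them to correctSpelling. It is an illustration only: the file names and the input words are hypothetical, and it assumes that the 'Dictionary' and 'Affixes' options can be combined in one call as described above. The corrections are not shown because they depend on how the custom dictionary is applied.

```matlab
% Hypothetical file names in the temporary folder.
affFile = fullfile(tempdir,'myAffixes.aff');
dicFile = fullfile(tempdir,'myDictionary.dic');

% Affix rules: flag A adds the prefix "re"; flag B adds the suffix "ed",
% or replaces a trailing "y" with "ied".
affLines = [
    "PFX A Y 1"
    "PFX A 0 re ."
    "SFX B Y 2"
    "SFX B 0 ed [^y]"
    "SFX B y ied y"];

% Dictionary: the word count on the first line, then one word per line
% with its flags appended after a forward slash.
dicLines = [
    "2"
    "ptest/AB"
    "ptry/B"];

% Write the affix file.
fid = fopen(affFile,'w');
fprintf(fid,'%s\n',join(affLines,newline));
fclose(fid);

% Write the dictionary file.
fid = fopen(dicFile,'w');
fprintf(fid,'%s\n',join(dicLines,newline));
fclose(fid);

% Correct made-up misspellings using the custom dictionary and affixes.
words = ["reptestd" "ptryed"];
updatedWords = correctSpelling(words,'Dictionary',dicFile,'Affixes',affFile);
```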
'RetokenizeMethod' — Method to retokenize documents
'split' (default) | 'none'

Method to retokenize documents, specified as the comma-separated pair consisting of 'RetokenizeMethod' and one of the following:

- 'split' – Correct spelling by splitting tokens. For example, split the incorrectly spelled token "twowords" into the correctly spelled tokens "two" and "words".
- 'none' – Do not split tokens for spelling correction.
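
The difference between the two methods can be seen by running the function twice on the same input. This is a minimal sketch with a made-up word list. The exact corrections depend on the dictionary, so no output is shown.

```matlab
% Illustrative input containing a token made of two joined words.
words = ["twowords" "are" "joined"];

% Default behavior: tokens can be split, so the output can be longer.
splitWords = correctSpelling(words);

% Disable splitting: each input token maps to one output token.
unsplitWords = correctSpelling(words,'RetokenizeMethod','none');
```

Using 'none' can be helpful when downstream code relies on the corrected output lining up element by element with the input.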
updatedDocuments — Corrected documents
tokenizedDocument array

Corrected documents, returned as a tokenizedDocument array. If the 'RetokenizeMethod' option is 'split', then the number of words in each updated document may differ from the number in the corresponding input document.

If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.
updatedWords — Corrected words

Corrected words, returned as a string vector. If the 'RetokenizeMethod' option is 'split', then the number of updated words may differ from the number of input words.

If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.
unknownWords — Unknown words

Unknown words, returned as a string vector. The string vector unknownWords contains the input words that are not in the spelling correction dictionary and for which no suggestions are found.
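
A short sketch of requesting the second output follows. The word list is made up, and whether any particular word ends up in unknownWords depends on the dictionary, so no output is shown.

```matlab
% Illustrative input; "xqzvtk" is a nonsense token that may have no suggestion.
words = ["A" "strng" "containing" "xqzvtk"];

% Request the words that could not be corrected as a second output.
[updatedWords,unknownWords] = correctSpelling(words);

% Report how many words were left without a suggestion.
fprintf('%d unknown word(s)\n',numel(unknownWords));
```

Inspecting unknownWords is one way to find candidates for the 'KnownWords' option or for an extension dictionary.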