This topic summarizes the Text Analytics Toolbox™ features that support Korean text.
The tokenizedDocument function automatically detects Korean input. Alternatively, set the 'Language' option in tokenizedDocument to 'ko'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
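For example, this sketch tokenizes a Korean sentence and inspects the token language details; the sample text is illustrative.

str = "텍스트 분석을 시작해 보겠습니다.";              % sample Korean text (illustrative)
documents = tokenizedDocument(str);                    % Korean input is detected automatically
documents = tokenizedDocument(str,'Language','ko');    % equivalently, specify the language
tdetails = tokenDetails(documents);                    % table of token details, including Language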
To specify additional MeCab options for tokenization, create a mecabOptions object. To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
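For instance, a minimal sketch; it assumes the MeCab dictionary configured on your system is appropriate for Korean text.

str = "텍스트 분석을 시작해 보겠습니다.";                  % sample Korean text (illustrative)
opts = mecabOptions;                                       % default MeCab options (assumes a suitable dictionary)
documents = tokenizedDocument(str,'TokenizeMethod',opts);  % tokenize with the specified MeCab options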
The tokenDetails function, by default, includes part-of-speech details and entity details with the token details.
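For example, assuming documents is a tokenizedDocument array of Korean text as above:

tdetails = tokenDetails(documents);   % includes PartOfSpeech and Entity variables by default
head(tdetails)                        % preview the first few rows of token details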
To remove stop words from documents according to the token language details, use removeStopWords. For a list of Korean stop words, set the 'Language' option in stopWords to 'ko'.
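For example:

documents = removeStopWords(documents);          % uses the token language details
koreanStopWords = stopWords('Language','ko');    % list of Korean stop words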
To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to 'lemma'.
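For example:

documents = normalizeWords(documents,'Style','lemma');   % lemmatize using the token language details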
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
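For example:

bag = bagOfWords(documents);      % word counts per document
bagN = bagOfNgrams(documents);    % bigram counts by default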
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
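For example, a sketch fitting topic models to the bag-of-words counts; the number of topics is an illustrative choice.

numTopics = 4;                      % illustrative choice
mdlLDA = fitlda(bag,numTopics);     % latent Dirichlet allocation model
mdlLSA = fitlsa(bag,numTopics);     % latent semantic analysis model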
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
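For example, a sketch training an embedding from a tokenizedDocument array; the dimension is an illustrative choice.

emb = trainWordEmbedding(documents,'Dimension',100);   % 100-dimensional embedding (illustrative)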
See Also: addEntityDetails | addLanguageDetails | addPartOfSpeechDetails | normalizeWords | removeStopWords | stopWords | tokenDetails | tokenizedDocument