addLanguageDetails

Add language identifiers to documents

Description

Use addLanguageDetails to add language identifiers to documents.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = addLanguageDetails(documents) detects the language of documents and updates the token details. The function adds language details only to tokens that do not already have them. To get the language details from updatedDocuments, use tokenDetails.

updatedDocuments = addLanguageDetails(documents,Name,Value) specifies additional options using one or more name-value pairs.
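For instance, a minimal sketch of the name-value syntax, assuming manually tokenized English text so that the language details are not already present:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
updatedDocuments = addLanguageDetails(documents,'Language','en');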

Tip

Use addLanguageDetails before using the lower and upper functions, because addLanguageDetails uses information that these functions remove.
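For example, a sketch of the recommended ordering, assuming manually tokenized text so that addLanguageDetails has details to add:

str = split("An Example Of A Short Sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents);  % add language details first
documents = lower(documents);               % then convert the documents to lowercase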

Examples


Manually tokenize some text by splitting it into an array of words. Convert the manually tokenized text into a tokenizedDocument object by setting the 'TokenizeMethod' option to 'none'.

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

View the token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×2 table
      Token       DocumentNumber
    __________    ______________

    "an"                1       
    "example"           1       
    "of"                1       
    "a"                 1       
    "short"             1       
    "sentence"          1       

When you specify 'TokenizeMethod','none', the function does not automatically detect the language details of the documents. To add the language details, use the addLanguageDetails function. This function, by default, automatically detects the language.

documents = addLanguageDetails(documents);

View the updated token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×4 table
      Token       DocumentNumber     Type      Language
    __________    ______________    _______    ________

    "an"                1           letters       en   
    "example"           1           letters       en   
    "of"                1           letters       en   
    "a"                 1           letters       en   
    "short"             1           letters       en   
    "sentence"          1           letters       en   

Input Arguments


Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

Language, specified as the comma-separated pair consisting of 'Language' and one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

  • 'ko' – Korean

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
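For instance, a minimal sketch, assuming English text, of how the stored language details influence a downstream function such as removeStopWords:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents,'Language','en');
newDocuments = removeStopWords(documents);  % removes English stop words such as "an", "of", and "a"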

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

Option to discard previously computed details and recompute them, specified as the comma-separated pair consisting of 'DiscardKnownValues' and either true or false.

Data Types: logical
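For example, a sketch of recomputing the details with 'DiscardKnownValues'; the scenario, in which the wrong language was set initially, is hypothetical:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents,'Language','ja');            % suppose the wrong language was set
documents = addLanguageDetails(documents,'DiscardKnownValues',true);  % discard the stored details and detect again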

Output Arguments


Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Introduced in R2018b