addLanguageDetails

Add language identifiers to documents

Description

Use addLanguageDetails to add language identifiers to documents.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = addLanguageDetails(documents) detects the language of documents and updates the token details. The function adds language details only to tokens that do not already have them. To get the language details from updatedDocuments, use tokenDetails.

updatedDocuments = addLanguageDetails(documents,Name,Value) specifies additional options using one or more name-value pairs.
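For instance, a minimal sketch of the name-value syntax, assuming manually tokenized English text so that the language details are not already present:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
updatedDocuments = addLanguageDetails(documents,'Language','en');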

Tip

Use addLanguageDetails before using the lower and upper functions, because addLanguageDetails uses information that these functions remove.
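For example, a sketch of the recommended ordering, assuming manually tokenized text so that addLanguageDetails has details to add:

str = split("An Example Of A Short Sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents);  % add language details first
documents = lower(documents);               % then convert the documents to lowercase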

Examples


Manually tokenize some text by splitting it into an array of words. Convert the manually tokenized text into a tokenizedDocument object by setting the 'TokenizeMethod' option to 'none'.

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

View the token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×2 table
      Token       DocumentNumber
    __________    ______________

    "an"                1       
    "example"           1       
    "of"                1       
    "a"                 1       
    "short"             1       
    "sentence"          1       

When you specify 'TokenizeMethod','none', the function does not automatically detect the language details of the documents. To add the language details, use the addLanguageDetails function. This function, by default, automatically detects the language.

documents = addLanguageDetails(documents);

View the updated token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×4 table
      Token       DocumentNumber     Type      Language
    __________    ______________    _______    ________

    "an"                1           letters       en   
    "example"           1           letters       en   
    "of"                1           letters       en   
    "a"                 1           letters       en   
    "short"             1           letters       en   
    "sentence"          1           letters       en   

Input Arguments


Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

Language, specified as the comma-separated pair consisting of 'Language' and one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

  • 'ko' – Korean

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
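For instance, a minimal sketch, assuming English text, of how the stored language details influence a downstream function such as removeStopWords:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents,'Language','en');
newDocuments = removeStopWords(documents);  % removes English stop words such as "an", "of", and "a"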

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

Option to discard previously computed details and recompute them, specified as the comma-separated pair consisting of 'DiscardKnownValues' and either true or false.

Data Types: logical
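For example, a sketch of recomputing the details with 'DiscardKnownValues'; the scenario, in which the wrong language was set initially, is hypothetical:

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');
documents = addLanguageDetails(documents,'Language','ja');            % suppose the wrong language was set
documents = addLanguageDetails(documents,'DiscardKnownValues',true);  % discard the stored details and detect again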

Output Arguments


Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Introduced in R2018b