addTypeDetails

Add token type details to documents

Syntax

updatedDocuments = addTypeDetails(documents)

updatedDocuments = addTypeDetails(documents,Name,Value)

Description

updatedDocuments = addTypeDetails(documents) detects the token types in documents and updates the token details. The function adds type details only to tokens whose type is unknown. To get the token types from updatedDocuments, use tokenDetails.

updatedDocuments = addTypeDetails(documents,Name,Value) specifies additional options using one or more name-value pairs.

Tip

Use addTypeDetails before using the lower, upper, and erasePunctuation functions, because addTypeDetails uses information that these functions remove.
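The ordering in this tip can be sketched as follows. This is a minimal illustration, assuming Text Analytics Toolbox is installed; the input string is taken from the example on this page, and the preprocessing steps after addTypeDetails are only one possible pipeline.

```matlab
% Manually tokenized text; with 'TokenizeMethod','none' the token types are not detected.
str = ["For" "more" "information" "," "see" "https://www.mathworks.com" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Detect token types first, while the original casing and punctuation are still present.
documents = addTypeDetails(documents);

% Then apply the functions that remove this information.
documents = erasePunctuation(documents);
documents = lower(documents);

% Inspect the remaining tokens and their details.
tdetails = tokenDetails(documents)
```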

Examples

Add Token Type Details to Documents

Convert manually tokenized text into a tokenizedDocument object, setting the 'TokenizeMethod' option to 'none'.

str = ["For" "more" "information" "," "see" "https://www.mathworks.com" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none')

documents = 
  tokenizedDocument:

   7 tokens: For more information , see https://www.mathworks.com .

View the token details using the tokenDetails function.

tdetails = tokenDetails(documents)

tdetails=7×2 table
               Token               DocumentNumber
    ___________________________    ______________

    "For"                                1       
    "more"                               1       
    "information"                        1       
    ","                                  1       
    "see"                                1       
    "https://www.mathworks.com"          1       
    "."                                  1

If you set 'TokenizeMethod' to 'none' in the call to the tokenizedDocument function, then it does not detect the types of the tokens. To add the token type details, use the addTypeDetails function.

documents = addTypeDetails(documents);

View the updated token details.

tdetails = tokenDetails(documents)

tdetails=7×3 table
               Token               DocumentNumber       Type    
    ___________________________    ______________    ___________

    "For"                                1           letters    
    "more"                               1           letters    
    "information"                        1           letters    
    ","                                  1           punctuation
    "see"                                1           letters    
    "https://www.mathworks.com"          1           web-address
    "."                                  1           punctuation

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'TopLevelDomains',["com" "net" "org"] specifies the top-level domains "com", "net", and "org" for web address detection.

`'TopLevelDomains'` — Top-level domains
character vector | string array | cell array of character vectors

Top-level domains to use for web address detection, specified as a character vector, string array, or cell array of character vectors.

If you do not specify TopLevelDomains, then the function uses the output of the topLevelDomains function.

Example: ["com" "net" "org"]

Data Types: char | string | cell
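As a minimal sketch of this option, the following call passes the top-level domains from the example above. The input string is reused from the example on this page; the resulting types depend on the tokens and domains you supply, so no output is shown.

```matlab
% Manually tokenized text; with 'TokenizeMethod','none' the token types are not detected.
str = ["For" "more" "information" "," "see" "https://www.mathworks.com" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Use only the listed top-level domains for web address detection.
documents = addTypeDetails(documents,'TopLevelDomains',["com" "net" "org"]);

% View the token details, including the Type variable.
tdetails = tokenDetails(documents)
```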

`'DiscardKnownValues'` — Option to discard previously computed details
`false` (default) | `true`

Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
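One situation where this option may be useful is recomputing the type details after changing another option. The sketch below is illustrative only: it assumes the documents already have type details from an earlier call, and the choice of "org" as the only top-level domain is a hypothetical value picked for demonstration.

```matlab
% Manually tokenized text; with 'TokenizeMethod','none' the token types are not detected.
str = ["For" "more" "information" "," "see" "https://www.mathworks.com" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% First pass: add type details using the default options.
documents = addTypeDetails(documents);

% Second pass: discard the previously computed details and recompute them
% with a different set of top-level domains (hypothetical choice).
documents = addTypeDetails(documents, ...
    'TopLevelDomains',"org", ...
    'DiscardKnownValues',true);

% View the recomputed token details.
tdetails = tokenDetails(documents)
```

With the default value false, the second call would add details only to tokens whose type is still unknown.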

Output Arguments

`updatedDocuments` — Updated documents
`tokenizedDocument` array

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.
