Details of tokens in tokenized document array
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"This" 1 1 letters en
"is" 1 1 letters en
"an" 1 1 letters en
"example" 1 1 letters en
"document" 1 1 letters en
"." 1 1 punctuation en
"It" 1 1 letters en
"has" 1 1 letters en
The Type variable contains the type of each token. View the emoticons in the documents.
idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
Token DocumentNumber LineNumber Type Language
_____ ______________ __________ ________ ________
":)" 2 1 emoticon en
":D" 3 1 emoticon en
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
__________ ______________ ______________ __________ ___________ ________
"This" 1 1 1 letters en
"is" 1 1 1 letters en
"an" 1 1 1 letters en
"example" 1 1 1 letters en
"document" 1 1 1 letters en
"." 1 1 1 punctuation en
"It" 1 2 1 letters en
"has" 1 2 1 letters en
View the token details of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
___________ ______________ ______________ __________ ___________ ________
"It" 3 2 1 letters en
"also" 3 2 1 letters en
"has" 3 2 1 letters en
"two" 3 2 1 letters en
"sentences" 3 2 1 letters en
"." 3 2 1 punctuation en
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
___________ ______________ __________ _______ ________
"fairest" 1 1 letters en
"creatures" 1 1 letters en
"desire" 1 1 letters en
"increase" 1 1 letters en
"thereby" 1 1 letters en
"beautys" 1 1 letters en
"rose" 1 1 letters en
"might" 1 1 letters en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
Token DocumentNumber SentenceNumber LineNumber Type Language PartOfSpeech
___________ ______________ ______________ __________ _______ ________ ______________
"fairest" 1 1 1 letters en adjective
"creatures" 1 1 1 letters en noun
"desire" 1 1 1 letters en verb
"increase" 1 1 1 letters en noun
"thereby" 1 1 1 letters en adverb
"beautys" 1 1 1 letters en verb
"rose" 1 1 1 letters en noun
"might" 1 1 1 letters en auxiliary-verb
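With part-of-speech details added, you can filter tokens by tag. As a sketch using the tdetails table above, extract the tokens tagged as nouns:

```matlab
% Select all tokens tagged as nouns in the token details table
idx = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(idx)
```

The same pattern works for any tag shown in the PartOfSpeech variable, such as "verb" or "adjective".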
documents — Input documents
tokenizedDocument array

Input documents, specified as a tokenizedDocument array.
tdetails — Table of token details
table

Table of token details. tdetails has the following variables:
| Name | Description |
| --- | --- |
| Token | Token text, returned as a string scalar. |
| DocumentNumber | Index of the document that the token belongs to, returned as a positive integer. |
| SentenceNumber | Sentence number of the token in the document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function. |
| LineNumber | Line number of the token in the document, returned as a positive integer. |
| Type | Type of the token, for example "letters", "punctuation", "emoticon", "emoji", or "other". If these details are missing, then first add type details to documents using the addTypeDetails function. |
| Language | Language of the token, for example "en" (English). These language details determine the behavior of language-dependent functions such as normalizeWords. If these details are missing, then first add language details to documents using the addLanguageDetails function. For more information about language support in Text Analytics Toolbox™, see Language Considerations. |
| PartOfSpeech | Part-of-speech tag, returned as a categorical, for example "noun", "verb", "adjective", "adverb", or "auxiliary-verb". If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function. |
| Entity | Entity tag. If these details are missing, then first add entity details to documents using the addEntityDetails function. |
| Lemma | Lemma form of the token. If these details are missing, then first add lemma details to documents using the addLemmaDetails function. |
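The optional variables above appear only after the corresponding add*Details function has been called. A minimal sketch of a workflow that populates several of them before querying the token details (the input sentence here is only illustrative):

```matlab
% Sketch: populate sentence, part-of-speech, and lemma details,
% then query the resulting token details table
documents = tokenizedDocument("The quick brown foxes jumped.");
documents = addPartOfSpeechDetails(documents);  % also adds sentence details
documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
tdetails(:,["Token" "PartOfSpeech" "Lemma"])
```

Indexing with a list of variable names returns just the columns of interest from the table.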
tokenDetails returns token type "emoji" for emoji characters
Behavior changed in R2018b

Starting in R2018b, tokenizedDocument detects emoji characters, and the tokenDetails function reports these tokens with type "emoji". This makes it easier to analyze text containing emoji characters.

In R2018a, tokenDetails reports emoji characters with type "other". To find the indices of the tokens with type "emoji" or "other", use the indices idx = tdetails.Type == "emoji" | tdetails.Type == "other", where tdetails is a table of token details.
addEntityDetails | addLanguageDetails | addLemmaDetails | addPartOfSpeechDetails | addSentenceDetails | addTypeDetails | normalizeWords | tokenizedDocument