Details of tokens in tokenized document array
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"This" 1 1 letters en
"is" 1 1 letters en
"an" 1 1 letters en
"example" 1 1 letters en
"document" 1 1 letters en
"." 1 1 punctuation en
"It" 1 1 letters en
"has" 1 1 letters en
The Type variable contains the type of each token. View the emoticons in the documents.
idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
Token DocumentNumber LineNumber Type Language
_____ ______________ __________ ________ ________
":)" 2 1 emoticon en
":D" 3 1 emoticon en
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
__________ ______________ ______________ __________ ___________ ________
"This" 1 1 1 letters en
"is" 1 1 1 letters en
"an" 1 1 1 letters en
"example" 1 1 1 letters en
"document" 1 1 1 letters en
"." 1 1 1 punctuation en
"It" 1 2 1 letters en
"has" 1 2 1 letters en
View the token details of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
___________ ______________ ______________ __________ ___________ ________
"It" 3 2 1 letters en
"also" 3 2 1 letters en
"has" 3 2 1 letters en
"two" 3 2 1 letters en
"sentences" 3 2 1 letters en
"." 3 2 1 punctuation en
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
___________ ______________ __________ _______ ________
"fairest" 1 1 letters en
"creatures" 1 1 letters en
"desire" 1 1 letters en
"increase" 1 1 letters en
"thereby" 1 1 letters en
"beautys" 1 1 letters en
"rose" 1 1 letters en
"might" 1 1 letters en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
Token DocumentNumber SentenceNumber LineNumber Type Language PartOfSpeech
___________ ______________ ______________ __________ _______ ________ ______________
"fairest" 1 1 1 letters en adjective
"creatures" 1 1 1 letters en noun
"desire" 1 1 1 letters en verb
"increase" 1 1 1 letters en noun
"thereby" 1 1 1 letters en adverb
"beautys" 1 1 1 letters en verb
"rose" 1 1 1 letters en noun
"might" 1 1 1 letters en auxiliary-verb
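With part-of-speech details added, you can filter tokens by tag. As a sketch using the tdetails table above, extract the tokens tagged as nouns:

```matlab
% Select all tokens tagged as nouns in the token details table
idx = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(idx)
```

The same pattern works for any tag shown in the PartOfSpeech variable, such as "verb" or "adjective".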
documents — Input documents
tokenizedDocument array

Input documents, specified as a tokenizedDocument array.
tdetails — Table of token details
table

Table of token details. tdetails has the following variables:
| Name | Description |
| --- | --- |
| Token | Token text, returned as a string scalar. |
| DocumentNumber | Index of the document that the token belongs to, returned as a positive integer. |
| SentenceNumber | Sentence number of the token in the document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function. |
| LineNumber | Line number of the token in the document, returned as a positive integer. |
| Type | Type of the token, for example "letters", "punctuation", "emoticon", "emoji", or "other". If these details are missing, then first add type details to documents using the addTypeDetails function. |
| Language | Language of the token, for example "en" (English). These language details determine the behavior of language-dependent functions such as normalizeWords. If these details are missing, then first add language details to documents using the addLanguageDetails function. For more information about language support in Text Analytics Toolbox™, see Language Considerations. |
| PartOfSpeech | Part-of-speech tag, returned as a categorical, for example "noun", "verb", "adjective", "adverb", or "auxiliary-verb". If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function. |
| Entity | Entity tag. If these details are missing, then first add entity details to documents using the addEntityDetails function. |
| Lemma | Lemma form of the token. If these details are missing, then first add lemma details to documents using the addLemmaDetails function. |
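The optional variables above appear only after the corresponding add*Details function has been called. A minimal sketch of a workflow that populates several of them before querying the token details (the input sentence here is only illustrative):

```matlab
% Sketch: populate sentence, part-of-speech, and lemma details,
% then query the resulting token details table
documents = tokenizedDocument("The quick brown foxes jumped.");
documents = addPartOfSpeechDetails(documents);  % also adds sentence details
documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
tdetails(:,["Token" "PartOfSpeech" "Lemma"])
```

Indexing with a list of variable names returns just the columns of interest from the table.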
tokenDetails returns token type "emoji" for emoji characters
Behavior changed in R2018b

Starting in R2018b, tokenizedDocument detects emoji characters, and the tokenDetails function reports these tokens with type "emoji". This makes it easier to analyze text containing emoji characters.

In R2018a, tokenDetails reports emoji characters with type "other". To find the indices of the tokens with type "emoji" or "other", use the indices idx = tdetails.Type == "emoji" | tdetails.Type == "other", where tdetails is a table of token details.
addEntityDetails | addLanguageDetails | addLemmaDetails | addPartOfSpeechDetails | addSentenceDetails | addTypeDetails | normalizeWords | tokenizedDocument