This topic summarizes the Text Analytics Toolbox™ features that support German text. For an example showing how to analyze German text data, see Analyze German Text Data.
The tokenizedDocument function automatically detects German input. Alternatively, set the 'Language' option in tokenizedDocument to 'de'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
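For example, this minimal sketch sets the language explicitly rather than relying on automatic detection. The input string is illustrative.

str = "Heute wird ein guter Tag.";
documents = tokenizedDocument(str,'Language','de');   % specify German explicitly
tdetails = tokenDetails(documents);                   % inspect the language details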
Tokenize German text using tokenizedDocument. The function automatically detects German text.
str = [ "Guten Morgen. Wie geht es dir?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
To detect sentence structure in documents, use the addSentenceDetails function. You can use the abbreviations function to help create custom lists of abbreviations to detect.
Tokenize German text using tokenizedDocument.
str = [ "Guten Morgen, Dr. Schmidt. Geht es Ihnen wieder besser?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails,10)
ans=10×6 table
      Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language
    _________    ______________    ______________    __________    ___________    ________

    "Guten"            1                 1                1          letters         de
    "Morgen"           1                 1                1          letters         de
    ","                1                 1                1          punctuation     de
    "Dr"               1                 1                1          letters         de
    "."                1                 1                1          punctuation     de
    "Schmidt"          1                 1                1          letters         de
    "."                1                 1                1          punctuation     de
    "Geht"             1                 2                1          letters         de
    "es"               1                 2                1          letters         de
    "Ihnen"            1                 2                1          letters         de
View a table of German abbreviations. Use this table to help create custom tables of abbreviations for sentence detection when using addSentenceDetails.
tbl = abbreviations('Language','de');
head(tbl)
ans=8×2 table
    Abbreviation     Usage
    ____________    _______

       "A.T"        regular
       "ABl"        regular
       "Abb"        regular
       "Abdr"       regular
       "Abf"        regular
       "Abfl"       regular
       "Abh"        regular
       "Abk"        regular
To add German part of speech details to documents, use the addPartOfSpeechDetails function.
Tokenize German text using tokenizedDocument.
str = [ "Guten Morgen. Wie geht es dir?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
To get the part of speech details for German text, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
To view the part of speech details, use the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________

    "Guten"           1                 1                1          letters         de        adjective
    "Morgen"          1                 1                1          letters         de        noun
    "."               1                 1                1          punctuation     de        punctuation
    "Wie"             1                 2                1          letters         de        adverb
    "geht"            1                 2                1          letters         de        verb
    "es"              1                 2                1          letters         de        pronoun
    "dir"             1                 2                1          letters         de        pronoun
    "?"               1                 2                1          punctuation     de        punctuation
To add entity tags to documents, use the addEntityDetails function.
Tokenize German text using tokenizedDocument.
str = [ "Ernst zog von Frankfurt nach Berlin." "Besuchen Sie Volkswagen in Wolfsburg."]; documents = tokenizedDocument(str);
To add entity tags to German text, use the addEntityDetails function. This function detects person names, locations, organizations, and other named entities.
documents = addEntityDetails(documents);
To view the entity details, use the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×8 table
      Token        DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech      Entity
    ___________    ______________    ______________    __________    ___________    ________    ____________    __________

    "Ernst"               1                 1               1          letters         de       proper-noun     person
    "zog"                 1                 1               1          letters         de       verb            non-entity
    "von"                 1                 1               1          letters         de       adposition      non-entity
    "Frankfurt"           1                 1               1          letters         de       proper-noun     location
    "nach"                1                 1               1          letters         de       adposition      non-entity
    "Berlin"              1                 1               1          letters         de       proper-noun     location
    "."                   1                 1               1          punctuation     de       punctuation     non-entity
    "Besuchen"            2                 1               1          letters         de       verb            non-entity
View the words tagged with entity "person", "location", "organization", or "other". These are the words that are not tagged "non-entity".
idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:)
ans=5×8 table
       Token        DocumentNumber    SentenceNumber    LineNumber     Type      Language    PartOfSpeech       Entity
    ____________    ______________    ______________    __________    _______    ________    ____________    ____________

    "Ernst"                1                 1               1         letters      de       proper-noun     person
    "Frankfurt"            1                 1               1         letters      de       proper-noun     location
    "Berlin"               1                 1               1         letters      de       proper-noun     location
    "Volkswagen"           2                 1               1         letters      de       noun            organization
    "Wolfsburg"            2                 1               1         letters      de       proper-noun     location
To remove stop words from documents according to the token language details, use removeStopWords. For a list of German stop words, set the 'Language' option in stopWords to 'de'.
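For instance, you can inspect the list directly. A quick sketch:

% Retrieve the built-in German stop word list.
germanStopWords = stopWords('Language','de');
germanStopWords(1:5)   % view the first few entries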
Tokenize German text using tokenizedDocument. The function automatically detects German text.
str = [ "Guten Morgen. Wie geht es dir?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
Remove stop words using the removeStopWords function. The function uses the language details from documents to determine which language stop words to remove.
documents = removeStopWords(documents)
documents =
  2x1 tokenizedDocument:

    5 tokens: Guten Morgen . geht ?
    5 tokens: Heute wird guter Tag .
To stem tokens according to the token language details, use normalizeWords.
Tokenize German text using the tokenizedDocument function. The function automatically detects German text.
str = [ "Guten Morgen. Wie geht es dir?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str);
Stem the tokens using normalizeWords.
documents = normalizeWords(documents)
documents =
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .
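If you are working with a plain string array rather than documents, normalizeWords can be pointed at a language directly. A sketch, assuming the 'Language' option is accepted for string input:

% Stem a string array of German words.
words = ["guten" "morgen" "tage"];
normalizeWords(words,'Language','de')   % assumption: 'Language' applies to string input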
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
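A minimal sketch, reusing the German strings from the earlier examples:

% Create a bag-of-words model from German documents.
documents = tokenizedDocument([
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."]);
bag = bagOfWords(documents);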
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
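Continuing the sketch above, you can fit a topic model to the bag-of-words object. The two-document input is toy-sized and for illustration only.

% Fit an LDA model with two topics to the bag-of-words model.
numTopics = 2;
mdl = fitlda(bag,numTopics);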
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
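A final sketch along the same lines. The 'MinCount' value is lowered only so that the toy corpus has enough words to train on; in practice, use a much larger corpus.

% Train a word embedding on the tokenized documents.
emb = trainWordEmbedding(documents,'MinCount',1);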
addLanguageDetails | addPartOfSpeechDetails | normalizeWords | removeStopWords | stopWords | tokenDetails | tokenizedDocument