This example shows how to create a function that cleans and preprocesses text data for analysis.
Text data can be large and can contain a lot of noise, which negatively affects statistical analysis. For example, text data can contain the following:

- Variations in case, for example "new" and "New"
- Variations in word forms, for example "walk" and "walking"
- Words which add noise, for example stop words such as "the" and "of"
- Punctuation and special characters
- HTML and XML tags
These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and a preprocessed version of the same text data.
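As a sketch of how such word clouds can be produced, assuming the Text Analytics Toolbox wordcloud function and using the textData and preprocessTextData defined later in this example:

```matlab
% Word cloud of the raw text data.
figure
wordcloud(textData);
title("Raw Text Data")

% Word cloud of the preprocessed documents.
documents = preprocessTextData(textData);
figure
wordcloud(documents);
title("Preprocessed Text Data")
```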
It can be useful to create a preprocessing function, so you can prepare different collections of text data in the same way. For example, when training a model, you can use the same function to preprocess new data using the same steps as the training data.
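This pattern can be sketched as follows; here trainingData and newData are hypothetical string arrays, not variables defined in this example:

```matlab
% Preprocess the training data and the new data with the same function,
% so that both pass through identical cleaning steps.
documentsTrain = preprocessTextData(trainingData);
documentsNew = preprocessTextData(newData);
```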
The function preprocessTextData, listed at the end of the example, performs the following steps:

1. Tokenize the text using tokenizedDocument.
2. Lemmatize the words using normalizeWords.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.

To use the function, input your text data into preprocessTextData.
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)
documents =
  2x1 tokenizedDocument:

    8 tokens: large tree down block traffic outside apple hill
    7 tokens: lot damage many car windshield parking lot
function documents = preprocessTextData(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
For an example showing a more detailed workflow, see Prepare Text Data for Analysis.
For next steps in text analytics, you can try creating a classification model or analyzing the data using topic models. For examples, see Create Simple Text Model for Classification and Analyze Text Data Using Topic Models.
addPartOfSpeechDetails | erasePunctuation | normalizeWords | removeLongWords | removeShortWords | removeStopWords | tokenizedDocument