This example shows how to use the Latent Dirichlet Allocation (LDA) topic model to analyze text data.
An LDA model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities within each topic.
Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event.
data = readtable("factoryReports.csv",'TextType','string');
head(data)
ans = 8×5 table

                                Description                                        Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38
Extract the text data from the field Description.
textData = data.Description;
textData(1:10)
ans = 10×1 string
"Items are occasionally getting stuck in the scanner spools."
"Loud rattling and banging sounds are coming from assembler pistons."
"There are cuts to the power when starting the plant."
"Fried capacitors in the assembler."
"Mixer tripped the fuses."
"Burst pipe in the constructing agent is spraying coolant."
"A fuse is blown in the mixer."
"Things continue to tumble off of the belt."
"Falling items from the conveyor belt."
"The scanner reel is split, it will soon begin to curve."
Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps in order:
1. Tokenize the text using tokenizedDocument.
2. Lemmatize the words using normalizeWords.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
Use the preprocessing function preprocessText to prepare the text data.
documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
Create a bag-of-words model from the tokenized documents.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [480×351 double]
      Vocabulary: [1×351 string]
        NumWords: 351
    NumDocuments: 480
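Before pruning the vocabulary, it can help to sanity-check the word counts. A small optional check, not part of the original example, using topkwords, which lists the most frequent words in a bag-of-words model:

% Optional check: list the ten most frequent words in the bag-of-words model.
tbl = topkwords(bag,10)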
Remove words from the bag-of-words model that do not appear more than two times in total. Remove any documents containing no words from the bag-of-words model.
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag)
bag = 
  bagOfWords with properties:

          Counts: [480×162 double]
      Vocabulary: [1×162 string]
        NumWords: 162
    NumDocuments: 480
Fit an LDA model with 7 topics. For an example showing how to choose the number of topics, see Choose Number of Topics for LDA Model. To suppress verbose output, set 'Verbose' to 0.
numTopics = 7;
mdl = fitlda(bag,numTopics,'Verbose',0);
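If you want to pick the number of topics from the data rather than fixing it at 7, one common approach is to compare validation perplexity across candidate topic counts. A minimal sketch, assuming a simple hold-out split; the 80/20 split and the candidate range are illustrative choices, not part of this example:

% Hold out 20% of the preprocessed documents for validation.
cvp = cvpartition(numel(documents),'HoldOut',0.2);
trainDocs = documents(training(cvp));
valDocs = documents(test(cvp));
bagTrain = bagOfWords(trainDocs);

% Fit one model per candidate topic count and record validation perplexity.
candidateTopics = [5 7 10 15];   % illustrative candidates
perplexities = zeros(size(candidateTopics));
for i = 1:numel(candidateTopics)
    mdlK = fitlda(bagTrain,candidateTopics(i),'Verbose',0);
    [~,perplexities(i)] = logp(mdlK,valDocs);   % lower perplexity is better
end

figure
plot(candidateTopics,perplexities,'-o')
xlabel("Number of Topics")
ylabel("Validation Perplexity")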
If you have a large dataset, then the stochastic approximate variational Bayes solver is usually better suited, as it can fit a good model in fewer passes of the data. The default solver for fitlda (collapsed Gibbs sampling) can be more accurate, at the cost of taking longer to run. To use stochastic approximate variational Bayes, set the 'Solver' option to 'savb'. For an example showing how to compare LDA solvers, see Compare LDA Solvers.
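For instance, the same fit using the stochastic solver looks like this (a sketch; the resulting topics will differ somewhat from the Gibbs fit above):

% Fit the same model with the stochastic approximate variational Bayes solver.
mdlSAVB = fitlda(bag,numTopics,'Solver','savb','Verbose',0);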
You can use word clouds to view the words with the highest probabilities in each topic. Visualize the first four topics using word clouds.
figure
for topicIdx = 1:4
    subplot(2,2,topicIdx)
    wordcloud(mdl,topicIdx);
    title("Topic " + topicIdx)
end
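If you prefer a tabular view over word clouds, topkwords also accepts an LDA model and a topic index. A short sketch listing the highest-probability words in the first topic:

% List the five highest-probability words in topic 1.
tbl = topkwords(mdl,5,1)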
Use transform to transform the documents into vectors of topic probabilities.
newDocument = tokenizedDocument("Coolant is pooling underneath sorter.");
topicMixture = transform(mdl,newDocument);

figure
bar(topicMixture)
xlabel("Topic Index")
ylabel("Probability")
title("Document Topic Probabilities")
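To read off the single most probable topic for the new document, you can take the index of the largest mixture component. A small follow-up sketch; the variable names are illustrative:

% Find the most probable topic for the new document and inspect its top words.
[~,topTopic] = max(topicMixture);
topkwords(mdl,5,topTopic)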
Visualize multiple topic mixtures using a stacked bar chart. View the topic mixtures of the first five input documents.
figure
topicMixtures = transform(mdl,documents(1:5));
barh(topicMixtures(1:5,:),'stacked')
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend("Topic " + string(1:numTopics),'Location','northeastoutside')
The function preprocessText performs the following steps in order:
1. Tokenize the text using tokenizedDocument.
2. Lemmatize the words using normalizeWords.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
See Also
addPartOfSpeechDetails | bagOfWords | fitlda | ldaModel | removeEmptyDocuments | removeInfrequentWords | removeStopWords | tokenizedDocument | transform | wordcloud