Text Analytics Glossary

This section provides a list of terms used in text analytics.

Documents and Tokens

Term	Definition	More Information
Bigram	Two tokens in succession. For example, `["New" "York"]`.	`bagOfNgrams`
Complex token	A token with complex structure. For example, an email address or a hashtag.	`tokenDetails`
Context	Tokens or characters that surround a given token.	`context`
Corpus	A collection of documents.	`tokenizedDocument`
Document	A single observation of text data. For example, a report, a tweet, or an article.	`tokenizedDocument`
Grapheme	A human-readable character. A grapheme can consist of multiple Unicode code points. For example, "a", "😎", or "語".	`splitGraphemes`
N-gram	N tokens in succession.	`bagOfNgrams`
Part of speech	Categories of words used in grammatical structure. For example, "noun", "verb", and "adjective".	`addPartOfSpeechDetails`
Token	A string of characters representing a unit of text data, also known as a "unigram". For example, a word, number, or email address.	`tokenizedDocument`
Token details	Information about the token. For example, type, language, or part-of-speech details.	`tokenDetails`
Token types	The category of the token. For example, "letters", "punctuation", or "email address".	`tokenDetails`
Tokenized document	A document split into tokens.	`tokenizedDocument`
Trigram	Three tokens in succession. For example, `["The" "United" "States"]`.	`bagOfNgrams`
Vocabulary	Unique words or tokens in a corpus or model.	`tokenizedDocument`
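
The relationship between tokens, n-grams, and a vocabulary can be illustrated with a minimal sketch. The code below is plain Python, not the toolbox's implementation: the `ngrams` helper is a hypothetical name introduced here, and whitespace splitting stands in for a real tokenizer, which would also handle punctuation and complex tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: whitespace splitting is a crude substitute for tokenization.
tokens = "The United States of America".split()

bigrams = ngrams(tokens, 2)    # runs of two tokens
trigrams = ngrams(tokens, 3)   # runs of three tokens

# The vocabulary is the set of unique tokens, sorted here for readability.
vocabulary = sorted(set(tokens))

print(bigrams[0])
print(trigrams[0])
print(vocabulary)
```

A document with k tokens yields k - n + 1 n-grams, so the five tokens above produce four bigrams and three trigrams.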

Preprocessing

Term	Definition	More Information
Normalize	Reduce words to a root form. For example, reduce the word "walking" to "walk" using stemming or lemmatization.	`normalizeWords`
Lemmatize	Reduce words to a dictionary word (the lemma form). For example, reduce the words "running" and "ran" to "run".	`normalizeWords`
Stem	Reduce words by removing inflections. The reduced word is not necessarily a real word. For example, the Porter stemmer reduces the words "happy" and "happiest" to "happi".	`normalizeWords`
Stop words	Words commonly removed before analysis. For example, "and", "of", and "the".	`removeStopWords`
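
As a rough illustration of two of these steps, the sketch below removes stop words and then lemmatizes with a lookup table. It is plain Python under stated assumptions: the stop word list and the lemma dictionary are tiny, made-up examples, and the function names are hypothetical. Real stop word lists and lemmatizers are far larger and language specific.

```python
# Assumption: a three-word stop list, taken from the examples in the table.
STOP_WORDS = {"and", "of", "the"}

# Assumption: a hand-written lemma dictionary covering only this example.
LEMMAS = {"running": "run", "ran": "run"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


def lemmatize(tokens, lemmas=LEMMAS):
    """Map each token to its dictionary form; leave unknown tokens unchanged."""
    return [lemmas.get(t.lower(), t) for t in tokens]


tokens = ["The", "dog", "ran", "and", "kept", "running"]
filtered = remove_stop_words(tokens)
normalized = lemmatize(filtered)

print(filtered)
print(normalized)
```

Lemmatization by lookup always returns a dictionary word, whereas a stemmer applies suffix rules and can return a form such as "happi" that is not a real word.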

Modeling and Prediction

Bag-of-Words

Term	Definition	More Information
Bag-of-n-grams model	A model that records the number of times that n-grams appear in each document of a corpus.	`bagOfNgrams`
Bag-of-words model	A model that records the number of times that words appear in each document of a collection.	`bagOfWords`
Term frequency count matrix	A matrix of the frequency counts of words occurring in a collection of documents corresponding to a given vocabulary. This matrix is the underlying data of a bag-of-words model.	`bagOfWords`
Term Frequency–Inverse Document Frequency (tf-idf) matrix	A matrix of statistical measures based on the word frequency counts in each document and the proportion of documents in the corpus that contain each word.	`tfidf`
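
The sketch below builds a term frequency count matrix for a two-document corpus and derives a tf-idf matrix from it. It is a plain Python illustration rather than the toolbox's code. The weighting used, count times log(N / document frequency), is one common variant and is an assumption here; other term frequency and inverse document frequency weightings exist.

```python
import math
from collections import Counter

# A tiny corpus of already tokenized documents (made-up data).
documents = [
    ["an", "example", "of", "a", "short", "sentence"],
    ["a", "second", "short", "sentence"],
]

# Vocabulary: the unique words in the corpus, in a fixed order.
vocabulary = sorted({word for doc in documents for word in doc})

# Term frequency count matrix: one row per document, one column per word.
counts = [[Counter(doc)[word] for word in vocabulary] for doc in documents]

# Document frequency: the number of documents that contain each word.
n_docs = len(documents)
doc_freq = [sum(1 for row in counts if row[j] > 0)
            for j in range(len(vocabulary))]

# Assumed weighting: tf-idf = count * log(N / document frequency).
tfidf = [[row[j] * math.log(n_docs / doc_freq[j])
          for j in range(len(vocabulary))]
         for row in counts]

print(vocabulary)
print(counts)
```

With this weighting, a word that appears in every document, such as "a" here, receives a weight of zero, while a word confined to one document is weighted up.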

Latent Dirichlet Allocation

Term	Definition	More Information
Corpus topic probabilities	The probabilities of observing each topic in the corpus used to fit the LDA model.	`ldaModel`
Document topic probabilities	The probabilities of observing each topic in each document used to fit the LDA model. Equivalently, the topic mixtures of the training documents.	`ldaModel`
Latent Dirichlet allocation (LDA)	A generative statistical topic model that infers topic probabilities in documents and word probabilities in topics.	`fitlda`
Perplexity	A statistical measure of how well a model describes the given data. A lower perplexity indicates a better fit.	`logp`
Topic	A distribution of words, characterized by the "topic word probabilities".	`ldaModel`
Topic concentration	The concentration parameter of the underlying Dirichlet distribution of the corpus topic mixtures.	`ldaModel`
Topic mixture	The probabilities of topics in a given document.	`transform`
Topic word probabilities	The probabilities of words in a given topic.	`ldaModel`
Word concentration	The concentration parameter of the underlying Dirichlet distribution of the topics.	`ldaModel`
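
To show how topic mixtures, topic word probabilities, and perplexity relate, the following sketch scores one document under two candidate topic mixtures. It is plain Python with made-up numbers and hypothetical names, and it does not fit an LDA model. Perplexity is computed with the standard definition, the exponential of the negative mean log-probability per word, which is assumed here rather than taken from the toolbox.

```python
import math

# Assumption: hand-picked topic word probabilities; each topic sums to 1.
topic_word_probs = {
    "sports":   {"ball": 0.5, "game": 0.4, "vote": 0.1},
    "politics": {"ball": 0.1, "game": 0.1, "vote": 0.8},
}


def word_probs(topic_mixture):
    """Mix the topics: P(word) = sum over topics of P(topic) * P(word | topic)."""
    words = next(iter(topic_word_probs.values())).keys()
    return {w: sum(p * topic_word_probs[t][w] for t, p in topic_mixture.items())
            for w in words}


def perplexity(document, topic_mixture):
    """exp of the negative mean log-probability per word in the document."""
    probs = word_probs(topic_mixture)
    log_prob = sum(math.log(probs[w]) for w in document)
    return math.exp(-log_prob / len(document))


document = ["ball", "game", "game", "vote"]
sports_heavy = {"sports": 0.75, "politics": 0.25}
politics_heavy = {"sports": 0.25, "politics": 0.75}

print(perplexity(document, sports_heavy))
print(perplexity(document, politics_heavy))
```

Because the document consists mostly of sports words, the sports-heavy mixture assigns it a higher probability and therefore a lower perplexity, which matches the rule that lower perplexity indicates a better fit.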

Latent Semantic Analysis

Term	Definition	More Information
Component weights	The singular values of the decomposition, squared.	`lsaModel`
Document scores	The score vectors in lower dimensional space of the documents used to fit the LSA model.	`transform`
Latent semantic analysis (LSA)	A dimension-reducing technique based on principal component analysis (PCA).	`fitlsa`
Word scores	The scores of each word in each component of the LSA model.	`lsaModel`

Word Embeddings

Term	Definition	More Information
Word embedding	A model, popularized by the word2vec, GloVe, and fastText libraries, that maps words in a vocabulary to real vectors.	`wordEmbedding`
Word embedding layer	A deep learning network layer that learns a word embedding during training.	`wordEmbeddingLayer`
Word encoding	A model that maps words to numeric indices.	`wordEncoding`
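
The difference between a word encoding and a word embedding can be sketched as two lookups. The code below is plain Python with made-up vectors and hypothetical function names. A trained embedding would learn its vectors from data and typically use hundreds of dimensions.

```python
vocabulary = ["an", "example", "sentence"]

# Word encoding: map each word to a numeric index.
# Note: Python indices start at 0; other environments may start at 1.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Word embedding: map each word to a real vector, one row per vocabulary word.
# Assumption: these two-dimensional vectors are invented for illustration.
embedding = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.5, 0.5],
]


def encode(tokens):
    """Convert tokens to their numeric indices."""
    return [word_to_index[t] for t in tokens]


def embed(tokens):
    """Convert tokens to vectors by looking up each index in the embedding."""
    return [embedding[i] for i in encode(tokens)]


print(encode(["example", "an"]))
print(embed(["sentence"]))
```

A word embedding layer performs the same index-to-vector lookup inside a network and updates the vectors during training.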

Visualization

Term	Definition	More Information
Text scatter plot	A scatter plot with words plotted at specified coordinates instead of markers.	`textscatter`
Word cloud	A chart that displays words with sizes corresponding to numeric data, usually frequency counts.	`wordcloud`

See Also

addPartOfSpeechDetails | bagOfNgrams | bagOfWords | fitlda | normalizeWords | removeStopWords | textscatter | tokenDetails | tokenizedDocument | wordcloud | wordEmbedding | wordEmbeddingLayer | wordEncoding

Related Topics

Text Analytics Toolbox Documentation
