Text Analytics Glossary

This section provides a list of terms used in text analytics.

Documents and Tokens

| Term | Definition | More Information |
| --- | --- | --- |
| Bigram | Two tokens in succession. For example, ["New" "York"]. | bagOfNgrams |
| Complex token | A token with complex structure. For example, an email address or a hashtag. | tokenDetails |
| Context | Tokens or characters that surround a given token. | context |
| Corpus | A collection of documents. | tokenizedDocument |
| Document | A single observation of text data. For example, a report, a tweet, or an article. | tokenizedDocument |
| Grapheme | A human-readable character. A grapheme can consist of multiple Unicode code points. For example, "a", "😎", or "語". | splitGraphemes |
| N-gram | N tokens in succession. | bagOfNgrams |
| Part of speech | Categories of words used in grammatical structure. For example, "noun", "verb", and "adjective". | addPartOfSpeechDetails |
| Token | A string of characters representing a unit of text data, also known as a "unigram". For example, a word, number, or email address. | tokenizedDocument |
| Token details | Information about the token. For example, type, language, or part-of-speech details. | tokenDetails |
| Token types | The category of the token. For example, "letters", "punctuation", or "email address". | tokenDetails |
| Tokenized document | A document split into tokens. | tokenizedDocument |
| Trigram | Three tokens in succession. For example, ["The" "United" "States"]. | bagOfNgrams |
| Vocabulary | Unique words or tokens in a corpus or model. | tokenizedDocument |
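
To make the token, n-gram, and vocabulary definitions above concrete, here is a minimal sketch in plain Python. It is an illustration of the concepts only, not the behavior of the functions listed in the table. The helper name `ngrams` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: splitting on whitespace stands in for real tokenization, which
# would also handle punctuation and complex tokens such as email addresses.
document = "the United States and the United Kingdom"
tokens = document.split()

bigrams = ngrams(tokens, 2)        # two tokens in succession
trigrams = ngrams(tokens, 3)       # three tokens in succession
vocabulary = sorted(set(tokens))   # unique tokens in the document
bigram_counts = Counter(bigrams)   # how often each bigram occurs
```

A unigram is the special case `ngrams(tokens, 1)`, which returns each token wrapped in a one-element tuple.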

Preprocessing

| Term | Definition | More Information |
| --- | --- | --- |
| Normalize | Reduce words to a root form. For example, reduce the word "walking" to "walk" using stemming or lemmatization. | normalizeWords |
| Lemmatize | Reduce words to a dictionary word (the lemma form). For example, reduce the words "running" and "ran" to "run". | normalizeWords |
| Stem | Reduce words by removing inflections. The reduced word is not necessarily a real word. For example, the Porter stemmer reduces the words "happy" and "happiest" to "happi". | normalizeWords |
| Stop words | Words commonly removed before analysis. For example, "and", "of", and "the". | removeStopWords |
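
The preprocessing terms above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the stop word list and the lemma lookup table are tiny hand-written stand-ins, and the helper names are hypothetical. A real lemmatizer relies on a full dictionary and part-of-speech information.

```python
# Assumption: a three-word stop list and a hand-written lemma table,
# both far smaller than anything used in practice.
STOP_WORDS = {"and", "of", "the"}
LEMMAS = {"running": "run", "ran": "run", "walking": "walk"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def lemmatize(tokens):
    """Lowercase each token and replace it with its lemma when one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

tokens = "The dog ran and the cat was running".split()
cleaned = lemmatize(remove_stop_words(tokens))
```

After both steps, "ran" and "running" map to the same lemma, so later counting treats them as one word.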

Modeling and Prediction

Bag-of-Words

| Term | Definition | More Information |
| --- | --- | --- |
| Bag-of-n-grams model | A model that records the number of times that n-grams appear in each document of a corpus. | bagOfNgrams |
| Bag-of-words model | A model that records the number of times that words appear in each document of a collection. | bagOfWords |
| Term frequency count matrix | A matrix of the frequency counts of words occurring in a collection of documents corresponding to a given vocabulary. This matrix is the underlying data of a bag-of-words model. | bagOfWords |
| Term Frequency–Inverse Document Frequency (tf-idf) matrix | A matrix of tf-idf scores. The tf-idf score is a statistical measure based on the word frequency counts in documents and the proportion of documents containing the words in the corpus. | tfidf |
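
As a rough illustration of the count matrix and tf-idf matrix described above, the following plain Python sketch builds both for a three-document corpus. It assumes one common weighting, raw term count multiplied by log(N / document frequency). Other variants exist, and the functions listed in the table may use different options or defaults.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (made up for the example).
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird", "flew"],
]

vocabulary = sorted({word for doc in documents for word in doc})

# Term frequency count matrix: one row per document, one column per word.
counts = [[Counter(doc)[word] for word in vocabulary] for doc in documents]

# Document frequency: number of documents that contain each word.
n_docs = len(documents)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Assumed weighting: tf-idf = count * log(N / document frequency).
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf = [[tf * weight for tf, weight in zip(row, idf)] for row in counts]
```

Under this weighting, a word that occurs in every document, such as "the", scores zero everywhere, while a word concentrated in few documents scores higher.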

Latent Dirichlet Allocation

| Term | Definition | More Information |
| --- | --- | --- |
| Corpus topic probabilities | The probabilities of observing each topic in the corpus used to fit the LDA model. | ldaModel |
| Document topic probabilities | The probabilities of observing each topic in each document used to fit the LDA model. Equivalently, the topic mixtures of the training documents. | ldaModel |
| Latent Dirichlet allocation (LDA) | A generative statistical topic model that infers topic probabilities in documents and word probabilities in topics. | fitlda |
| Perplexity | A statistical measure of how well a model describes the given data. A lower perplexity indicates a better fit. | logp |
| Topic | A distribution of words, characterized by the "topic word probabilities". | ldaModel |
| Topic concentration | The concentration parameter of the underlying Dirichlet distribution of the corpus topic mixtures. | ldaModel |
| Topic mixture | The probabilities of topics in a given document. | transform |
| Topic word probabilities | The probabilities of words in a given topic. | ldaModel |
| Word concentration | The concentration parameter of the underlying Dirichlet distribution of the topics. | ldaModel |
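
The sketch below shows how topic word probabilities, a topic mixture, and perplexity relate to each other. It is plain Python with made-up numbers, not a fitted model, and it does not reproduce the functions listed in the table. It assumes the common per-token definition of perplexity, the exponential of the negative average log-probability of the tokens. All names in the code are hypothetical.

```python
import math

vocabulary = ["ball", "game", "vote", "law"]

# Topic word probabilities: one distribution over the vocabulary per topic.
# The values are invented for illustration; each row sums to 1.
topic_word_probs = [
    [0.50, 0.40, 0.05, 0.05],  # a sports-like topic
    [0.05, 0.05, 0.50, 0.40],  # a politics-like topic
]

# Topic mixture of one document: the probability of each topic.
topic_mixture = [0.8, 0.2]

def word_probability(word):
    """Probability of a word in the document: mixture-weighted sum over topics."""
    j = vocabulary.index(word)
    return sum(theta * topic[j] for theta, topic in zip(topic_mixture, topic_word_probs))

def perplexity(tokens):
    """exp(-average log-probability per token); lower means a better fit."""
    log_likelihood = sum(math.log(word_probability(t)) for t in tokens)
    return math.exp(-log_likelihood / len(tokens))

sports_text = ["ball", "game", "game"]
politics_text = ["vote", "law", "law"]
```

Because the mixture favors the first topic, `perplexity(sports_text)` comes out lower than `perplexity(politics_text)`, which matches the rule that a lower perplexity indicates a better fit.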

Latent Semantic Analysis

| Term | Definition | More Information |
| --- | --- | --- |
| Component weights | The singular values of the decomposition, squared. | lsaModel |
| Document scores | The score vectors in lower dimensional space of the documents used to fit the LSA model. | transform |
| Latent semantic analysis (LSA) | A dimensionality reduction technique based on principal component analysis (PCA). | fitlsa |
| Word scores | The scores of each word in each component of the LSA model. | lsaModel |
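
To show how component weights, word scores, and document scores relate, here is a small plain Python sketch that approximates only the first component of a count matrix. It uses power iteration on X^T X as a stand-in for a full singular value decomposition. That technique, the function name, and the choice to report document scores as X times the word score vector are all assumptions made for the example, and may not match the scaling used by the functions listed in the table.

```python
import math

def leading_component(counts, iterations=100):
    """Approximate the first LSA component of a document-by-word count matrix.

    Returns (word_scores, component_weight, document_scores), where
    component_weight approximates the largest singular value squared.
    """
    n_words = len(counts[0])
    # Start from a uniform unit vector; adequate for non-negative count data.
    v = [1.0 / math.sqrt(n_words)] * n_words
    weight = 0.0
    for _ in range(iterations):
        # scores = X v, one value per document
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in counts]
        # w = X^T scores = (X^T X) v
        w = [sum(row[j] * s for row, s in zip(counts, scores)) for j in range(n_words)]
        weight = math.sqrt(sum(x * x for x in w))  # tends to sigma^2 as v converges
        if weight == 0.0:
            break
        v = [x / weight for x in w]
    document_scores = [sum(x * vj for x, vj in zip(row, v)) for row in counts]
    return v, weight, document_scores

# Toy count matrix: rows are documents, columns are words.
counts = [
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 1],
]
word_scores, component_weight, document_scores = leading_component(counts)
```

In this toy matrix the first two words always occur together, so they receive equal word scores in the leading component, and the third document, which uses neither word, gets a document score close to zero.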

Word Embeddings

| Term | Definition | More Information |
| --- | --- | --- |
| Word embedding | A model, popularized by the word2vec, GloVe, and fastText libraries, that maps words in a vocabulary to real vectors. | wordEmbedding |
| Word embedding layer | A deep learning network layer that learns a word embedding during training. | wordEmbeddingLayer |
| Word encoding | A model that maps words to numeric indices. | wordEncoding |
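
The difference between a word encoding and a word embedding can be sketched in plain Python as two lookups: words to indices, then indices to vectors. The indices here are zero-based by Python convention, the two-dimensional vectors are invented for illustration rather than learned, and the helper name is hypothetical.

```python
documents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Word encoding: map each vocabulary word to a numeric index.
vocabulary = sorted({word for doc in documents for word in doc})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def doc_to_sequence(doc):
    """Convert a tokenized document to indices, skipping out-of-vocabulary words."""
    return [word_to_index[w] for w in doc if w in word_to_index]

# "barked" is not in the vocabulary, so it is dropped.
sequence = doc_to_sequence(["the", "dog", "barked"])

# Word embedding: one real vector per index. These values are made up;
# in practice they are learned, for example by a word embedding layer.
embedding = [
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # sat
    [0.0, 0.1],  # the
]
vectors = [embedding[i] for i in sequence]
```

A sequence of indices like this is the typical input to a word embedding layer, which replaces the fixed table above with weights learned during training.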

Visualization

| Term | Definition | More Information |
| --- | --- | --- |
| Text scatter plot | A scatter plot with words plotted at specified coordinates instead of markers. | textscatter |
| Word cloud | A chart that displays words with sizes corresponding to numeric data, usually frequency counts. | wordcloud |
