This section lists and defines terms used in text analytics.
Term | Definition | More Information |
---|---|---|
Bigram | Two tokens in succession. For example, ["New" "York"]. | bagOfNgrams |
Complex token | A token with complex structure. For example, an email address or a hashtag. | tokenDetails |
Context | Tokens or characters that surround a given token. | context |
Corpus | A collection of documents. | tokenizedDocument |
Document | A single observation of text data. For example, a report, a tweet, or an article. | tokenizedDocument |
Grapheme | A human-readable character. A grapheme can consist of multiple Unicode code points. For example, "a", "😎", or "語". | splitGraphemes |
N-gram | N tokens in succession. | bagOfNgrams |
Part of speech | Categories of words used in grammatical structure. For example, "noun", "verb", and "adjective". | addPartOfSpeechDetails |
Token | A string of characters representing a unit of text data, also known as a "unigram". For example, a word, number, or email address. | tokenizedDocument |
Token details | Information about the token. For example, type, language, or part-of-speech details. | tokenDetails |
Token types | The category of the token. For example, "letters", "punctuation", or "email address". | tokenDetails |
Tokenized document | A document split into tokens. | tokenizedDocument |
Trigram | Three tokens in succession. For example, ["The" "United" "States"]. | bagOfNgrams |
Vocabulary | Unique words or tokens in a corpus or model. | tokenizedDocument |
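For illustration, here is a minimal sketch of these terms using the functions listed in the table; the sample strings are invented:

```matlab
% Tokenize a tiny two-document corpus.
str = [
    "New York is in the United States."
    "Email me at user@example.com #textanalytics"];
documents = tokenizedDocument(str);

% Token details include the type of each token, such as
% "letters", "email-address", or "hashtag".
tdetails = tokenDetails(documents);

% Count bigrams (n-grams with n = 2) across the corpus.
bag = bagOfNgrams(documents,'NgramLengths',2);
```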
Term | Definition | More Information |
---|---|---|
Normalize | Reduce words to a root form. For example, reduce the word "walking" to "walk" using stemming or lemmatization. | normalizeWords |
Lemmatize | Reduce words to a dictionary word (the lemma form). For example, reduce the words "running" and "ran" to "run". | normalizeWords |
Stem | Reduce words by removing inflections. The reduced word is not necessarily a real word. For example, the Porter stemmer reduces the words "happy" and "happiest" to "happi". | normalizeWords |
Stop words | Words commonly removed before analysis. For example, "and", "of", and "the". | removeStopWords |
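A minimal sketch of these preprocessing steps on invented sample text:

```matlab
documents = tokenizedDocument([
    "The dogs were running and the happiest dog won"
    "She ran to the dog"]);

% Remove common stop words such as "and", "of", and "the".
documents = removeStopWords(documents);

% Lemmatize; adding part-of-speech details first improves the
% lemma lookup (for example, "running" and "ran" become "run").
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Stemming, by contrast, strips inflections and can produce
% non-words: the Porter stemmer maps "happiest" to "happi".
stems = normalizeWords(["happy" "happiest"]);
```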
Term | Definition | More Information |
---|---|---|
Bag-of-n-grams model | A model that records the number of times that n-grams appear in each document of a corpus. | bagOfNgrams |
Bag-of-words model | A model that records the number of times that words appear in each document of a corpus. | bagOfWords |
Term frequency count matrix | A matrix of the frequency counts of words from a given vocabulary occurring in each document of a corpus. This matrix is the underlying data of a bag-of-words model. | bagOfWords |
Term frequency-inverse document frequency (tf-idf) matrix | A matrix of statistical measures based on the word frequency counts in documents and the proportion of documents in the corpus containing those words. | tfidf |
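A minimal sketch showing a bag-of-words model, its underlying term frequency count matrix, and the tf-idf weighting; the documents are invented:

```matlab
documents = tokenizedDocument([
    "an example of a short document"
    "a second short document"]);
bag = bagOfWords(documents);

% The term frequency count matrix underlying the model
% (stored sparse; converted to full here for display).
counts = full(bag.Counts);

% tf-idf weights the counts down for words that appear in
% many documents and up for words concentrated in a few.
M = full(tfidf(bag));
```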
Term | Definition | More Information |
---|---|---|
Corpus topic probabilities | The probabilities of observing each topic in the corpus used to fit the LDA model. | ldaModel |
Document topic probabilities | The probabilities of observing each topic in each document used to fit the LDA model. Equivalently, the topic mixtures of the training documents. | ldaModel |
Latent Dirichlet allocation (LDA) | A generative statistical topic model that infers topic probabilities in documents and word probabilities in topics. | fitlda |
Perplexity | A statistical measure of how well a model describes the given data. A lower perplexity indicates a better fit. | logp |
Topic | A distribution of words, characterized by the "topic word probabilities". | ldaModel |
Topic concentration | The concentration parameter of the underlying Dirichlet distribution of the corpus topic mixtures. | ldaModel |
Topic mixture | The probabilities of topics in a given document. | transform |
Topic word probabilities | The probabilities of words in a given topic. | ldaModel |
Word concentration | The concentration parameter of the underlying Dirichlet distribution of the topics. | ldaModel |
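A minimal sketch of fitting and querying an LDA model; a real topic model needs far more documents than this invented toy corpus:

```matlab
documents = tokenizedDocument([
    "stocks markets trading prices fall"
    "team players game season win"
    "markets prices stocks rise"
    "game team score players"]);
bag = bagOfWords(documents);

% Fit a two-topic LDA model.
mdl = fitlda(bag,2,'Verbose',0);

% Corpus topic probabilities, and the topic mixtures
% (document topic probabilities) of the training documents.
p = mdl.CorpusTopicProbabilities;
mix = transform(mdl,documents);

% Perplexity of the model on the training data; lower is better.
[~,ppl] = logp(mdl,bag);
```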
Term | Definition | More Information |
---|---|---|
Component weights | The singular values of the decomposition, squared. | lsaModel |
Document scores | The score vectors in lower dimensional space of the documents used to fit the LSA model. | transform |
Latent semantic analysis (LSA) | A dimension-reducing technique based on principal component analysis (PCA). | fitlsa |
Word scores | The scores of each word in each component of the LSA model. | lsaModel |
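A minimal sketch of fitting an LSA model and projecting documents into the reduced space, with invented documents:

```matlab
documents = tokenizedDocument([
    "the cat sat on the mat"
    "the dog sat on the log"
    "cats and dogs"]);
bag = bagOfWords(documents);

% Fit an LSA model with two components.
mdl = fitlsa(bag,2);

% Component weights (the squared singular values) and the
% document score vectors in the two-dimensional space.
w = mdl.ComponentWeights;
scores = transform(mdl,bag);
```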
Term | Definition | More Information |
---|---|---|
Word embedding | A model, popularized by the word2vec, GloVe, and fastText libraries, that maps the words in a vocabulary to real-valued vectors. | wordEmbedding |
Word embedding layer | A deep learning network layer that learns a word embedding during training. | wordEmbeddingLayer |
Word encoding | A model that maps words to numeric indices. | wordEncoding |
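A minimal sketch contrasting a word encoding (words to indices) with a word embedding layer (words to learned vectors); the text is invented, and wordEmbeddingLayer additionally requires Deep Learning Toolbox:

```matlab
documents = tokenizedDocument([
    "a short example document"
    "another example"]);

% Map each word in the vocabulary to a numeric index.
enc = wordEncoding(documents);
idx = word2ind(enc,"example");

% A layer that learns a 50-dimensional embedding during
% training; the dimension here is an arbitrary choice.
layer = wordEmbeddingLayer(50,enc.NumWords);
```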
Term | Definition | More Information |
---|---|---|
Text scatter plot | A scatter plot with words plotted at specified coordinates instead of markers. | textscatter |
Word cloud | A chart that displays words with sizes corresponding to numeric data, usually frequency counts. | wordcloud |
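A minimal sketch of both visualizations; the scatter coordinates here are random for illustration, where in practice they would come from, for example, LSA document scores or a 2-D projection of a word embedding:

```matlab
documents = tokenizedDocument([
    "fox fox fox dog dog cat"
    "dog cat bird"]);
bag = bagOfWords(documents);

% Word cloud with word sizes driven by frequency counts.
figure
wordcloud(bag);

% Text scatter plot: words drawn at 2-D coordinates in
% place of markers (random coordinates for illustration).
words = bag.Vocabulary;
figure
textscatter(rand(numel(words),2),words);
```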
See Also: addPartOfSpeechDetails | bagOfNgrams | bagOfWords | fitlda | normalizeWords | removeStopWords | textscatter | tokenDetails | tokenizedDocument | wordcloud | wordEmbedding | wordEmbeddingLayer | wordEncoding