topkngrams

Most frequent n-grams

Syntax

tbl = topkngrams(bag)

tbl = topkngrams(bag,k)

tbl = topkngrams(___,Name,Value)

Description

tbl = topkngrams(bag) returns a table of the five most frequent n-grams in the bag-of-n-grams model bag. By default, the function is case sensitive.


tbl = topkngrams(bag,k) returns a table of the k most frequent n-grams in the bag-of-n-grams model bag. By default, the function is case sensitive.


tbl = topkngrams(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples


Most Frequent Bigrams of Bag-of-N-Grams Model


Create a table of the most frequent bigrams of a bag-of-n-grams model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

Find the top 5 bigrams.

tbl = topkngrams(bag)

tbl=5×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2     
    "mine"    "eye"      15           2     
    "thy"     "self"     14           2     
    "thou"    "dost"     13           2     
    "mine"    "own"      13           2

Find the top 10 bigrams.

tbl = topkngrams(bag,10)

tbl=10×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "thou"    "art"       34           2     
    "mine"    "eye"       15           2     
    "thy"     "self"      14           2     
    "thou"    "dost"      13           2     
    "mine"    "own"       13           2     
    "thy"     "sweet"     12           2     
    "thy"     "love"      11           2     
    "dost"    "thou"      10           2     
    "thou"    "wilt"      10           2     
    "love"    "thee"       9           2

Count N-Grams of Different Lengths

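Load the example data. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.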

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].

bag = bagOfNgrams(documents,'NgramLengths',[2 3])

bag = 
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

View the 10 most common n-grams of length 2 (bigrams).

topkngrams(bag,10,'NgramLengths',2)

ans=10×3 table
             Ngram             Count    NgramLength
    _______________________    _____    ___________

    "thou"    "art"      ""     34           2     
    "mine"    "eye"      ""     15           2     
    "thy"     "self"     ""     14           2     
    "thou"    "dost"     ""     13           2     
    "mine"    "own"      ""     13           2     
    "thy"     "sweet"    ""     12           2     
    "thy"     "love"     ""     11           2     
    "dost"    "thou"     ""     10           2     
    "thou"    "wilt"     ""     10           2     
    "love"    "thee"     ""      9           2

View the 10 most common n-grams of length 3 (trigrams).

topkngrams(bag,10,'NgramLengths',3)

ans=10×3 table
               Ngram                Count    NgramLength
    ____________________________    _____    ___________

    "thy"     "sweet"    "self"       4           3     
    "why"     "dost"     "thou"       4           3     
    "thy"     "self"     "thy"        3           3     
    "thou"    "thy"      "self"       3           3     
    "mine"    "eye"      "heart"      3           3     
    "thou"    "shalt"    "find"       3           3     
    "fair"    "kind"     "true"       3           3     
    "thou"    "art"      "fair"       2           3     
    "love"    "thy"      "self"       2           3     
    "thy"     "self"     "thou"       2           3

Input Arguments


`bag` — Input bag-of-n-grams model
`bagOfNgrams` object

Input bag-of-n-grams model, specified as a bagOfNgrams object.

`k` — Number of n-grams
positive integer

Number of n-grams to return, specified as a positive integer.

Example: 20

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'NgramLengths',[2 3] returns the top n-grams among the bigrams and trigrams only.

`'NgramLengths'` — N-gram lengths
positive integer | vector of positive integers

N-gram lengths, specified as the comma-separated pair consisting of 'NgramLengths' and a positive integer or a vector of positive integers.

If you specify NgramLengths, then the function returns n-grams of these lengths only. If you do not specify NgramLengths, then the function returns the top n-grams regardless of length.

Example: [1 2 3]
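The effect of the length filter can be sketched on a small corpus. The two documents below are made up for illustration and are not part of the sonnets data; the variable names tblAll and tblTri are likewise only placeholders.

```matlab
% Illustrative two-document corpus (hypothetical data for this sketch).
documents = tokenizedDocument([
    "the quick fox the quick fox"
    "the quick fox jumps"]);

% Count both bigrams and trigrams.
bag = bagOfNgrams(documents,'NgramLengths',[2 3]);

% Without 'NgramLengths', bigrams and trigrams compete in one ranking.
tblAll = topkngrams(bag,4);

% With 'NgramLengths',3, only trigrams are ranked.
tblTri = topkngrams(bag,4,'NgramLengths',3);

disp(tblAll)
disp(tblTri)
```

In tblTri, every row has NgramLength equal to 3, and the first row is the trigram "the quick fox", which occurs three times across the two documents.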

`'IgnoreCase'` — Option to ignore case
`false` (default) | `true`

Option to ignore case, specified as the comma-separated pair consisting of 'IgnoreCase' and one of the following:

false – treat n-grams differing only by case as separate n-grams.
true – treat n-grams differing only by case as the same n-gram and merge counts.
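As a minimal sketch of the two settings, the following uses a made-up three-document corpus in which the same bigram appears with different capitalization. The corpus and the variable names tblSensitive and tblMerged are assumptions for illustration only.

```matlab
% Hypothetical corpus: the bigram "the cat" appears in three casings.
documents = tokenizedDocument([
    "The cat sat"
    "the cat ran"
    "THE CAT sat"]);
bag = bagOfNgrams(documents);

% Default (case sensitive): each casing is counted as its own bigram.
tblSensitive = topkngrams(bag,1);

% 'IgnoreCase',true: counts of bigrams differing only by case are merged.
tblMerged = topkngrams(bag,1,'IgnoreCase',true);

disp(tblSensitive.Count)   % largest count when casings are kept apart
disp(tblMerged.Count)      % largest count after merging
```

With the default setting, no bigram occurs more than once. With 'IgnoreCase',true, the three casings of "the cat" are merged into a single entry with a count of 3.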

`'ForceCellOutput'` — Indicator for forcing output to be returned as cell array
`false` (default) | `true`

Indicator for forcing output to be returned as cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical
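One possible use of this option is to write code that handles scalar and nonscalar inputs the same way. The sketch below assumes a one-document corpus made up for illustration; the variable names tbl and tbls are placeholders.

```matlab
% Hypothetical single-document corpus.
documents = tokenizedDocument("to be or not to be");
bag = bagOfNgrams(documents);

% Default: a scalar bag yields a table.
tbl = topkngrams(bag,3);

% Forced cell output: the table is wrapped in a cell array.
tbls = topkngrams(bag,3,'ForceCellOutput',true);

% The same loop then works whether or not the input was scalar.
for i = 1:numel(tbls)
    disp(tbls{i})
end
```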

Output Arguments


`tbl` — Table of top n-grams
table | cell array of tables

Top n-grams sorted by frequency in descending order, returned as a table or a cell array of tables.

The table has the following columns:

`Ngram`	N-gram, specified as a string vector.
`Count`	Number of times the n-gram appears in the bag-of-n-grams model.
`NgramLength`	Length of the n-gram.

If bag is a nonscalar array or 'ForceCellOutput' is true, then the function returns the output as a cell array of tables. Each element in the cell array is a table containing the top n-grams of the corresponding element of bag.
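When the bag holds n-grams of several lengths, shorter n-grams are padded with empty strings in the Ngram column, as in the bigram table of the second example. A minimal sketch of turning each row into a single phrase, assuming a made-up one-document corpus and the placeholder variable name phrases:

```matlab
% Hypothetical single-document corpus with bigrams and trigrams counted.
documents = tokenizedDocument("to be or not to be");
bag = bagOfNgrams(documents,'NgramLengths',[2 3]);

tbl = topkngrams(bag,5);

% Join the words of each row with spaces (dimension 2 = across columns),
% then strip the trailing space left by the "" padding of bigrams.
phrases = strip(join(tbl.Ngram," ",2));

% Pair each phrase with its count for display.
disp(table(phrases,tbl.Count,'VariableNames',{'Phrase','Count'}))
```

Here the most frequent entry is the bigram "to be", which occurs twice; every other n-gram in this corpus occurs once.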
