removeInfrequentNgrams

Remove infrequently seen n-grams from bag-of-n-grams model

Syntax

newBag = removeInfrequentNgrams(bag,count)

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths)

Description

newBag = removeInfrequentNgrams(bag,count) removes the n-grams that appear at most count times in total from the bag-of-n-grams model bag.
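
Because the threshold is inclusive, an n-gram that appears exactly count times is also removed. The following is a minimal sketch on a made-up two-document corpus; the documents and the variable names totals and newBag are illustrative assumptions, not part of the shipped example.

```matlab
% Toy corpus (illustrative only, not from the sonnets example).
documents = tokenizedDocument([
    "a b a b"
    "a b c"]);

% bagOfNgrams counts bigrams by default.
bag = bagOfNgrams(documents);

% Total count of each bigram across all documents.
% Counting by hand: "a b" occurs 3 times, "b a" once, "b c" once.
totals = full(sum(bag.Counts,1));
disp([bag.Ngrams, string(totals')])

% Remove bigrams whose total is 1 or less; only "a b" should remain.
newBag = removeInfrequentNgrams(bag,1);
disp(newBag.Ngrams)
```

To keep n-grams that appear exactly once, there is no smaller valid threshold, since count must be a positive integer; in that case, skip the call.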

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths) only removes n-grams with lengths specified by lengths.

Examples

Remove Infrequent N-Grams from Bag-of-N-Grams Model

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model that counts bigrams (pairs of words) and trigrams (triples of words).

bag = bagOfNgrams(documents,'NgramLengths',[2 3])

bag = 
  bagOfNgrams with properties:

          Counts: [154x18022 double]
      Vocabulary: [1x3092 string]
          Ngrams: [18022x3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

Remove n-grams of any length that appear two or fewer times in total.

bag = removeInfrequentNgrams(bag,2)

bag = 
  bagOfNgrams with properties:

          Counts: [154x103 double]
      Vocabulary: [1x73 string]
          Ngrams: [103x3 string]
    NgramLengths: [2 3]
       NumNgrams: 103
    NumDocuments: 154

Remove bigrams that appear four or fewer times in total.

bag = removeInfrequentNgrams(bag,4,'NgramLengths',2)

bag = 
  bagOfNgrams with properties:

          Counts: [154x41 double]
      Vocabulary: [1x30 string]
          Ngrams: [41x3 string]
    NgramLengths: [2 3]
       NumNgrams: 41
    NumDocuments: 154
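
One way to check what a length-restricted call removes is to predict the number of affected n-grams from the Counts and Ngrams properties and compare it with the change in NumNgrams. The sketch below is not part of the original example. It assumes that Counts has one column per n-gram, and that rows of Ngrams holding bigrams are padded with empty strings when the model also holds trigrams. For simplicity it applies the threshold to a freshly built bag rather than to the already reduced one.

```matlab
% Rebuild the bag from the example data.
str = extractFileText("sonnetsPreprocessed.txt");
documents = tokenizedDocument(split(str,newline));
bag = bagOfNgrams(documents,'NgramLengths',[2 3]);

% Total count and word length of every n-gram before removal.
totals = full(sum(bag.Counts,1));          % 1-by-NumNgrams totals
ngramLengths = sum(bag.Ngrams ~= "",2)';   % assumed: 2 for bigrams, 3 for trigrams

% Number of bigrams that a threshold of 4 should remove.
numToRemove = nnz(totals <= 4 & ngramLengths == 2);

newBag = removeInfrequentNgrams(bag,4,'NgramLengths',2);

% The drop in NumNgrams should match the prediction.
numRemoved = bag.NumNgrams - newBag.NumNgrams;
fprintf("Predicted %d, removed %d\n",numToRemove,numRemoved);
```

Trigrams with four or fewer occurrences stay in newBag, because the call restricts removal to bigrams.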

Input Arguments

`bag` — Input bag-of-n-grams model
`bagOfNgrams` object

Input bag-of-n-grams model, specified as a bagOfNgrams object.

`count` — Count threshold
positive integer

Count threshold, specified as a positive integer. The function removes every n-gram whose total count across all documents is less than or equal to count.

`lengths` — N-gram lengths
positive integer | vector of positive integers

N-gram lengths, specified as a positive integer or a vector of positive integers. Pass this value after the 'NgramLengths' option name.

If you specify lengths, the function removes infrequent n-grams of the specified lengths only. If you do not specify lengths, then the function removes infrequent n-grams regardless of length.

Example: [1 2 3]
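
As a sketch of how the lengths value narrows the removal, the made-up corpus below contrasts a scalar with a vector. The documents and the variable names bigramsOnly and allLengths are illustrative assumptions.

```matlab
% Toy corpus (illustrative only).
documents = tokenizedDocument([
    "x y z"
    "x y z"
    "x y w"]);
bag = bagOfNgrams(documents,'NgramLengths',[2 3]);
% Counting by hand:
%   bigrams:  "x y" = 3, "y z" = 2, "y w" = 1
%   trigrams: "x y z" = 2, "x y w" = 1

% Scalar: remove only bigrams that appear 2 or fewer times.
% "y z" and "y w" go; both trigrams stay despite their low counts.
bigramsOnly = removeInfrequentNgrams(bag,2,'NgramLengths',2);

% Vector: apply the same threshold to bigrams and trigrams.
% Only "x y" should remain.
allLengths = removeInfrequentNgrams(bag,2,'NgramLengths',[2 3]);

disp(bigramsOnly.NumNgrams)
disp(allLengths.NumNgrams)
```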

Output Arguments

`newBag` — Output bag-of-n-grams model
`bagOfNgrams` object

Output bag-of-n-grams model, returned as a bagOfNgrams object.

Introduced in R2018a
