addDocument

Add documents to bag-of-words or bag-of-n-grams model

Description

newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.

Examples

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: [1x7 string]
        NumWords: 7
    NumDocuments: 2

Create another array of tokenized documents and add it to the same bag-of-words model.

documents = tokenizedDocument([ 
    "a third example of a short sentence" 
    "another short sentence"]);
newBag = addDocument(bag,documents)
newBag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: [1x9 string]
        NumWords: 9
    NumDocuments: 4

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn)
fds = 
  FileDatastore with properties:

                       Files: {
                              ' .../tp3a31d8eb/textanalytics-ex73762432/exampleSonnet1.txt';
                              ' .../tp3a31d8eb/textanalytics-ex73762432/exampleSonnet2.txt';
                              ' .../tp3a31d8eb/textanalytics-ex73762432/exampleSonnet3.txt'
                               ... and 1 more
                              }
                     Folders: {
                              ' .../mlx_to_docbook5/tp3a31d8eb/textanalytics-ex73762432'
                              }
                 UniformRead: 0
                    ReadMode: 'file'
                   BlockSize: Inf
                  PreviewFcn: @extractFileText
      SupportedOutputFormats: [1x16 string]
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: [1x276 string]
        NumWords: 276
    NumDocuments: 4

Input Arguments

bag: Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

documents: Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
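
The non-tokenizedDocument input forms can be sketched as follows. This is a minimal illustration, not part of the original examples; the variable names words, wordsCell, bagFromStrings, and bagFromCell are assumptions chosen for this sketch. Each row vector of words is treated as a single document.

```matlab
% Start from a bag-of-words model built from one tokenized document.
bag = bagOfWords(tokenizedDocument("a short sentence"));

% A 1-by-N string array of words represents one document.
words = ["another" "short" "sentence"];
bagFromStrings = addDocument(bag,words);

% A cell array of character vectors (row vector) also represents one document.
wordsCell = {'another','short','sentence'};
bagFromCell = addDocument(bag,wordsCell);

% Each call adds a single document to the one already in bag.
bagFromStrings.NumDocuments
bagFromCell.NumDocuments
```

To add several documents in one call, wrap the text in a tokenizedDocument array instead, as in the examples above.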

Output Arguments

newBag: Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.
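
As a brief sketch of the type-preservation rule, the following adds a document to a bag-of-n-grams model and checks the class of the result. It is illustrative only and assumes the default bagOfNgrams settings (bigrams); the document text is made up for this sketch.

```matlab
% Build a bag-of-n-grams model (bigrams by default) from two documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfNgrams(documents);

% Add one more document; the result has the same type as the input model.
newBag = addDocument(bag,tokenizedDocument("another short sentence"));
class(newBag)
newBag.NumDocuments
```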

Introduced in R2017b