rakeKeywords

Extract keywords using RAKE

Syntax

tbl = rakeKeywords(documents)

tbl = rakeKeywords(documents,Name,Value)

Description

tbl = rakeKeywords(documents) extracts keywords and respective scores using the Rapid Automatic Keyword Extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.

tbl = rakeKeywords(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Tip

By default, the rakeKeywords function uses stop words and punctuation characters as delimiters to identify keywords. When you use the default values for the 'Delimiters' and 'MergingDelimiters' options, do not remove stop words or punctuation characters from the input text.

Examples

Extract Keywords Using RAKE

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the keywords using the rakeKeywords function.

tbl = rakeKeywords(documents)

tbl=12×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____

    "MATLAB"        "provides"    "tools"              1             8  
    "MATLAB"        ""            ""                   1             2  
    "scientists"    "and"         "engineers"          1             2  
    "engineers"     ""            ""                   1             1  
    "scientists"    ""            ""                   1             1  
    "Analyze"       "text"        ""                   2             4  
    "import"        "text"        ""                   2             4  
    "images"        ""            ""                   2             1  
    "Analyze"       "text"        ""                   3             4  
    "MATLAB"        ""            ""                   3             1  
    "images"        ""            ""                   3             1  
    "videos"        ""            ""                   3             1

If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=12×3 table
             Keyword              DocumentNumber    Score
    __________________________    ______________    _____

    "MATLAB provides tools"             1             8  
    "MATLAB"                            1             2  
    "scientists and engineers"          1             2  
    "engineers"                         1             1  
    "scientists"                        1             1  
    "Analyze text"                      2             4  
    "import text"                       2             4  
    "images"                            2             1  
    "Analyze text"                      3             4  
    "MATLAB"                            3             1  
    "images"                            3             1  
    "videos"                            3             1

Specify Maximum Number of Keywords Per Document

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the top two keywords using the rakeKeywords function and setting the 'MaxNumKeywords' option to 2.

tbl = rakeKeywords(documents,'MaxNumKeywords',2)

tbl=6×3 table
                 Keyword                  DocumentNumber    Score
    __________________________________    ______________    _____

    "MATLAB"     "provides"    "tools"          1             8  
    "MATLAB"     ""            ""               1             2  
    "Analyze"    "text"        ""               2             4  
    "import"     "text"        ""               2             4  
    "Analyze"    "text"        ""               3             4  
    "MATLAB"     ""            ""               3             1

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=6×3 table
            Keyword            DocumentNumber    Score
    _______________________    ______________    _____

    "MATLAB provides tools"          1             8  
    "MATLAB"                         1             2  
    "Analyze text"                   2             4  
    "import text"                    2             4  
    "Analyze text"                   3             4  
    "MATLAB"                         3             1

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
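As a minimal sketch of the single-document form described above, the following passes a string row vector in which each element is one word or punctuation mark. The variable name words and the sample sentence are illustrative assumptions, not part of the original page.

```matlab
% A single document given as a row vector of words (one token per element).
% Stop words and punctuation are left in place so that the default
% delimiters can split the text into candidate keywords.
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];

% Extract keywords from the single document.
tbl = rakeKeywords(words)
```

To process several documents in one call, convert the text with tokenizedDocument instead.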

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: rakeKeywords(documents,'MaxNumKeywords',20) returns at most 20 keywords per document.

`'MaxNumKeywords'` — Maximum number of keywords to return per document
`Inf` (default) | positive integer

Maximum number of keywords to return per document, specified as the comma-separated pair consisting of 'MaxNumKeywords' and a positive integer or Inf.

If MaxNumKeywords is Inf, then the function returns all identified keywords.

`'Delimiters'` — Tokens for splitting documents into keywords
string array | character vector | cell array of character vectors

Tokens for splitting documents into keywords, specified as the comma-separated pair consisting of 'Delimiters' and a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.

The default list of delimiters is a list of punctuation characters.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

To specify delimiters for merging, use the 'MergingDelimiters' option.

Delimiter matching is case insensitive.

Data Types: char | string | cell

`'MergingDelimiters'` — Delimiters also used for merging keywords
string array | character vector | cell array of character vectors

Delimiters also used for merging keywords, specified as the comma-separated pair consisting of 'MergingDelimiters' and a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.

The default list of merging delimiters is the list of stop words given by the stopWords function.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

To specify delimiters that should not be used for merging, use the 'Delimiters' option.

Delimiter matching is case insensitive.

Data Types: char | string | cell
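The two delimiter options can be combined in one call. The sketch below is illustrative only: the particular delimiter lists are assumptions chosen for this sentence, not the function defaults.

```matlab
% Tokenize a short example sentence.
documents = tokenizedDocument("MATLAB provides tools for scientists and engineers.");

% Split candidate keywords at the listed punctuation and at the words
% "for" and "and". Only the merging delimiters ("for", "and") can later
% be absorbed into a merged keyword.
tbl = rakeKeywords(documents, ...
    'Delimiters',["." "," ";"], ...
    'MergingDelimiters',["for" "and"])
```

Replacing the default stop word list with a short custom list like this means that other common words, if present, are treated as part of the candidate keywords.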

Output Arguments

`tbl` — Extracted keywords and scores
table

Extracted keywords and scores, returned as a table with the following variables:

- Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.
- DocumentNumber – Number of the document containing the corresponding keyword.
- Score – Score of the keyword.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For more information, see Rapid Automatic Keyword Extraction.

More About

Language Considerations

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses punctuation characters and the stop words given by the stopWords function as delimiters, where the stop word language is determined by the language details of the input documents.

For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.
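For a language without built-in support, one possible approach is to pass the words directly and supply your own delimiter lists. In the sketch below, the Spanish sentence and the short stop word list spanishStopWords are hypothetical and for illustration only; a real application would need a fuller list.

```matlab
% A single Spanish document given as a row vector of words.
words = ["El" "análisis" "de" "texto" "y" "el" "procesamiento" "de" "imágenes" "."];

% Hypothetical, deliberately short stop word list for illustration.
spanishStopWords = ["el" "la" "los" "las" "de" "y" "en"];

% Use punctuation as plain delimiters and the stop words as merging delimiters.
tbl = rakeKeywords(words, ...
    'Delimiters',["." "," ";"], ...
    'MergingDelimiters',spanishStopWords)
```

Because delimiter matching is case insensitive, "El" at the start of the sentence matches the lowercase entry "el".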

Tips

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.
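A simple way to compare the two approaches is to run both functions on the same documents. The sketch below reuses the example text from this page; the variable names tblRake and tblTextRank and the limit of three keywords per document are illustrative choices.

```matlab
% Example text reused from the examples above.
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."];
documents = tokenizedDocument(textData);

% Extract up to three keywords per document with each algorithm.
tblRake = rakeKeywords(documents,'MaxNumKeywords',3)
tblTextRank = textrankKeywords(documents,'MaxNumKeywords',3)
```

The scores produced by the two algorithms are computed differently, so compare the ranked keywords within each table rather than the score values across tables.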

Algorithms

Rapid Automatic Keyword Extraction

For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:

1. Determine candidate keywords:
   - Extract sequences of tokens between the delimiters specified by the 'Delimiters' and 'MergingDelimiters' options. The function treats each sequence as a single candidate keyword.
2. Calculate scores for the candidate keywords:
   - Create an undirected, unweighted graph with nodes corresponding to the individual tokens in the candidate keywords.
   - Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences, weighted by the number of candidate keywords containing that co-occurrence.
   - Score each token using the formula deg(token) / freq(token), where deg(token) is the number of edges for the specified token and freq(token) is the number of times that the specified token occurs in the document.
   - For each candidate keyword, assign a score given by the sum of scores of the contained tokens.
3. Extract top keywords from candidates:
   - If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.
   - Return the top k keywords, where k is given by the 'MaxNumKeywords' option.
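The scoring steps can be sketched in plain MATLAB for the first document of the example on this page. This is an illustrative sketch, not the implementation used by rakeKeywords. It makes two assumptions: the candidate list is hard-coded rather than derived from the delimiters, and deg(token) counts every co-occurrence of a token with the tokens of its candidate keywords, including itself. Under those assumptions, the sketch reproduces the scores shown in the example table.

```matlab
% Candidate keywords of the first example document, assumed to result from
% splitting at punctuation and stop words (hard-coded for illustration).
candidates = {
    ["MATLAB" "provides" "tools"]
    "scientists"
    "engineers"
    "MATLAB"
    "scientists"
    "engineers"};

tokens = unique([candidates{:}]);   % distinct tokens across all candidates
deg  = zeros(size(tokens));         % co-occurrence counts, including self
freq = zeros(size(tokens));         % number of occurrences in the document

for c = 1:numel(candidates)
    words = candidates{c};
    for w = words
        idx = find(tokens == w);
        freq(idx) = freq(idx) + 1;
        % Each occurrence co-occurs with every token of its candidate,
        % itself included (assumes no repeated token within a candidate).
        deg(idx) = deg(idx) + numel(words);
    end
end
tokenScore = deg ./ freq;

% Score of a candidate keyword = sum of the scores of its tokens.
candidateScore = zeros(numel(candidates),1);
for c = 1:numel(candidates)
    [~,idx] = ismember(candidates{c},tokens);
    candidateScore(c) = sum(tokenScore(idx));
end

% "scientists" and "engineers" are twice separated by the merging delimiter
% "and", so the merged keyword receives the sum of the two scores.
mergedScore = candidateScore(2) + candidateScore(3);
```

Here candidateScore(1) is 8 for "MATLAB provides tools", candidateScore(4) is 2 for "MATLAB", and mergedScore is 2 for "scientists and engineers", matching the first three rows of the example output.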

Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' name-value pair argument of tokenizedDocument. To view the token details, use the tokenDetails function.
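As a brief sketch of the workflow described above, the following sets the language explicitly when tokenizing and then inspects the token details before extracting keywords. The example sentence and the variable name tdetails are illustrative.

```matlab
% Tokenize the text and set the language details to English explicitly
% instead of relying on automatic detection.
documents = tokenizedDocument( ...
    "MATLAB provides tools for scientists and engineers.", ...
    'Language','en');

% Inspect the token details, which include a Language variable.
tdetails = tokenDetails(documents);
head(tdetails)

% The language details determine the default stop words used as
% merging delimiters.
tbl = rakeKeywords(documents)
```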

References

[1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.
