rakeKeywords

Extract keywords using RAKE

    Description

    tbl = rakeKeywords(documents) extracts keywords and their scores using the Rapid Automatic Keyword Extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.

    tbl = rakeKeywords(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

    Tip

    By default, the rakeKeywords function uses stop words and punctuation characters as delimiters to identify keywords. When using the default values for the 'Delimiters' and 'MergingDelimiters' options, do not remove stop words or punctuation characters from the input text.

    Examples

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the keywords using the rakeKeywords function.

    tbl = rakeKeywords(documents)
    tbl=12×3 table
                         Keyword                     DocumentNumber    Score
        _________________________________________    ______________    _____
    
        "MATLAB"        "provides"    "tools"              1             8  
        "MATLAB"        ""            ""                   1             2  
        "scientists"    "and"         "engineers"          1             2  
        "engineers"     ""            ""                   1             1  
        "scientists"    ""            ""                   1             1  
        "Analyze"       "text"        ""                   2             4  
        "import"        "text"        ""                   2             4  
        "images"        ""            ""                   2             1  
        "Analyze"       "text"        ""                   3             4  
        "MATLAB"        ""            ""                   3             1  
        "images"        ""            ""                   3             1  
        "videos"        ""            ""                   3             1  
    
    

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl
    tbl=12×3 table
                 Keyword              DocumentNumber    Score
        __________________________    ______________    _____
    
        "MATLAB provides tools"             1             8  
        "MATLAB"                            1             2  
        "scientists and engineers"          1             2  
        "engineers"                         1             1  
        "scientists"                        1             1  
        "Analyze text"                      2             4  
        "import text"                       2             4  
        "images"                            2             1  
        "Analyze text"                      3             4  
        "MATLAB"                            3             1  
        "images"                            3             1  
        "videos"                            3             1  
    
    

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the top two keywords using the rakeKeywords function and setting the 'MaxNumKeywords' option to 2.

    tbl = rakeKeywords(documents,'MaxNumKeywords',2)
    tbl=6×3 table
                     Keyword                  DocumentNumber    Score
        __________________________________    ______________    _____
    
        "MATLAB"     "provides"    "tools"          1             8  
        "MATLAB"     ""            ""               1             2  
        "Analyze"    "text"        ""               2             4  
        "import"     "text"        ""               2             4  
        "Analyze"    "text"        ""               3             4  
        "MATLAB"     ""            ""               3             1  
    
    

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl
    tbl=6×3 table
                Keyword            DocumentNumber    Score
        _______________________    ______________    _____
    
        "MATLAB provides tools"          1             8  
        "MATLAB"                         1             2  
        "Analyze text"                   2             4  
        "import text"                    2             4  
        "Analyze text"                   3             4  
        "MATLAB"                         3             1  
    
    

    Input Arguments

    Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
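
    As a minimal sketch of the non-tokenizedDocument input forms described above, the following passes a single document as a row vector of words, first as a string array and then as a cell array of character vectors. The variable names are illustrative, and the word list is made up for the example.

```matlab
% A single document given as a row vector, one word or punctuation mark per element.
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];
tbl = rakeKeywords(words);

% The same document as a cell array of character vectors.
wordsCell = cellstr(words);
tblCell = rakeKeywords(wordsCell);
```

    The punctuation and stop words stay in the input because the default delimiters rely on them.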

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: rakeKeywords(documents,'MaxNumKeywords',20) returns at most 20 keywords per document.

    Maximum number of keywords to return per document, specified as the comma-separated pair consisting of 'MaxNumKeywords' and a positive integer or Inf.

    If MaxNumKeywords is Inf, then the function returns all identified keywords.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Tokens for splitting documents into keywords, specified as the comma-separated pair consisting of 'Delimiters' and a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.

    The default list of delimiters is a list of punctuation characters.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters for merging, use the 'MergingDelimiters' option.

    Delimiter matching is case insensitive.

    Data Types: char | string | cell

    Delimiters also used for merging keywords, specified as the comma-separated pair consisting of 'MergingDelimiters' and a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.

    The default list of merging delimiters is the list of stop words given by the stopWords function.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters that should not be used for merging, use the 'Delimiters' option.

    Delimiter matching is case insensitive.

    Data Types: char | string | cell
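
    The two delimiter options can be combined in one call. The sketch below is an assumption-laden illustration, not a recommended configuration: the punctuation list and the short word list are hand-picked for this one sentence pair.

```matlab
textData = "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers.";
documents = tokenizedDocument(textData);

% Split candidates on these punctuation marks only (never used for merging).
splitOnly = ["." "," ";"];

% Split candidates on these words, and allow merging across them.
splitAndMerge = ["for" "and" "is" "by"];

tbl = rakeKeywords(documents, ...
    'Delimiters',splitOnly, ...
    'MergingDelimiters',splitAndMerge);
```

    Any token that appears in neither list, such as "used" here, is treated as part of a candidate keyword.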

    Output Arguments

    Extracted keywords and scores, returned as a table with the following variables:

    • Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.

    • DocumentNumber – Number of the document containing the corresponding keyword.

    • Score – Score of the keyword.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For more information, see Rapid Automatic Keyword Extraction.
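
    Because the output is an ordinary table, standard table indexing applies. The following sketch, with illustrative variable names, joins each keyword into one string and then selects the keywords of the first document, sorted by score.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."];
documents = tokenizedDocument(textData);
tbl = rakeKeywords(documents);

% Join the words of each keyword along the second dimension, then trim the
% padding left behind by the empty "" entries.
tbl.Keyword = strip(join(tbl.Keyword,2));

% Keep only the rows for document 1 and order them by descending score.
doc1 = tbl(tbl.DocumentNumber == 1,:);
doc1 = sortrows(doc1,'Score','descend');
```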

    More About

    Language Considerations

    The rakeKeywords function supports English, Japanese, German, and Korean text only.

    The rakeKeywords function uses a delimiter-based approach to identify candidate keywords. By default, the delimiters are punctuation characters and the stop words given by the stopWords function, where the language is taken from the language details of the input documents.

    For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.
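
    As a rough sketch of that approach, the following supplies both delimiter lists for a short French sentence passed as a row vector of words. The sentence and both lists are made up for illustration, and the merging list is far from a complete French stop-word list.

```matlab
% One document as a row vector of words (hypothetical example sentence).
words = ["Le" "chat" "noir" "dort" "sur" "le" "canapé" "rouge" "."];

% Hand-written delimiter lists; a real application needs fuller lists.
punctuation = ["." "," ";" "!" "?"];
frenchStopWords = ["le" "la" "les" "un" "une" "et" "sur"];

tbl = rakeKeywords(words, ...
    'Delimiters',punctuation, ...
    'MergingDelimiters',frenchStopWords);
```

    Delimiter matching is case insensitive, so "Le" and "le" are both treated as delimiters here.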

    Tips

    • You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.
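
    One way to compare the two algorithms is to run both on the same documents and inspect the resulting tables side by side. This is only a sketch with illustrative variable names; which result is preferable depends on the data.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."];
documents = tokenizedDocument(textData);

% Delimiter-based candidates (RAKE) versus token-based candidates (TextRank).
tblRake = rakeKeywords(documents);
tblTextRank = textrankKeywords(documents);
```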

    Algorithms

    Rapid Automatic Keyword Extraction

    For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:

    1. Determine candidate keywords:

      • Extract sequences of tokens between the delimiters specified by the 'Delimiters' and 'MergingDelimiters' options. The function treats each sequence as a single candidate keyword.

    2. Calculate scores for the candidate keywords:

      • Create an undirected graph with nodes corresponding to the individual tokens in the candidate keywords.

      • Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences, and weight each edge by the number of candidate keywords containing that co-occurrence.

      • Score each token using the formula deg(token) / freq(token), where deg(token) is the weighted degree of the token (the sum of the weights of its edges, including the self co-occurrence edge) and freq(token) is the number of times that the token occurs in the document.

      • For each candidate keyword, assign a score given by the sum of scores of the contained tokens.

    3. Extract top keywords from candidates:

      • If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.

      • Return the top k keywords, where k is given by the 'MaxNumKeywords' option.
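
    The scoring steps above can be traced by hand for the first document of the examples. The sketch below uses only base MATLAB and is not the implementation of rakeKeywords. It assumes the candidate split written out in the code (inferred from the example output) and takes deg(token) to be the sum of co-occurrence counts, including self co-occurrences. All variable names are illustrative.

```matlab
% Candidate keywords of the first example document, written out by hand.
% Assumption: punctuation and stop words produced exactly this split.
candidates = {["MATLAB" "provides" "tools"], "scientists", "engineers", ...
              "MATLAB", "scientists", "engineers"};

allTokens = [candidates{:}];            % every token occurrence, in order
tokens = unique(allTokens,'stable');    % graph nodes
numTokens = numel(tokens);

% Weighted co-occurrence matrix: C(i,j) is the number of candidate keywords
% containing both token i and token j. The diagonal holds self co-occurrences.
C = zeros(numTokens);
for k = 1:numel(candidates)
    idx = find(ismember(tokens,candidates{k}));
    C(idx,idx) = C(idx,idx) + 1;
end

deg = sum(C,2)';                                     % weighted degree per token
freq = arrayfun(@(t) sum(allTokens == t),tokens);    % occurrences per token
tokenScore = deg ./ freq;

% Candidate score = sum of the scores of the tokens it contains.
candidateScore = cellfun(@(c) sum(tokenScore(ismember(tokens,c))),candidates);

% "scientists" and "engineers" are separated by the merging delimiter "and"
% in both sentences, so they merge and their scores are summed.
mergedScore = candidateScore(2) + candidateScore(3);
```

    Under these assumptions, candidateScore(1) is 8 for "MATLAB provides tools", the standalone "MATLAB" candidate scores 2, and mergedScore is 2 for "scientists and engineers", which agrees with the scores in the first example table.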

    Language Details

    tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' name-value pair argument of tokenizedDocument. To view the token details, use the tokenDetails function.
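
    A short sketch of setting and inspecting the language details before extracting keywords follows. The text and variable names are illustrative.

```matlab
textData = "MATLAB provides tools for scientists and engineers.";

% Set the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(textData,'Language','en');

% Inspect the per-token details, including the Language variable.
tdetails = tokenDetails(documents);
head(tdetails)

% rakeKeywords picks its default stop words from these language details.
tbl = rakeKeywords(documents);
```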

    References

    [1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.

    Introduced in R2020b