Extract summary from documents
[summary,scores] = extractSummary(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
Create an array of tokenized documents.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
Extract a summary of the documents using the extractSummary function. By default, the function chooses 1/10 of the input documents, rounding up.
summary = extractSummary(documents)
summary =
  tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
To specify a larger summary, use the 'SummarySize' option. Extract a three-document summary.
summary = extractSummary(documents,'SummarySize',3)
summary =
  3x1 tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
    7 tokens: The fox jumped over the dog .
    9 tokens: There seem to be animals jumping other animals .
Create an array of tokenized documents.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
Extract a three-document summary. The second output, scores, contains the importance scores of the summary documents.
[summary,scores] = extractSummary(documents,'SummarySize',3)
summary =
  3x1 tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
   10 tokens: There seem to be animals jumping over other animals .
    7 tokens: The fox jumped over the dog .
scores = 3×1
0.2426
0.2174
0.1911
Visualize the scores in a bar chart.
figure
bar(scores)
xlabel("Summary Document")
ylabel("Score")
title("Summary Document Importance")
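The bars in this chart are identified only by index. As an optional sketch, the tick labels can be replaced with the text of each summary document by converting the summary back to strings with joinWords. The variable name labels and the 20-degree tick angle are arbitrary choices for illustration.

```matlab
% Recreate the example documents so this block runs on its own.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
[summary,scores] = extractSummary(documents,'SummarySize',3);

% One string per summary document, used as a tick label.
labels = joinWords(summary);

figure
bar(scores)
xticks(1:numel(scores))
xticklabels(labels)
xtickangle(20)      % arbitrary angle so long labels stay readable
ylabel("Score")
title("Summary Document Importance")
```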
To summarize a single document, split the document into an array of sentences, and use the extractSummary function.
Create a string scalar containing the document.
str = ...
    "There is a quick fox. The fox is brown. There is a dog which " + ...
    "is lazy. The dog is very lazy. The fox jumped over the dog. " + ...
    "The quick brown fox jumped over the lazy dog.";
Split the string into sentences using the splitSentences function.
str = splitSentences(str)
str = 6x1 string
"There is a quick fox."
"The fox is brown."
"There is a dog which is lazy."
"The dog is very lazy."
"The fox jumped over the dog."
"The quick brown fox jumped over the lazy dog."
Create a tokenized document array containing the sentences.
documents = tokenizedDocument(str)
documents =
  6x1 tokenizedDocument:
    6 tokens: There is a quick fox .
    5 tokens: The fox is brown .
    8 tokens: There is a dog which is lazy .
    6 tokens: The dog is very lazy .
    7 tokens: The fox jumped over the dog .
   10 tokens: The quick brown fox jumped over the lazy dog .
Extract a summary from the sentences using the extractSummary function. To return a summary with three documents, set the 'SummarySize' option to 3. To ensure the summary documents appear in the same order as the input documents, set the 'OrderBy' option to 'position'.
summary = extractSummary(documents,'SummarySize',3,'OrderBy','position')
summary =
  3x1 tokenizedDocument:
    6 tokens: There is a quick fox .
    7 tokens: The fox jumped over the dog .
   10 tokens: The quick brown fox jumped over the lazy dog .
To reconstruct the sentences into a single document, convert the documents to strings using the joinWords function and join the sentences using the join function.
sentences = joinWords(summary);
summaryStr = join(sentences)
summaryStr = "There is a quick fox . The fox jumped over the dog . The quick brown fox jumped over the lazy dog ."
To remove the spaces surrounding the punctuation characters, use the replace function.
punctuationRight = ["." "," "’" ")" ":" "?" "!"];
summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);
punctuationLeft = ["(" "‘"];
summaryStr = replace(summaryStr,punctuationLeft + " ",punctuationLeft)
summaryStr = "There is a quick fox. The fox jumped over the dog. The quick brown fox jumped over the lazy dog."
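The steps above can be collected into one helper. The following is a minimal sketch rather than a toolbox function: summarizeText and numSentences are hypothetical names, and the punctuation lists are simply the ones used in this example, so other punctuation would need to be added by hand.

```matlab
% Example call with the same text as above.
str = ...
    "There is a quick fox. The fox is brown. There is a dog which " + ...
    "is lazy. The dog is very lazy. The fox jumped over the dog. " + ...
    "The quick brown fox jumped over the lazy dog.";
summaryStr = summarizeText(str,3)

function summaryStr = summarizeText(str,numSentences)
% summarizeText is a hypothetical helper, not a toolbox function.
% It returns an extractive summary of the string scalar str with at
% most numSentences sentences, kept in their original order.

    % Split into sentences and tokenize each sentence as one document.
    documents = tokenizedDocument(splitSentences(str));

    % Keep the input order so the summary reads naturally.
    summary = extractSummary(documents, ...
        'SummarySize',numSentences,'OrderBy','position');

    % Rebuild one string; joinWords puts a space between all tokens.
    summaryStr = join(joinWords(summary));

    % Remove the spaces that joinWords leaves around punctuation.
    punctuationRight = ["." "," "’" ")" ":" "?" "!"];
    summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);
    punctuationLeft = ["(" "‘"];
    summaryStr = replace(summaryStr,punctuationLeft + " ",punctuationLeft);
end
```

Because the local function sits at the end of the file, the block can be saved and run as a single script.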
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: extractSummary(documents,'ScoringMethod','lexrank') extracts a summary from documents and sets the scoring method option to 'lexrank'.
'ScoringMethod' — Scoring method
'textrank' (default) | 'lexrank' | 'mmr'
Scoring method used for extractive summarization, specified as the comma-separated pair consisting of 'ScoringMethod' and one of the following:
'textrank' – Use the TextRank algorithm.
'lexrank' – Use the LexRank algorithm.
'mmr' – Use the MMR algorithm.
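As a quick way to compare scoring methods, the sketch below runs extractSummary on the example documents with 'textrank' and 'lexrank' and displays each summary with its scores. The loop and the variable name scoringMethods are illustrative only. 'mmr' is left out here because it is the only method affected by the 'Query' option, described next.

```matlab
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Methods to compare (illustrative selection).
scoringMethods = {'textrank','lexrank'};

for k = 1:numel(scoringMethods)
    [summary,scores] = extractSummary(documents, ...
        'ScoringMethod',scoringMethods{k}, ...
        'SummarySize',2);

    fprintf("Scoring method: %s\n",scoringMethods{k});
    disp(joinWords(summary))   % summary documents as strings
    disp(scores)               % one score per summary document
end
```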
'Query' — Query document for MMR scoring
tokenizedDocument scalar | string array | cell array of character vectors
Query document for MMR scoring, specified as the comma-separated pair consisting of 'Query' and a tokenizedDocument scalar, a string array of words, or a cell array of character vectors. If 'Query' is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.
This option has an effect only when 'ScoringMethod' is 'mmr'.
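A hedged sketch of MMR scoring with a query follows. The query words "fox" and "dog" are arbitrary choices for illustration, and the variable names query and queryDoc are not part of the function. No output is shown because the selected documents depend on the query.

```matlab
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Query given as a row vector of words, one word per element.
query = ["fox" "dog"];

[summary,scores] = extractSummary(documents, ...
    'ScoringMethod','mmr', ...
    'Query',query, ...
    'SummarySize',2);

% The same query words expressed as a tokenizedDocument scalar.
queryDoc = tokenizedDocument("fox dog");
summaryFromDoc = extractSummary(documents, ...
    'ScoringMethod','mmr', ...
    'Query',queryDoc, ...
    'SummarySize',2);
```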
'SummarySize' — Size of summary
1/10 (default) | scalar in the range (0,1) | positive integer | Inf
Size of summary, specified as the comma-separated pair consisting of 'SummarySize' and one of the following:
Scalar in the range (0,1) – Extract the specified proportion of input documents, rounding up. In this case, the number of summary documents is ceil(SummarySize*numDocuments), where numDocuments is the number of input documents.
Positive integer – Extract a summary with the specified number of documents. If SummarySize is greater than or equal to the number of input documents, then the function returns the input documents sorted according to the 'OrderBy' option.
Inf – Return the input documents sorted according to the 'OrderBy' option.
Data Types: double
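To make the rounding rule concrete, here is the arithmetic for the five-document example used earlier. It uses only ceil; the variable names are illustrative.

```matlab
numDocuments = 5;

% Proportion of 1/10 (the default): 0.5 rounds up to 1 document.
numDefault = ceil((1/10)*numDocuments)   % 1

% Proportion of 0.5: 2.5 rounds up to 3 documents.
numHalf = ceil(0.5*numDocuments)         % 3
```

This matches the first example, where the default summary of five documents contains a single document.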
'OrderBy' — Order of documents in summary
'score' (default) | 'position'
Order of documents in summary, specified as the comma-separated pair consisting of 'OrderBy' and one of the following:
'score' – Order documents by their score according to the 'ScoringMethod' option.
'position' – Maintain the document order from the input.
summary — Extracted summary
tokenizedDocument array
Extracted summary, returned as a tokenizedDocument array. The summary is a subset of documents, and is sorted according to the 'OrderBy' option.
scores — Summary document scores
Summary document scores, returned as a vector, where scores(i) is the score of the ith summary document according to the 'ScoringMethod' option. The scores are sorted according to the 'OrderBy' option.
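As an illustration of how the two outputs line up, the sketch below requests both orderings for the example documents and shows each summary document next to its score. The variable names and the table layout are illustrative. The final check assumes that ordering by score means descending order, as in the example output earlier on this page.

```matlab
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Default 'OrderBy' value is 'score'.
[summaryByScore,scoresByScore] = extractSummary(documents,'SummarySize',3);

% Same summary size, but keep the input order.
[summaryByPos,scoresByPos] = extractSummary(documents, ...
    'SummarySize',3,'OrderBy','position');

% scores(i) belongs to the i-th summary document in both cases.
table(joinWords(summaryByScore),scoresByScore, ...
    'VariableNames',{'Document','Score'})
table(joinWords(summaryByPos),scoresByPos, ...
    'VariableNames',{'Document','Score'})

% Assumption: scores decrease when ordering by score.
issorted(scoresByScore,'descend')
```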
bleuEvaluationScore
| bm25Similarity
| cosineSimilarity
| lexrankScores
| mmrScores
| rakeKeywords
| rougeEvaluationScore
| textrankKeywords
| textrankScores
| tokenizedDocument