mmrScores

Document scoring with Maximal Marginal Relevance (MMR) algorithm

Syntax

scores = mmrScores(documents,queries)

scores = mmrScores(bag,queries)

scores = mmrScores(___,lambda)

Description

scores = mmrScores(documents,queries) scores documents according to their relevance to queries while avoiding redundancy, using the MMR algorithm. The score in scores(i,j) is the MMR score of documents(i) relative to queries(j).

scores = mmrScores(bag,queries) scores documents encoded by the bag-of-words or bag-of-n-grams model bag relative to queries. The score in scores(i,j) is the MMR score of the ith document in bag relative to queries(j).

scores = mmrScores(___,lambda) also specifies the trade-off between relevance and redundancy.

Examples

Relevance to Query

Create an array of input documents.

str = [
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"];
documents = tokenizedDocument(str)

documents = 
  4x1 tokenizedDocument:

    9 tokens: the quick brown fox jumped over the lazy dog
    8 tokens: the fast fox jumped over the lazy dog
    7 tokens: the dog sat there and did nothing
    6 tokens: the other animals sat there watching

Create an array of query documents.

str = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
queries = tokenizedDocument(str)

queries = 
  2x1 tokenizedDocument:

    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

Calculate MMR scores using the mmrScores function. The output is a sparse matrix.

scores = mmrScores(documents,queries);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")

Higher scores correspond to stronger relevance to the query documents.
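As a follow-up to the example above, the scores can also be used to rank the input documents for a given query. The snippet below is a minimal sketch that rebuilds the same inputs and sorts the documents by their score for the first query. The variable names sortedScores, idx, and rankedDocuments are illustrative and not part of the original example.

```matlab
% Rebuild the inputs from the example above so this snippet runs on its own.
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

scores = mmrScores(documents,queries);

% Convert the sparse column for the first query to a full vector,
% then sort it from highest to lowest score.
[sortedScores,idx] = sort(full(scores(:,1)),"descend");

% List the input documents in ranked order for the first query.
rankedDocuments = documents(idx)
```

Converting with full is optional here. It only makes the sorted scores easier to inspect.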

Relevance Versus Redundancy

Create an array of input documents.

str = [
    "the quick brown fox jumped over the lazy dog"
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"
    "the other animals sat there watching"];
documents = tokenizedDocument(str);

Create a bag-of-words model from the input documents.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [6x17 double]
      Vocabulary: [1x17 string]
        NumWords: 17
    NumDocuments: 6

Create an array of query documents.

str = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
queries = tokenizedDocument(str)

queries = 
  2x1 tokenizedDocument:

    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

Calculate the MMR scores. The output is a sparse matrix.

scores = mmrScores(bag,queries);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")

Now calculate the scores again, and set the lambda value to 0.01. When the lambda value is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores.

lambda = 0.01;
scores = mmrScores(bag,queries,lambda);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)

Finally, calculate the scores again and set the lambda value to 1. When the lambda value is 1, query-relevant documents yield higher scores even when they are redundant with other high-scoring documents.

lambda = 1;
scores = mmrScores(bag,queries,lambda);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)
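To compare the effect of lambda numerically rather than visually, the scores for several lambda values can be collected into one array. The following is a sketch under stated assumptions: the lambda values and the names lambdas, allScores, and duplicateScores are illustrative choices, not part of the original example.

```matlab
% Rebuild the bag-of-words model and queries from the example above.
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"
    "the other animals sat there watching"]);
bag = bagOfWords(documents);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

% Illustrative lambda values (an assumption; choose values that suit your data).
lambdas = [0.01 0.3 1];

% Preallocate a documents-by-queries-by-lambda array.
allScores = zeros(bag.NumDocuments,numel(queries),numel(lambdas));
for k = 1:numel(lambdas)
    allScores(:,:,k) = full(mmrScores(bag,queries,lambdas(k)));
end

% Scores of the two identical documents (rows 1 and 2) for the first query,
% with one column per lambda value.
duplicateScores = squeeze(allScores(1:2,1,:))
```

Reading across the columns of duplicateScores shows how the scores of the duplicated document change as lambda moves from favoring diversity toward favoring relevance.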

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.

`bag` — Input model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.
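As a brief illustration of the second case, the sketch below scores documents encoded by a bag-of-n-grams model. It assumes the default behavior of bagOfNgrams (bigrams) and reuses the sentences from the examples on this page. The resulting score values are not shown because they depend on the n-gram encoding.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

% Create a bag-of-n-grams model. By default, bagOfNgrams counts bigrams.
bag = bagOfNgrams(documents);

% Score the encoded documents against the queries. Each n-gram is
% treated as a single word.
scores = mmrScores(bag,queries);
```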

`queries` — Set of query documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Set of query documents, specified as one of the following:

A tokenizedDocument array
A 1-by-N string array representing a single document, where each element is a word
A 1-by-N cell array of character vectors representing a single document, where each element is a word

To compute term frequency and inverse document frequency statistics, the function encodes queries using a bag-of-words model. The model depends on the syntax you use. If you specify the input argument documents, then the function uses bagOfWords(documents). If you specify bag, then the function encodes queries using bag and then uses the resulting tf-idf matrix.
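For example, a single query can be passed as a 1-by-N string array of words instead of a tokenizedDocument array. This is a minimal sketch using the sentences from the examples on this page. The variable name query is illustrative.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);

% A single query document, where each element of the row vector is a word.
query = ["a" "brown" "fox" "leaped" "over" "the" "lazy" "dog"];

% With one query, scores has one column: one score per input document.
scores = mmrScores(documents,query);
```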

`lambda` — Trade-off between relevance and redundancy
0.3 (default) | nonnegative scalar

Trade-off between relevance and redundancy, specified as a nonnegative scalar.

When lambda is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores. If lambda is 1, then query-relevant documents yield higher scores even when they are redundant with other high-scoring documents.
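For background, the selection criterion in the original MMR paper [1] makes the role of lambda explicit. Here Q is the query, R is the set of candidate documents, S is the set of documents already selected, and Sim_1 and Sim_2 are similarity measures. This is the formulation from the paper. The exact similarity measures and normalization that mmrScores applies are not stated on this page, so treat the formula as orientation rather than as the function's implementation.

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Bigl[ \lambda \, \mathrm{Sim}_1(D_i, Q)
       - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Bigr]
```

With lambda equal to 1, the second term vanishes and only relevance to the query counts. As lambda approaches 0, the similarity penalty dominates and redundant documents are scored lower.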

Output Arguments

`scores` — MMR scores
matrix

MMR scores, returned as an N1-by-N2 matrix, where scores(i,j) is the MMR score of documents(i) relative to the jth query document, and N1 and N2 are the number of input and query documents, respectively.

A document has a high MMR score if it is both relevant to the query and has minimal similarity relative to the other documents.

References

[1] Carbonell, Jaime G., and Jade Goldstein. "The use of MMR, diversity-based reranking for reordering documents and producing summaries." In SIGIR, vol. 98, pp. 335-336. 1998.
