The MoreLikeThis
search component enables users to query for documents similar to a document in their result list.
It does this by using terms from the original document to find similar documents in the index.
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a “similar documents” link).
The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
How MoreLikeThis Works
MoreLikeThis
constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the mlt.fl
parameter, below). For best results, the fields should have stored term vectors in schema.xml
. For example:
<field name="cat" ... termVectors="true" />
If term vectors are not stored, MoreLikeThis
will generate terms from stored fields. A uniqueKey
must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf
parameter, below) and a new document set is returned.
Common Parameters for MoreLikeThis
The table below summarizes the MoreLikeThis
parameters supported by Lucene/Solr. These parameters can be used with any of the three possible MoreLikeThis approaches.mlt.fl
Specifies the fields to use for similarity. If possible, these should have stored termVectors
.mlt.mintf
Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.mlt.mindf
Specifies the Minimum Document Frequency, the frequency at which words will be ignored which do not occur in at least this many documents.mlt.maxdf
Specifies the Maximum Document Frequency, the frequency at which words will be ignored which occur in more than this many documents.mlt.maxdfpct
Specifies the Maximum Document Frequency using a relative ratio to the number of documents in the index. The argument must be an integer between 0 and 100. For example 75 means the word will be ignored if it occurs in more than 75 percent of the documents in the index.mlt.minwl
Sets the minimum word length below which words will be ignored.mlt.maxwl
Sets the maximum word length above which words will be ignored.mlt.maxqt
Sets the maximum number of query terms that will be included in any generated query.mlt.maxntp
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.mlt.boost
Specifies if the query will be boosted by the interesting term relevance. It can be either “true” or “false”.mlt.qf
Query fields and their boosts using the same format as that used by the DisMax Query Parser. These fields must also be specified in mlt.fl
.
Parameters for the MoreLikeThisComponent
Using MoreLikeThis as a search component returns similar documents for each document in the response set. In addition to the common parameters, these additional options are available:mlt
If set to true
, activates the MoreLikeThis
component and enables Solr to return MoreLikeThis
results.mlt.count
Specifies the number of similar documents to be returned for each result. The default value is 5.mlt.interestingTerms
Same as defined below for the MLT Handler.
Parameters for the MoreLikeThisHandler
The table below summarizes parameters accessible through the MoreLikeThisHandler
. It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.mlt.match.include
Specifies whether or not the response should include the matched document. If set to false, the response will look like a normal select response.mlt.match.offset
Specifies an offset into the main query search results to locate the document on which the MoreLikeThis
query should operate. By default, the query operates on the first result for the q parameter.mlt.interestingTerms
Controls how the MoreLikeThis
component presents the “interesting” terms (the top TF/IDF terms) for the query. It supports three settings: The setting list
lists the terms. The setting none
lists no terms. The setting details
lists the terms along with the boost value used for each term. Unless mlt.boost=true
, all terms will have boost=1.0
.
MoreLikeThis Query Parser
The mlt
query parser provides a mechanism to retrieve documents similar to a given document, like the handler. More information on the usage of the mlt query parser can be found in the section Other Parsers.
This is it for today, stay tuned for more informative posts in future.