
Full text search by summaries

Is it possible to create a summary of a large document using an out-of-the-box search engine like Lucene, Solr or Sphinx, and then search for the documents most relevant to a query?

I don't need to search inside the document or create a snippet. Just get 5 documents best matching the query.

Update. More specifically, I don't want the engine to keep the whole document, but only its "summary" (you may call it index information or a TF-IDF representation).

Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a big article on Wikipedia and plenty of implementations available in NLTK and elsewhere). However, it will not help you with querying: you will still need to index the result somewhere.

I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr and Elasticsearch. The idea behind it is that if you send a query (which is the raw text of a document), the search engine will find the most suitable documents by extracting the most relevant words from it (which reminds me of summarization) and then looking in the inverted index for the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.

References for MLT in Elasticsearch, Lucene, Solr
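For illustration, here is a minimal sketch of the Lucene MoreLikeThis API in Java; the index path "index" and the field name "body" are assumptions made up for this example, and Solr and Elasticsearch expose the same idea through their MLT handler and MLT query.

    import java.io.StringReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class MoreLikeThisSketch {
        public static void main(String[] args) throws Exception {
            // "index" is a placeholder path to an existing Lucene index with a "body" field.
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] {"body"}); // field(s) to mine for "interesting" terms
                mlt.setMinTermFreq(1);
                mlt.setMinDocFreq(1);

                // Raw text of the query document: MLT extracts its top TF-IDF terms
                // and turns them into a disjunction query.
                String documentText = "full text of the document you want matches for ...";
                Query query = mlt.like("body", new StringReader(documentText));

                TopDocs top5 = searcher.search(query, 5); // the 5 most similar documents
                System.out.println("hits: " + top5.totalHits);
            }
        }
    }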

but only its "summary" (you may call it index information or a TF-IDF representation).

What you are looking for seems quite standard:

  • Apache Lucene [1], if you are looking for a library
  • Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.

The way a Lucene-based search engine works [2] is by building an inverted index of each field in your document (plus a set of additional data structures required by other features).

What you apparently don't want to do is store the content of a field, which means taking the text content and keeping it in full (compressed) in the index so it can be retrieved later.

In Lucene and Solr this is a matter of configuration.
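As a rough sketch of that configuration with the Lucene library (the field name and text below are placeholders), a field can be made searchable while its raw content is not kept in the index:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.IndexOptions;

    public class IndexedNotStoredSketch {
        public static void main(String[] args) {
            // Placeholder field type: tokenized and indexed (so it is searchable),
            // but setStored(false) means the original text is not kept in the index.
            FieldType indexedOnly = new FieldType();
            indexedOnly.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            indexedOnly.setTokenized(true);
            indexedOnly.setStored(false);
            indexedOnly.freeze();

            Document doc = new Document();
            doc.add(new Field("content", "full text of the large document ...", indexedOnly));
            System.out.println(doc);
        }
    }

In Solr the same switch is the stored="false" attribute on the field definition in the schema.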

Summarisation is a completely different NLP task and is probably not what you need.

Cheers

[1] http://lucene.apache.org/index.html

[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/

Update. More specifically, I don't want the engine to keep the whole document, but only its "summary" (you may call it index information or a TF-IDF representation).

To answer your updated question: Lucene/Solr fits your needs. For the 'summary', you have the option of not storing the original text by specifying:

 org.apache.lucene.document.Field.Store.NO

By saving the 'summary' as an org.apache.lucene.document.TextField, the summary will be indexed and tokenized. The index will keep the TF-IDF information for you to search.
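A minimal sketch of that setup, assuming a recent Lucene version, a field called "summary" and an in-memory directory used just for illustration: the field can be searched through the inverted index, but reading its stored value back returns null because the text was never stored.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class SummaryFieldSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory index, just for the sketch

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Indexed and tokenized, but the raw text is NOT kept in the index.
                doc.add(new TextField("summary", "lucene keeps an inverted index of the terms", Field.Store.NO));
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("summary", "inverted")), 5);
                System.out.println("matches: " + hits.totalHits);
                // The match is found via the inverted index, but the stored value is null:
                System.out.println("stored text: " + searcher.doc(hits.scoreDocs[0].doc).get("summary"));
            }
        }
    }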
