
Indexing full-text and descriptive metadata in Solr

I have a small set of descriptive metadata records (~50), and for each of them a corresponding full-text file (.txt). My understanding is that the Apache Tika framework is used for detecting and extracting metadata and structured text from various types of documents. However, I would also need to implement a linkage mechanism whereby a given metadata record is matched to its full text. Can this be done in Solr?

Thanks,

Ilaria

If you have metadata and the document content, you can index the metadata and store the content. Your field definitions would look something like this:

<field name="filename" type="text" indexed="true" stored="true"/>
... <!-- other metadata /-->
<field name="content" type="text" indexed="false" stored="true"/>

This will allow you to search by any metadata field and get the content back. You can add as many metadata fields as you need for searching. I wouldn't index the full text, as there is already some structured metadata available.

Apache Tika extracts metadata from HTML pages and other document formats. Since you already have the metadata available, you need not use Tika. Besides, plain text files leave little for Tika to extract, since they carry no markup or embedded metadata.

Edit 1:

OK, so the link between the metadata and the content will be maintained in Solr. For example, if you have

File1.txt <-> Metadata1.txt

You could have one record (document) in Solr that has one field per metadata item plus one plaintextcontent field. This gives you the flexibility to look up the document by any metadata. For example,

q=filename:File1.txt

or

q=filesize:[1 TO 100]

where filename and filesize are example metadata fields. plaintextcontent would be your text file content, so the link is captured in your Solr schema itself.
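As a rough sketch of how these lookups might be sent from a client, the snippet below builds the two select URLs with Python's standard library. The host, port, and the core name "docs" are assumptions made for illustration, as is the choice of fields returned via fl. Note that Solr range queries expect an uppercase TO, and that the colon, brackets, and spaces need URL encoding.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr instance and a core named "docs".
SOLR_SELECT = "http://localhost:8983/solr/docs/select"


def select_url(query: str, fields: str = "filename,content") -> str:
    """Build a select URL; urlencode escapes ':', '[', ']' and spaces."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "fl": fields})


# Look up a document by an exact metadata value.
by_name = select_url("filename:File1.txt")

# Look up documents by a numeric range (note the uppercase TO).
by_size = select_url("filesize:[1 TO 100]")

print(by_name)
print(by_size)
```

Either URL could then be fetched with any HTTP client; the response would carry the stored content field alongside the metadata for each match.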

Now the trick is to set up the indexing. Here's one way to do it:

Indexing the text file is very simple. You could use the DataImportHandler's PlainTextEntityProcessor.

Indexing the metadata along with it could be slightly tricky (you need to understand the structure of the metadata). You could use LineEntityProcessor or any one of the DataImportHandler Transformers, depending on what suits you best.
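If the pairing is easier to do on the client side, the same one-document-per-file idea could be sketched as below. This is not the DataImportHandler approach described above but a plain-Python alternative. It assumes, purely for illustration, that each metadata file holds "key: value" lines, since the real structure is not given. The helper names parse_metadata and build_doc are hypothetical, and the field name content follows the schema snippet earlier.

```python
import json


def parse_metadata(raw: str) -> dict:
    """Parse 'key: value' lines (an assumed format) into a dict."""
    fields = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def build_doc(metadata_raw: str, full_text: str) -> dict:
    """Merge metadata fields and the full text into one Solr document."""
    doc = parse_metadata(metadata_raw)
    doc["content"] = full_text  # stored field from the schema above
    return doc


# Stand-ins for the contents of Metadata1.txt and File1.txt.
metadata_raw = "filename: File1.txt\nfilesize: 42\n"
doc = build_doc(metadata_raw, "Full text of File1 goes here.")

# A JSON array of documents is one of the shapes Solr's update handler accepts.
payload = json.dumps([doc])
print(payload)
```

The payload could then be posted to the core's update handler with a JSON content type. Because metadata and text travel in the same document, no separate linkage table is needed.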
