
Apache Solr - Search Text in File

Originally posted 2017-09-15 08:54:04. Tags: apache, solr, lucene, text-search

I'm looking into Solr to fulfill the specific requirement below.

Requirements:

There will be a folder named "X" containing thousands of XML-structured files. I want to search for one term (e.g. "Hello World") and, as a result, get the number of files that contain "Hello World".

Can we achieve this using Solr? If yes, can anyone give me a short guide on how to do it?

Note: The XML files could be in any format, e.g. ( https://i.stack.imgur.com/wNPTW.png )

Question: Is the structure shown in "wNPTW.png" valid for Solr to search text, or must we depend on a Solr-specific document structure, e.g. ( https://i.stack.imgur.com/sqn5q.png )?

In addition, performance is my primary requirement.

Please suggest how I can move ahead on this. If there is any other technology available, kindly suggest that as well.

Looking forward to hearing from you guys :)

1 Answer

Yes.

If the XML format is more or less identical across all documents, you can use the Data Import Handler to configure a mapping (using xpath) from nodes to fields. If the XML files aren't well defined, you can use the same mechanism to map almost any XML field to a common Solr field.
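To illustrate the idea of an xpath-to-field mapping, here is a minimal Python sketch. It is not Data Import Handler configuration; it only mimics the mapping step with the standard library. The element names (`title`, `body/para`) and the Solr field names (`title`, `content`) are assumptions made up for this example, since the actual structure is only shown in the linked screenshots.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XPath expressions to Solr field names.
# Several paths may point at the same field to fold loosely defined
# XML into one common, searchable field.
FIELD_MAP = {
    "./title": "title",
    "./body/para": "content",
}


def map_xml_to_fields(xml_text, field_map=FIELD_MAP):
    """Return a dict of field name -> list of text values found via XPath."""
    root = ET.fromstring(xml_text)
    doc = {}
    for xpath, field in field_map.items():
        values = [(node.text or "").strip() for node in root.findall(xpath)]
        values = [v for v in values if v]  # drop empty nodes
        if values:
            doc.setdefault(field, []).extend(values)
    return doc


sample = (
    "<record><title>Hello World</title>"
    "<body><para>first</para><para>second</para></body></record>"
)
print(map_xml_to_fields(sample))
# {'title': ['Hello World'], 'content': ['first', 'second']}
```

Note that `ElementTree` supports only a subset of XPath, so more complex expressions would need a different library.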

Another option is to use the built-in support for Apache Tika to parse the files, extract their data into a content field, and search against that.
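As a rough sketch of the Tika route: Solr exposes Tika through the extracting request handler, which is conventionally mounted at `/update/extract`. The Python below posts one file to it. The base URL, the core name `files`, and the use of the file path as the document id are assumptions for illustration. The handler must also be enabled in your Solr configuration, and the field the extracted text ends up in depends on that configuration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, core, doc_id, commit=True):
    """Build the /update/extract URL for one document.

    literal.id sets the document's id field; commit makes it searchable
    right away (for bulk loads, committing once at the end is cheaper).
    """
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


def post_file(path, url):
    """Send the raw file bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body,
                  headers={"Content-Type": "application/xml"}, method="POST")
    with urlopen(req) as resp:
        return resp.status


# Assumed local Solr instance and core name; adjust to your setup.
url = build_extract_url("http://localhost:8983/solr", "files", "X/a.xml")
print(url)
# post_file("X/a.xml", url)  # requires a running Solr with the handler enabled
```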

If you require more specific handling of the files, writing a small indexer and performing the required transformation in that layer is probably the easiest path ahead.
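A small indexer of that kind might look like the sketch below. It flattens every XML file in a folder into a single text field, sends the documents to Solr's JSON update endpoint, and then asks Solr how many documents match a phrase (`rows=0` returns only the count, in `numFound`). The core URL and the field names `id` and `content` are assumptions; the schema would need a tokenized text field named `content` for the phrase query to work as intended.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_CORE_URL = "http://localhost:8983/solr/files"  # assumed core URL


def extract_text(xml_text):
    """Flatten all text nodes of an XML document into one string."""
    root = ET.fromstring(xml_text)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def xml_to_doc(path):
    """Build one Solr document per file; the path doubles as the id."""
    text = extract_text(Path(path).read_text(encoding="utf-8"))
    return {"id": str(path), "content": text}


def index_folder(folder, core_url=SOLR_CORE_URL):
    """Index every *.xml file in the folder in one batch; return the count."""
    docs = [xml_to_doc(p) for p in sorted(Path(folder).glob("*.xml"))]
    req = Request(f"{core_url}/update?commit=true",
                  data=json.dumps(docs).encode("utf-8"),
                  headers={"Content-Type": "application/json"},
                  method="POST")
    with urlopen(req):
        return len(docs)


def count_query_url(phrase, core_url=SOLR_CORE_URL):
    """URL for a phrase query returning only the hit count.

    Simplification: the phrase is assumed not to contain double quotes.
    """
    params = {"q": f'content:"{phrase}"', "rows": 0, "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


def count_matching_files(phrase, core_url=SOLR_CORE_URL):
    """Return the number of indexed files whose content matches the phrase."""
    with urlopen(count_query_url(phrase, core_url)) as resp:
        return json.load(resp)["response"]["numFound"]


# Usage, assuming a running Solr and a folder named "X":
# index_folder("X")
# print(count_matching_files("Hello World"))
```

Since one document is created per file, the `numFound` value corresponds to the number of matching files, which is what the question asks for. For very large folders, sending the documents in several smaller batches would keep memory use down.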