Can we index data from two different formats, i.e. CSV and text, in a single Solr core?

I have data in two formats, CSV and text.

1) The CSV file contains metadata, i.e. ModifyScore, Size, fileName, etc.

2) The actual text is in Text folders containing files like a.txt, b.txt, etc.

Is it possible to index such data in Solr in a single core, through DIH or some other way?

For your use case I would write a custom indexing application. It looks like you want to build each Solr document by taking some fields from the CSV and another field (the content) from the TXT files.

Using Java, for example, this is quite simple: with SolrJ you can read the data from the CSV and TXT files, build each Solr document, and then index it.
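The same flow can be sketched without SolrJ. The following is only a rough illustration in Python using the standard library, not the SolrJ code the answer has in mind. It assumes the CSV has a header row, that its `fileName` column names a file inside the text folder, and that the core has `id` and `content` fields; the function names `build_docs` and `post_docs` are made up for this example.

```python
import csv
import json
import urllib.request
from pathlib import Path


def build_docs(csv_path, text_dir):
    """Join each CSV metadata row with the text file named in its fileName column."""
    docs = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text_file = Path(text_dir) / row["fileName"]
            if not text_file.is_file():
                continue  # skip metadata rows that have no matching text file
            doc = dict(row)                  # metadata fields from the CSV
            doc["id"] = row["fileName"]      # assumption: file name is the unique key
            doc["content"] = text_file.read_text(encoding="utf-8")
            docs.append(doc)
    return docs


def post_docs(docs, update_url):
    """Send the documents as a JSON array to Solr's update handler and commit."""
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a core named `mycore` on a local Solr (both assumptions), a call could look like `post_docs(build_docs("metadata.csv", "Text"), "http://localhost:8983/solr/mycore/update")`.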

I would use the DIH only if I could move the data into a database (even two tables are fine, as DIH supports joins). Out of the box, you may be interested in the ScriptTransformer [1]. Using it in combination with your different data sources could work, but you will need to experiment with it, as it is not a direct solution to your problem.

[1] https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-TheScriptTransformer

Just to mention a couple more possibilities:

  1. Use DIH to index the txt files into collectionA, and use the /update handler to ingest the CSV directly into collectionB, then use Streaming Expressions to merge both into a third collection, which is the one you keep. The main advantage is that everything stays in Solr, with no external code.

  2. Use DIH to index the files (or /update to index the CSV) and write an Update Request Processor that intercepts docs before they are indexed, looks up the info from the other source, and adds it to the doc.
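For the first option, the merge step might look roughly like the sketch below. It is written in Python only to build the expression and send it to the `/stream` handler, and it rests on several assumptions: Solr runs in SolrCloud mode, both source collections use the file name as `id`, the text lives in a field called `content`, and `max_rows` is a guessed upper bound on the number of documents. The helper names are made up for this example, and the expression would need adjusting to the real schema.

```python
import urllib.parse
import urllib.request


def build_merge_expression(text_collection, csv_collection, target_collection,
                           max_rows=100000):
    """Build a streaming expression that joins two collections on id into a third."""
    # innerJoin expects both streams sorted on the join key, hence sort="id asc".
    text_stream = (
        f'search({text_collection}, q="*:*", fl="id,content", '
        f'sort="id asc", qt="/select", rows={max_rows})'
    )
    csv_stream = (
        f'search({csv_collection}, q="*:*", fl="id,ModifyScore,Size,fileName", '
        f'sort="id asc", qt="/select", rows={max_rows})'
    )
    # update() writes the joined tuples to the target; commit() makes them visible.
    return (
        f"commit({target_collection}, update({target_collection}, batchSize=500, "
        f'innerJoin({text_stream}, {csv_stream}, on="id")))'
    )


def run_expression(expression, stream_url):
    """POST the expression as the expr form parameter to a /stream endpoint."""
    body = urllib.parse.urlencode({"expr": expression}).encode("utf-8")
    request = urllib.request.Request(stream_url, data=body)  # data implies POST
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A call could look like `run_expression(build_merge_expression("collectionA", "collectionB", "collectionC"), "http://localhost:8983/solr/collectionA/stream")`, where the host and port are assumptions about a local install.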

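Yes, it is possible. For information and code on how to index data from multiple heterogeneous data sources, see "Why is the tikaEntityProcessor not indexing the 'Text' field in the following data-config file?"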
