Solr Cloud: How to disable document (pdf, office) metadata as fields

I am new to Solr. I am using Solr 7.3.1 in SolrCloud mode and trying to index PDF and Office documents with Solr's content extraction.

I created a collection with:
bin\solr create -c tsindex -s 2 -rf 2

In SolrJ, my code looks like this:

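import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class SolrIndexer {
    public static void main(String[] args) {
        System.out.println("Solr Indexer");
        final String solrUrl = "http://localhost:8983/solr/tsindex/";
        String filename = "C:\\iSampleDocs\\doc-file.doc";
        // Requests to /update/extract are handled by Solr's content extraction (Tika).
        ContentStreamUpdateRequest solrRequest = new ContentStreamUpdateRequest("/update/extract");
        // try-with-resources closes the client when indexing is done.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            solrRequest.addFile(new File(filename), "application/msword");
            solrRequest.setParam("litral.ts_ref", "ts-456123");
            //solrRequest.setParam("defaultField", "text");

            // Commit right away and wait until the change is searchable.
            solrRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            NamedList<Object> result = solr.request(solrRequest);
            System.out.println(result);
        } catch (IOException | SolrServerException e) {
            e.printStackTrace();
        }
    }
}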

I am running into several issues:

  1. Although I created the field ts_ref with type text_general in the Solr Admin UI, this field never gets set.

  2. My goal is to index the complete document, including its metadata, in one field, and then set a couple more fields that reference the document in another system, e.g. the ts_ref field. What actually happens is that Solr extracts the file's metadata and creates a separate field for each metadata value.

I have tried disabling the data-driven schema functionality with: bin\solr config -c tsindex -zkHost localhost:9983 -property update.autoCreateFields -value false

When I uncomment the line solrRequest.setParam("defaultField", "text"); from the very beginning, no separate fields are created for the extracted metadata. But as soon as I comment that line out and upload files, the metadata ends up in separate fields again, and it stays that way even if I uncomment the line afterwards.

  1. "litral.ts_ref" contains a typo: it is missing an e. The parameter should be "literal.ts_ref".
  2. You can ignore all metadata fields by using the uprefix parameter together with a matching dynamic field. See the doc that shows exactly that case.
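
To make the two points above concrete, here is a minimal sketch of the request parameters. It assumes the schema already has a dynamic field named ignored_* whose type is neither indexed nor stored. That name and the ignored_ prefix are assumptions, not something taken from the question. ExtractParams is a hypothetical helper written for illustration and is not part of SolrJ. It uses only the JDK (Java 10 or newer for URLEncoder.encode with a Charset), so it runs without a Solr server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that collects parameters for the /update/extract handler. */
final class ExtractParams {

    /** Returns the parameters in insertion order. */
    static Map<String, String> build(String tsRef) {
        Map<String, String> params = new LinkedHashMap<>();
        // "literal." (spelled with the e) sets a field value directly on the document.
        params.put("literal.ts_ref", tsRef);
        // Fields that are not defined in the schema get this prefix, so the
        // dynamic field ignored_* (assumed to exist) can swallow them.
        params.put("uprefix", "ignored_");
        return params;
    }

    /** Encodes the parameters as a URL query string. */
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> params = build("ts-456123");
        // With SolrJ, each entry could be applied via solrRequest.setParam(key, value).
        System.out.println(toQueryString(params));
    }
}
```

In the question's SolrJ code, this would amount to replacing the misspelled setParam call with one call per map entry before sending the request.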
