Storing PDFs in Solr

I'm trying to set things up (in my local environment) so I can store PDFs in Solr, but I cannot get it to work. Right now I'm working with the files in the example folder Solr provides.

I did not modify the solrconfig.xml in solr-3.6.0/example/solr/conf because it already seems to be configured as described in Extracting Request Handler. That is, it already contains this:

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

And this:

<requestHandler name="/update/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <str name="fmap.content">text</str>
  <str name="lowernames">true</str>
  <str name="uprefix">ignored_</str>
  <str name="captureAttr">true</str>
  <str name="fmap.a">links</str>
  <str name="fmap.div">ignored_</str>
</lst>
</requestHandler>

I'm running Solr from the example directory with this command:

java -jar start.jar 

And I'm trying to send the PDF to Solr with this command:

java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/mypdf.pdf

If I don't make any changes to /Solr-3.6.0/example/solr/conf/schema.xml, I get this message:

FATAL: Solr returned an error #400 [doc=null] missing required field: id

If I change the value of the "required" attribute on the id field in schema.xml to false, I get:

FATAL: Solr returned an error #400 Document is missing mandatory uniqueKey field: id

I would think that if the required attribute of a field is false in the schema, I could send files that do not contain that field, but apparently that is not the case.

I have also tried adding the parameter -Dparams=literal.id=mypdf1 to the command that sends the PDF, but that doesn't help either. Any thoughts?

I believe my confusion was due to the fact that you need to have an id for the document you are sending to Solr, and at the same time there is an id element in Solr-3.6.0/example/solr/conf/schema.xml.

I believe the first error came from the required attribute on the id field in the schema. The second error came from the uniqueKey declaration, which makes the id mandatory for every document regardless of the required attribute.
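To see where each of the two errors comes from, it can help to list the lines of schema.xml that mention the id field and the uniqueKey. The following is only a sketch: show_id_declarations is a hypothetical helper name, and the pattern assumes the schema declares the field as <field name="id" ...> and the key as <uniqueKey>id</uniqueKey>, which is how I understand the stock example schema to look.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every line of a schema file
# that declares the id field or the uniqueKey.
# Assumes the declarations start with '<field name="id"' and '<uniqueKey>'.
show_id_declarations() {
  grep -n -E '<uniqueKey>|<field name="id"' "$1"
}

# Example call (path taken from the question):
# show_id_declarations /Applications/Solr-3.6.0/example/solr/conf/schema.xml
```

If both lines show up, setting required to false on the field only silences the first error; the uniqueKey line still requires every document to carry an id.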

With the help of ZeroPage I was able to overcome the second error as well, by adding the document id to the URL instead of passing it as a separate parameter. This command now works for me:

java -Durl="http://localhost:8983/solr/update/extract?literal.id=form1" -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/form1.pdf
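The same idea, with the id carried in the URL as literal.id, can also be tried with curl instead of post.jar. This is a minimal sketch rather than the method used above: extract_url and post_pdf are hypothetical helper names, the form field name myfile is arbitrary, and commit=true is added on the assumption that the document should be searchable right away.

```shell
#!/bin/sh
# Base URL of the local example Solr instance, as used in the post.
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: build the extract URL with the document id attached
# as literal.id. Assumes the id contains only URL-safe characters.
extract_url() {
  printf '%s/update/extract?literal.id=%s&commit=true\n' "$SOLR_URL" "$1"
}

# Hypothetical helper: upload a PDF as a multipart form field.
# $1 = document id, $2 = path to the PDF. Needs a running Solr instance.
post_pdf() {
  curl "$(extract_url "$1")" -F "myfile=@$2"
}

# Example call:
# post_pdf form1 /path/to/pdf/form1.pdf
```

Quoting the URL matters here for the same reason as in the post.jar command: an unquoted & would be interpreted by the shell.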

If we want Solr to index the full content of the PDF, we need to add the uprefix and fmap.content parameters:

java -Durl="http://localhost:8983/solr/update/extract?literal.id=form1&uprefix=attr_&fmap.content=attr_content&commit=true" -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/form1.pdf
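One way to check that the PDF text actually landed in the index is to look the document up by id and ask for the attr_content field. This is a sketch under assumptions: lookup_url and check_indexed are hypothetical helper names, and it assumes the schema maps attr_* to a stored dynamic field (as I believe the example schema does), so the extracted text comes back in the response.

```shell
#!/bin/sh
# Base URL of the local example Solr instance, as used in the post.
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: build a select URL that finds a document by id and
# returns only the id and attr_content fields.
# Assumes the id contains only URL-safe characters.
lookup_url() {
  printf '%s/select?q=id:%s&fl=id,attr_content\n' "$SOLR_URL" "$1"
}

# Hypothetical helper: run the lookup. Needs a running Solr instance.
check_indexed() {
  curl "$(lookup_url "$1")"
}

# Example call:
# check_indexed form1
```

If the response contains the document but no attr_content, the field is probably not stored in the schema in use.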


 