No “content” field created when indexing PDF with Solr

I have successfully indexed PDFs using the POST command as described in the following link: http://makble.com/how-to-extract-text-from-pdf-and-post-into-solr

Terms from an indexed PDF file can be found using general queries or by querying the text field.

However, I do not see a "content" field being generated, although the other PDF-related fields are. I tried editing the managed-schema file to add the following:

<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

<copyField source="content" dest="text"/>

I get the following error when I attempt to reload the core:

<str name="msg">Error handling 'reload' action</str>
<str name="trace">
org.apache.solr.common.SolrException: Error handling 'reload' action
    at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:110)
    at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:370)
    at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)

My solrconfig.xml has this:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

I would like to have the "content" field available so that I can search only the text extracted from the indexed PDF files.

1) Do not manually edit the schema file. Instead, use the Schema API.

2) In your configuration, fmap.content maps the extracted content to the _text_ field, so nothing is written to a separate content field. If a content field is already defined in the schema, removing this parameter from the ExtractingRequestHandler definition should do the job.
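
The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Solr address, the core name `mycore`, and the helper function names are hypothetical and not taken from the question. The field is declared with `indexed=True` because a field declared with indexed="false", as in the question, cannot be searched directly. Instead of removing fmap.content from solrconfig.xml, the sketch passes `fmap.content=content` as a request parameter, which should take precedence over the value in the handler's `defaults` list.

```python
import json
import urllib.request
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # assumption: local Solr instance
CORE = "mycore"                          # assumption: hypothetical core name


def add_field_request(name, field_type, indexed=True, stored=True, multi_valued=True):
    """Build a Schema API request that adds one field to the core's schema."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,      # must be true for the field to be searchable
            "stored": stored,
            "multiValued": multi_valued,
        }
    }
    return urllib.request.Request(
        f"{SOLR_URL}/{CORE}/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_url(doc_id, content_field="content"):
    """Build an /update/extract URL that routes the extracted text to content_field."""
    params = {
        "literal.id": doc_id,           # unique key for the indexed document
        "fmap.content": content_field,  # request-level override of the handler default
        "commit": "true",
    }
    return f"{SOLR_URL}/{CORE}/update/extract?{urlencode(params)}"


def extract_request(pdf_path, doc_id):
    """Build a request that posts the raw PDF bytes to the extracting handler."""
    with open(pdf_path, "rb") as pdf_file:
        body = pdf_file.read()
    return urllib.request.Request(
        extract_url(doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
```

Against a running Solr instance, each request could be sent with `urllib.request.urlopen(...)`, first `add_field_request("content", "text_general")` and then `extract_request("example.pdf", "doc1")`, where `example.pdf` is a placeholder path. A query such as `content:someterm` would then search only the extracted PDF text.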
