How to index a PDF's content with SolrJ?
I'm trying to index a few PDF documents using SolrJ, as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. The code is below:
import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;
import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
...
public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException {
    String urlString = "http://localhost:8080/solr";
    SolrServer server = new CommonsHttpSolrServer(urlString);

    // Send the file to the extracting handler (Solr Cell / Tika)
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(fileName));

    // Use the bare file name as the document id
    String id = fileName.substring(fileName.lastIndexOf('/') + 1);
    System.out.println(id);

    up.setParam(LITERALS_PREFIX + "id", id);
    up.setParam(LITERALS_PREFIX + "location", fileName); // this field doesn't exist in schema.xml, it'll be created as attr_location
    up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
    up.setParam(MAP_PREFIX + "content", "attr_content");
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    // Print whatever the server sends back
    NamedList<Object> response = server.request(up);
    for (Entry<String, Object> entry : response) {
        System.out.println(entry.getKey());
        System.out.println(entry.getValue());
    }
}
Unfortunately, when querying for *:* I get the list of indexed documents, but the content field is empty. How can I change the code above so that it also extracts the document's content?
Below is the XML fragment that describes this document:
<doc>
  <arr name="attr_content">
    <str> </str>
  </arr>
  <arr name="attr_location">
    <str>/home/alex/Documents/lsp.pdf</str>
  </arr>
  <arr name="attr_meta">
    <str>stream_size</str>
    <str>31203</str>
    <str>Content-Type</str>
    <str>application/pdf</str>
  </arr>
  <arr name="attr_stream_size">
    <str>31203</str>
  </arr>
  <arr name="content_type">
    <str>application/pdf</str>
  </arr>
  <str name="id">lsp.pdf</str>
</doc>
I don't think this problem is caused by an incorrect installation of Apache Tika: previously I got a few ServerExceptions, but I have since installed the required jars in the correct path. Moreover, I've tried to index a txt file using the same class, and the attr_content field is empty there too.
In your schema.xml file, have you set stored="true" on the content field? Here is an example from my schema.xml, which I use to store the content of PDFs and other binary files:
<field name="text" type="textgen" indexed="true" stored="true" required="false" multiValued="true"/>
Did it help you?
Héctor
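As a side note for anyone debugging the same symptom: it can help to see exactly which request parameters the SolrJ calls in the question end up sending to /update/extract. The sketch below is plain Java with no SolrJ dependency and only rebuilds that query string. ExtractParamsSketch and buildQuery are hypothetical names, the parameter names assume the ExtractingParams constants resolve to "literal.", "uprefix" and "fmap.", and extractOnly=true is an addition that is not in the original code (as far as I know, Solr Cell then returns the extracted text without indexing it).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractParamsSketch {

    // Rebuilds the query string that the question's setParam calls translate to.
    static String buildQuery(String fileName) {
        // Same id derivation as in the question: the bare file name.
        String id = fileName.substring(fileName.lastIndexOf('/') + 1);

        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", id);                // LITERALS_PREFIX + "id"
        params.put("literal.location", fileName);    // LITERALS_PREFIX + "location"
        params.put("uprefix", "attr_");              // UNKNOWN_FIELD_PREFIX
        params.put("fmap.content", "attr_content");  // MAP_PREFIX + "content"
        params.put("extractOnly", "true");           // assumption: added for debugging only

        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("http://localhost:8080/solr/update/extract?"
                + buildQuery("/home/alex/Documents/lsp.pdf"));
    }
}
```

If posting the PDF to the printed URL returns an empty body, that would point at the extraction step rather than at the schema; if the text is there, the stored="true" setting on the target field is the next thing to check.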