
SOLR RuntimeException during indexing: how to write document id to log?

We are indexing millions of documents. We use Solr 3.1 and Jetty. I enabled logging in Jetty as described here: http://wiki.apache.org/solr/LoggingInDefaultJettySetup

For some full texts we get exceptions, and therefore log records like this one:

<record>
  <date>2012-09-04T15:55:16</date>
  <millis>1346766916578</millis>
  <sequence>0</sequence>
  <logger>org.apache.solr.core.SolrCore</logger>
  <level>SEVERE</level>
  <class>org.apache.solr.common.SolrException</class>
  <method>log</method>
  <thread>10</thread>
  <message>java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xd835(a surrogate character)  at char #1144, byte #127)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)

</message>
</record>

It would be great to also log the id of the document that was sent. How can we do this?

Thank you!

Are you asking how to get Jetty to log the ID? It is unlikely that you will be able to log it through Jetty, because the XML in the request cannot be parsed far enough to reach the ID value. Notice that the stack trace shows the XMLLoader.readDoc() method never getting past line 287. Here's the code for that class (for your version): http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_1/solr/src/java/org/apache/solr/handler/XMLLoader.java?revision=1086927&view=markup

The relevant section:

263   SolrInputDocument readDoc(XMLStreamReader parser) throws XMLStreamException {
264     SolrInputDocument doc = new SolrInputDocument();
265
266     String attrName = "";
267     for (int i = 0; i < parser.getAttributeCount(); i++) {
268       attrName = parser.getAttributeLocalName(i);
269       if ("boost".equals(attrName)) {
270         doc.setDocumentBoost(Float.parseFloat(parser.getAttributeValue(i)));
271       } else {
272         XmlUpdateRequestHandler.log.warn("Unknown attribute doc/@" + attrName);
273       }
274     }
275
276     StringBuilder text = new StringBuilder();
277     String name = null;
278     float boost = 1.0f;
279     boolean isNull = false;
280     while (true) {
281       int event = parser.next();
282       switch (event) {
283         // Add everything to the text
284         case XMLStreamConstants.SPACE:
285         case XMLStreamConstants.CDATA:
286         case XMLStreamConstants.CHARACTERS:
287           text.append(parser.getText());

At that point the Solr document has not yet been built, so there's no real way to get to the record's ID field.

The workaround is to have your indexer script check the status codes of the Solr responses and write the record ID to a log if the status is not 0 (success). Likewise, if you are using Java, PHP, or another language that can trap exceptions, you can catch those too and write them to the log.
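As a rough illustration of that workaround, here is a minimal Python sketch of a client-side indexer that sends one document per request and logs the id of any document that fails. Everything named here is an assumption for illustration and not part of the original setup: the `SOLR_UPDATE_URL` endpoint, the `id` field name, and the helper functions `build_add_xml`, `post_to_solr`, and `index_documents`. It assumes Solr answers a successful update with the usual XML response containing `<int name="status">0</int>` in the `responseHeader`, and that a failed update surfaces either as a non-zero status or as an HTTP error.

```python
import logging
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

log = logging.getLogger("indexer")

# Assumed update endpoint; adjust host, port, and path to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(doc):
    """Build a Solr <add> message for one document (a dict of field -> value)."""
    fields = "".join(
        "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        for name, value in doc.items()
    )
    return "<add><doc>%s</doc></add>" % fields


def post_to_solr(xml_body, url=SOLR_UPDATE_URL):
    """POST an update message and return Solr's status code (0 means success)."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        root = ET.fromstring(response.read())
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    return int(status.text) if status is not None else -1


def index_documents(docs, post=post_to_solr, id_field="id"):
    """Send documents one per request; return the ids of those that failed."""
    failed_ids = []
    for doc in docs:
        doc_id = doc.get(id_field)
        try:
            status = post(build_add_xml(doc))
        except (urllib.error.URLError, UnicodeEncodeError, ET.ParseError) as exc:
            # The id is known on the client side even if Solr never parsed it.
            log.error("Indexing failed for document id=%s: %s", doc_id, exc)
            failed_ids.append(doc_id)
            continue
        if status != 0:
            log.error("Solr returned status %s for document id=%s", status, doc_id)
            failed_ids.append(doc_id)
    return failed_ids
```

Sending one document per request keeps the failure attributable to a single id, at the cost of throughput. With millions of documents, one possible variation is to send batches and fall back to per-document requests only for a batch that fails.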

Hope this helps, and good luck.
