Java SAXParser trims the string after &

I want to parse this XML:

<sparql xmlns="http://www.w3.org/2005/sparql-results#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
 <head>
  <variable name="uri"/>
  <variable name="id"/>
  <variable name="label"/>
</head>
<results distinct="false" ordered="true">
<result>
  <binding name="uri"><uri>http://dbpedia.org/resource/Davis_&amp;_Weight_Motorsports</uri></binding> 
  <binding name="label"><literal xml:lang="en">Davis &amp; Weight Motorsports</literal></binding>
  <binding name="id"><literal datatype="http://www.w3.org/2001/XMLSchema#integer">5918444</literal></binding>
  <binding name="label"><literal xml:lang="en">Davis &amp; Weight Motorsports</literal></binding>
</result></results></sparql>

This is my handler:

public class DBpediaLookupClient extends DefaultHandler{

public DBpediaLookupClient(String query) throws Exception {
    this.query = query;
   HttpMethod method = new GetMethod("some_uri&query=" + query2);
    try {         
      client.executeMethod(method);       
      InputStream ins = method.getResponseBodyAsStream();
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser sax = factory.newSAXParser();
      sax.parse(ins, this);


    } catch (HttpException he) {
      System.err.println("Http error connecting to lookup.dbpedia.org");
    } catch (IOException ioe) {
      System.err.println("Unable to connect to lookup.dbpedia.org");
    }
    method.releaseConnection();
  }

public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {      
    if (qName.equalsIgnoreCase("td") || qName.equalsIgnoreCase("uri") || qName.equalsIgnoreCase("literal")) {
      tempBinding = new HashMap<String, String>();
    }
    lastElementName = qName;
  }

  public void endElement(String uri, String localName, String qName) throws SAXException {     
    if (qName.equalsIgnoreCase("uri") || qName.equalsIgnoreCase("literal") || qName.equalsIgnoreCase("td")) {
      if (!variableBindings.contains(tempBinding))
        variableBindings.add(tempBinding);
    }
  }

  public void characters(char[] ch, int start, int length) throws SAXException {
    String s = new String(ch, start, length).trim();
    if (s.length() > 0) {
      if ("td".equals(lastElementName)) {
        if (tempBinding.get("td") == null) {
          tempBinding.put("td", s);
        }           
      }

      else if ("uri".equals(lastElementName)) {
            if (tempBinding.get("uri") == null) {
                  tempBinding.put("uri", s);
                }
      }
      else if ("literal".equals(lastElementName)) {
            if (tempBinding.get("literal") == null) {
                  tempBinding.put("literal", s);
                }
      }
      //if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("URI".equals(lastElementName) && s.indexOf("Category")==-1 && tempBinding.get("URI") == null) {
        tempBinding.put("URI", s);
      }
      if ("Label".equals(lastElementName)) tempBinding.put("Label", s);
    }
  }
}

And this is the result:

key: uri, value: http://dbpedia.org/resource/Davis_
key: literal, value: 5918444
key: literal, valueDavis

As you can see, each value gets cut off at the &.

When I trace through the characters() function, I see that the length is wrong: it only runs up to the & instead of up to the end of the string that I want to get as the result.

I copied this part of the code and I don't know much about parsers and handlers; I only know what I picked up from tracing the code. Wherever I searched, it was said that an XML document should contain &amp; instead of &, which is already the case here.

What should I change in this code to get the complete string, without it being cut off at the & character?

This is a lesson everyone has to learn when using SAX: the parser can break up text nodes and report the content in multiple calls to characters(), and it's the application's job to reassemble it (e.g. by using a StringBuilder). It's very common for parsers to break the text at any point where they would otherwise have to shunt characters around in memory, e.g. where entity references occur or where the parser hits an I/O buffer boundary.

It was designed this way to make SAX parsers super-efficient by minimizing text copying, but I suspect there's no real benefit, because the text copying just has to be done by the application instead.

Don't try to second-guess the parser as @DavidWallace suggests. The parser is allowed to break the text up any way it likes, and your application should cater for that.
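To make the reassembly concrete, here is a minimal sketch of a handler that only appends in characters() and reads the collected text in endElement(). It is not the asker's class: the names BindingHandler and parseBindings are hypothetical, and keying each value by the name attribute of its enclosing <binding> is an assumption made for illustration. It also assumes the default, non-namespace-aware SAXParserFactory, so qName is the plain element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds one map per <result>, keyed by the
// name attribute of each <binding> (an assumption, not the asker's layout).
class BindingHandler extends DefaultHandler {
    private final List<Map<String, String>> results = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> current;
    private String bindingName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("result".equals(qName)) {
            current = new LinkedHashMap<>();
        } else if ("binding".equals(qName)) {
            bindingName = attributes.getValue("name");
        } else if ("uri".equals(qName) || "literal".equals(qName)) {
            text.setLength(0); // start collecting a fresh value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for a single text node
        // (e.g. around &amp;), so only append here.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("uri".equals(qName) || "literal".equals(qName)) {
            // The text node is complete only once the element ends.
            if (current != null && bindingName != null) {
                current.put(bindingName, text.toString().trim());
            }
        } else if ("binding".equals(qName)) {
            bindingName = null;
        } else if ("result".equals(qName)) {
            results.add(current);
            current = null;
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are rethrown unchecked.
    static List<Map<String, String>> parseBindings(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            BindingHandler handler = new BindingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Because the value is stored only in endElement(), it does not matter whether the parser delivers "Davis ", "&" and " Weight Motorsports" in one call or three. The same pattern can be applied to the original handler by replacing the tempBinding.put(...) calls in characters() with an append and moving the put into endElement().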
