
Best Java approach for stream filtering in XML?

I want to take an XML file as input and output the same XML, except for some search/replace actions on attributes and text, based on matching certain node characteristics.

What's the best general approach for this, and are there tutorials somewhere?

DOM is out since I can't guarantee being able to keep the whole thing in memory.

I don't mind using SAX or StAX, except that I want the default behavior to be a pass-through no-op filter; I did something similar with StAX once and it was a pain, didn't work with namespaces, and I was never sure if I had included all the cases I needed to handle.

I think XSLT won't work (but am not sure), because it's declarative and I need to do some procedural calculations when figuring out what text/attributes to emit on the output.

(contrived example:

Suppose I was looking for all nodes with XPath of /group/item/@number and wanted to evaluate the number attribute as an integer, factor it using a method public List<Integer> factorize(int i), convert the list of factors to a space-delimited string, and add an attribute factors to the corresponding /group/item node?

input:

<group name="beatles"><item name="paul" number="64"></group>
<group name="rolling stones"><item name="mick" number="19"></group>
<group name="who"><item name="roger" number="515"></group>

expected output:

<group name="beatles"><item name="paul" number="64" factors="2 2 2 2 2 2"></group>
<group name="rolling stones"><item name="mick" number="19" factors="19"></group>
<group name="who"><item name="roger" number="515" factors="103 5"></group>

)

Update: I got the StAX XMLEventReader/Writer method working easily, but it doesn't preserve certain formatting quirks that are important in my application. (I guess the program that saves/loads XML doesn't honor valid XML files. >:( argh.) Is there a way to process XML that minimizes textual differences between input and output? (At least when it comes to character data.)

XSLT seems like an appropriate model for what you are doing. Look into using XSLT with procedural extensions.

If you really can't keep the whole document in memory, Saxon is your only XSLT choice. It's likely that whatever calculations you need to do can be done in XSLT -- but if not, it's not too hard to write your own extension functions.

I find Apache Digester a big help for rules-based parsing of XML.

Update: If it's filtering and output that you're concerned with, review this set of articles on Developerworks which is concerned with the same issues. Of particular relevance are parts 2, 3 and 4. The summary: use SAX, XMLFilter and XMLWriter.

While I suppose this is technically a good fit for XSLT, I've always found it hard to debug for complex transformations. YMMV :-)

Further Update: XMLWriter is available from here. I don't know what your particular difficulty with SAX is. I created a file groups.xml containing:

<groups>
<group name="beatles"><item name="paul" number="64"/></group>
<group name="rolling stones"><item name="mick" number="19"/></group>
<group name="who"><item name="roger" number="515"/></group>
</groups>

Note that I had to make some changes to make it well-formed XML. Then, I knocked up this simple Jython script, groups.py, to illustrate how to solve your problem:

import java.io
import org.xml.sax.helpers
import sys

sys.path.append("xml-writer.jar")
import com.megginson.sax

def get_factors(n):
    return "factors for %s" % n

class MyFilter(org.xml.sax.helpers.XMLFilterImpl):
    def startElement(self, uri, localName, qName, attrs):
        if qName == "item":
            newAttrs = org.xml.sax.helpers.AttributesImpl(attrs)
            n = attrs.length
            for i in range(n):
                name = attrs.getLocalName(i)
                if name == "number":
                    newAttrs.addAttribute("", "factors", "factors",
                                          "CDATA",
                                          get_factors(attrs.getValue(i)))
            attrs = newAttrs
        #call superclass method...
        org.xml.sax.helpers.XMLFilterImpl.startElement(self, uri, localName,
                                                       qName, attrs)

source = org.xml.sax.InputSource(java.io.FileInputStream("groups.xml"))
reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader()
filter = MyFilter(reader)
writer = com.megginson.sax.XMLWriter(filter,
                                     java.io.FileWriter("output.xml"))
writer.parse(source)

Obviously, I've mocked up the factor-finding function, as your example was, I believe, purely illustrative. The script reads groups.xml, applies a filter, and writes the result to output.xml. Let's run it:

$ jython groups.py
$ cat output.xml
<?xml version="1.0" standalone="yes"?>

<groups>
<group name="beatles"><item name="paul" number="64" factors="factors for 64"></item></group>
<group name="rolling stones"><item name="mick" number="19" factors="factors for 19"></item></group>
<group name="who"><item name="roger" number="515" factors="factors for 515"></item></group>
</groups>

Job done? Of course, you'll need to transcribe this code to Java.
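For what it's worth, a Java transcription of the filter might look roughly like the sketch below. Two things in it are assumptions and not part of the script above: the class name GroupsFilter and its static filter helper are made up for illustration, and the output side uses the JDK's own TransformerHandler (an identity transform acting as a serializer) instead of com.megginson.sax.XMLWriter, so that it runs without extra jars. getFactors is still the same mock.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class name; mirrors MyFilter from the Jython script.
public class GroupsFilter extends XMLFilterImpl {

    public GroupsFilter(XMLReader parent) {
        super(parent);
    }

    // Placeholder, mocked up just like get_factors in the Jython script.
    static String getFactors(String n) {
        return "factors for " + n;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs)
            throws SAXException {
        if ("item".equals(qName)) {
            String number = attrs.getValue("number"); // lookup by qualified name
            if (number != null) {
                // Copy the incoming attributes and append the new one.
                AttributesImpl newAttrs = new AttributesImpl(attrs);
                newAttrs.addAttribute("", "factors", "factors", "CDATA", getFactors(number));
                attrs = newAttrs;
            }
        }
        // Everything else passes through unchanged via XMLFilterImpl.
        super.startElement(uri, localName, qName, attrs);
    }

    // Assumed helper: parses the given XML text and returns the filtered XML.
    static String filter(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // JDK identity transformer used as the serializer (stand-in for XMLWriter).
        SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        GroupsFilter filter = new GroupsFilter(reader);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filter(
            "<groups><group name=\"who\"><item name=\"roger\" number=\"515\"/></group></groups>"));
    }
}
```

Only startElement is overridden; every other SAX callback falls through to XMLFilterImpl's default, which is the pass-through no-op behaviour the question asks for.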

StAX should work well for you. Piping input to output is super easy; you just write the XMLEvent you get from the XMLEventReader to the XMLEventWriter.

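XMLEventFactory EVT_FACTORY;
XMLEventReader reader;
XMLEventWriter writer;

QName numberQName = new QName("number");
QName factorsQName = new QName("factors");
while (reader.hasNext()) {
  XMLEvent e = reader.nextEvent();
  // XMLEventReader does not deliver attributes as separate events; they hang
  // off the StartElement, so rebuild that element with the extra attribute.
  if (e.isStartElement()) {
    StartElement se = e.asStartElement();
    Attribute number = se.getAttributeByName(numberQName);
    if (number != null) {
      // assumes factorize returns the space-delimited string directly
      String factors = factorize(Integer.parseInt(number.getValue()));
      List<Attribute> attrs = new ArrayList<>();
      se.getAttributes().forEachRemaining(a -> attrs.add((Attribute) a));
      attrs.add(EVT_FACTORY.createAttribute(factorsQName, factors));
      e = EVT_FACTORY.createStartElement(se.getName(), attrs.iterator(), se.getNamespaces());
    }
  }
  // pass through
  writer.add(e);
}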
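To try the pass-through loop end to end, here is a minimal self-contained sketch. The class name StaxFactorsFilter, the filter helper, and the trial-division factorize are my own assumptions for illustration; factorize follows the List<Integer> signature from the question and returns the factors in ascending order, so 515 comes out as "5 103" instead of the "103 5" shown in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, for illustration only.
public class StaxFactorsFilter {
    private static final QName ITEM = new QName("item");
    private static final QName NUMBER = new QName("number");
    private static final QName FACTORS = new QName("factors");

    // Assumed implementation: trial division, prime factors in ascending order.
    public static List<Integer> factorize(int n) {
        List<Integer> factors = new ArrayList<>();
        for (int p = 2; (long) p * p <= n; p++) {
            while (n % p == 0) {
                factors.add(p);
                n /= p;
            }
        }
        if (n > 1) {
            factors.add(n); // whatever remains is itself prime
        }
        return factors;
    }

    // Copies every event to the output, rewriting only <item number="..."> start tags.
    public static String filter(String xml) throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventReader reader =
            XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement() && ITEM.equals(e.asStartElement().getName())) {
                StartElement se = e.asStartElement();
                Attribute number = se.getAttributeByName(NUMBER);
                if (number != null) {
                    String joined = factorize(Integer.parseInt(number.getValue())).stream()
                        .map(String::valueOf)
                        .collect(Collectors.joining(" "));
                    // Rebuild the start tag with the original attributes plus "factors".
                    List<Attribute> attrs = new ArrayList<>();
                    Iterator<?> it = se.getAttributes();
                    while (it.hasNext()) {
                        attrs.add((Attribute) it.next());
                    }
                    attrs.add(events.createAttribute(FACTORS, joined));
                    e = events.createStartElement(se.getName(), attrs.iterator(),
                                                  se.getNamespaces());
                }
            }
            writer.add(e); // pass through
        }
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(filter(
            "<groups><group name=\"beatles\"><item name=\"paul\" number=\"64\"/></group></groups>"));
    }
}
```

Note that the writer re-serializes every event, so details such as attribute order, quoting style, or empty-element syntax in the output are up to the StAX implementation and may differ from the input.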
