
Using regexp in Java to modify an XML document

I'm trying to modify an XML document using regular expressions in Java, but I can't find the right way. I have XML like this (simplified):

<ROOT>
   <NODE ord="1" />
   <NODE ord="3,2" />
</ROOT>

The XML actually represents a sentence with its nodes, chunks, and so on in two languages, and has more attributes. Each sentence is loaded into two RichTextAreas (one for the source sentence and the other for the translated one).

What I need to do is add a style attribute to every node that has a specific value in its ord attribute (this style attribute will show correspondences between the two languages, like Google Translate does when you mouse over a word). I know this could be done using DOM (getting all the NODE elements and then checking the ord attribute one by one), but I am looking for the fastest way to make the change, as it is going to run on the client side of my GWT app.

When the ord attribute has a single value (as in the first node), it is easy: just take the XML as a string and use the replaceAll() function. The problem is when the attribute has a composite value (as in the second node).

For example, how could I add that attribute if the value I'm looking for is 2? I believe this could be done using regular expressions, but I can't figure out how. Any hint or help would be appreciated (even if it doesn't use regexp and the replaceAll function).

Thanks in advance.

String resultString = subjectString.replaceAll("<NODE ord=\"([^\"]*\\b2\\b[^\"]*)\" />", "<NODE ord=\"$1\" style=\"whatever\"/>");

This finds any <NODE> tag that has a single ord attribute whose value contains "2" as a whole item ("2", "1,2", "2,3" or "1,2,3", but not "12") and adds a style attribute.
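To check that behaviour, here is a minimal sketch that wraps the same pattern in a helper and runs it on a sample. The class name OrdRegexDemo, the addStyle helper and its parameters are made up for illustration. It assumes a plain JVM (Pattern.quote and Matcher.quoteReplacement may not be available in GWT client code) and that each tag is written exactly as <NODE ord="..." /> with no other attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrdRegexDemo {

    // Hypothetical helper: adds a style attribute to every <NODE ord="..." />
    // whose ord value contains `value` as a whole word.
    public static String addStyle(String xml, String value, String style) {
        String regex = "<NODE ord=\"([^\"]*\\b" + Pattern.quote(value) + "\\b[^\"]*)\" />";
        String replacement = "<NODE ord=\"$1\" style=\""
                + Matcher.quoteReplacement(style) + "\" />";
        return xml.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String xml = "<ROOT><NODE ord=\"1\" /><NODE ord=\"3,2\" /><NODE ord=\"12\" /></ROOT>";
        // Only the node with ord="3,2" gets the style attribute;
        // ord="1" and ord="12" are left untouched.
        System.out.println(addStyle(xml, "2", "whatever"));
    }
}
```

Note that the pattern is tied to the exact tag layout, so a node with extra attributes or different spacing would simply not match.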

This is quick and dirty, and rightfully advised against by many here, but for a one-off quick job it should be OK.

Explanation:

<NODE ord="  # Match <NODE ord=" verbatim
(            # Match and capture...
 [^"]*       #  any number of characters except "
 \b2\b       #  "2" as a whole word (surrounded by non-alphanumerics)
 [^"]*       #  any number of characters except "
)            # End of capturing group
" />         # Match " /> verbatim

XPath can do this for you. You could select:

/ROOT/NODE[contains(concat(',', @ord, ','), ',2,')]
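As a rough illustration, the expression above can be tried out with the standard javax.xml.xpath API. This sketch assumes a plain JVM rather than GWT client code. The class name OrdXPathDemo and the matchingOrds helper are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OrdXPathDemo {

    // Hypothetical helper: returns the ord values of the NODE elements
    // selected by the XPath expression.
    public static List<String> matchingOrds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "/ROOT/NODE[contains(concat(',', @ord, ','), ',2,')]",
                    doc, XPathConstants.NODESET);
            List<String> ords = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ords.add(((Element) hits.item(i)).getAttribute("ord"));
            }
            return ords;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ROOT><NODE ord=\"1\" /><NODE ord=\"3,2\" /><NODE ord=\"12\" /></ROOT>";
        // Only the node whose ord list contains 2 as a whole item is selected.
        System.out.println(matchingOrds(xml));
    }
}
```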

Since you intend to use GWT on the client, you could give gwtxslt a try. With it you could specify an XSLT stylesheet to do the transformation (i.e. adding the attribute) for you:

XsltProcessor processor = new XsltProcessor();
processor.importStyleSheet(styleSheetText);
processor.importSource(sourceText);
processor.setParameter("ord", "2");
processor.setParameter("style", "whatever");
String resultString = processor.transform();
// do something with resultString

where styleSheetText could be an XSLT document along the lines of:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="ord"   select="''" />
  <xsl:param name="style" select="''" />

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="NODE">
    <xsl:copy>
      <xsl:apply-templates select="@*" />
      <xsl:if test="contains(concat(',', @ord, ','), concat(',', $ord, ','))">
        <xsl:attribute name="style">
          <xsl:value-of select="$style" />
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="node()" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Note that I use concat() to prevent partial matches in the comma-separated list that the value of @ord actually is.
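The same delimiter trick can be sketched outside XSLT. The snippet below is only an illustration in plain Java: the class OrdListMatch and the containsValue helper are hypothetical names, and it assumes the list has no whitespace around the commas.

```java
class OrdListMatch {

    // Hypothetical helper: wraps both the list and the value in commas so that
    // "2" matches the item 2 but not part of 12 or 21.
    public static boolean containsValue(String ord, String value) {
        return ("," + ord + ",").contains("," + value + ",");
    }

    public static void main(String[] args) {
        System.out.println(containsValue("3,2", "2"));   // true
        System.out.println(containsValue("12", "2"));    // false
        System.out.println(containsValue("1,2,3", "2")); // true
    }
}
```

Without the surrounding commas, a plain substring test on "12" would wrongly report a match for "2".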

I'm trying to change an xml by using regular expressions in java, but I can't find the right way.

That's because there isn't a right way. Regular expressions are not the right way to manipulate XML, because XML is not a regular grammar (which is a technical term in computer science, not a generalized insult).

It might sound like overkill, but I'd consider using the standard DOM parsers to read the fragment, modify it using setAttribute() calls, and then write it out again. I know you said that efficiency is important, but how long does this really take? Testing shows 60ms on my ageing 2GHz Pentium.

This approach will be more robust against comments, things split across lines, etc. It is also much more likely to give you well-formed XML. Also, things like your requirement of only doing it when certain values are present will become trivial.

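import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class AddStyleExample {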

    public static void main(final String[] args) {
        String input = "<ROOT> <NODE ord=\"1\" /> <NODE ord=\"3,2\" /> </ROOT>";
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory
                    .newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(false);
            DocumentBuilder builder;

            builder = factory.newDocumentBuilder();

            final Document doc = builder.parse(new InputSource(
                    new StringReader(input)));

            NodeList tags = doc.getElementsByTagName("NODE");
            for (int i = 0; i < tags.getLength(); i++) {
                Element node = (Element) tags.item(i);
                node.setAttribute("style", "example value");
            }
            StringWriter writer = new StringWriter();
            final StreamResult result = new StreamResult(writer);
            final Transformer t = TransformerFactory.newInstance()
                    .newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), result);
            System.out.println(writer.toString());

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (TransformerException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}
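The example above sets the attribute on every NODE. Restricting it to nodes whose ord list contains a given value could look roughly like the sketch below. The class ConditionalStyleExample and its parse and addStyle helpers are made-up names. The sketch assumes a plain JVM and a comma-separated ord value without surrounding whitespace.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConditionalStyleExample {

    // Hypothetical helper: parses an XML string into a DOM document.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: sets style on each NODE whose ord list contains
    // `value` as a whole item, and returns how many nodes were changed.
    public static int addStyle(Document doc, String value, String style) {
        int changed = 0;
        NodeList tags = doc.getElementsByTagName("NODE");
        for (int i = 0; i < tags.getLength(); i++) {
            Element node = (Element) tags.item(i);
            String[] items = node.getAttribute("ord").split(",");
            if (Arrays.asList(items).contains(value)) {
                node.setAttribute("style", style);
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Document doc = parse("<ROOT><NODE ord=\"1\" /><NODE ord=\"3,2\" /></ROOT>");
        // Only the second node lists 2, so one node is changed.
        System.out.println(addStyle(doc, "2", "whatever"));
    }
}
```

The modified document can then be written out with a Transformer in the same way as in the example above.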
