
How to write a proper regex to recognize XML content?

I have some content and I would like to know whether it is XML or not. How can I do that? I only need a true or false answer from a method's return value. I plan to use a regex but am open to better suggestions.

The XML content looks like the following and will always be in the same format (although the number of molecule IDs may grow or shrink):

<?xml version="1.0" encoding="UTF-8"?>
<molecules>
    <molecule id="1">
        <atoms>
            <atom id="1" symbol="C"/>
            <atom id="2" symbol="C"/>
            <atom id="3" symbol="N"/>
        </atoms>
        <bonds>
            <bond id="1" atomAId="1" atomBId="2" order="SINGLE"/>
            <bond id="2" atomAId="2" atomBId="3" order="DOUBLE"/>
        </bonds>
    </molecule>
    <molecule id="2">
        <atoms>
            <atom id="1" symbol="C"/>
            <atom id="2" symbol="C"/>
            <atom id="3" symbol="N"/>
        </atoms>
        <bonds>
            <bond id="1" atomAId="1" atomBId="2" order="SINGLE"/>
            <bond id="2" atomAId="2" atomBId="3" order="DOUBLE"/>
        </bonds>
    </molecule>
</molecules> 

I wrote the following regex to recognize the XML:

public static final String REGEX_FOR_XML = "((<(\\S(.*?))(\\s.*?)?>(.*?)<\\/\\3>)|(<\\S(.*?)(.*?)(\\/>)))";

The issue is that it only matches the inner content, whereas I would like it to match the entire content. I use this validator for matching:

public static boolean isValidXML(String inXMLStr) {

    if (inXMLStr == null || inXMLStr.isEmpty())
        return false;

    final Pattern pattern = Pattern.compile(Constants.REGEX_FOR_XML);
    if (pattern.matcher(inXMLStr).matches()) {
        return true;
    }
    return false;
}

How can I correct the regex so that it matches the XML content, or what would be a better option?

There is an infamous answer on using regex for XML parsing, which I will not link (@Henrik did anyway ;P) or go into. Bottom line: regex is very rarely a good idea for XML validation (or parsing, for that matter).
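If all that is needed is a true/false answer to "is this XML at all", one option is to let a real parser decide. The following is a minimal sketch, assuming that well-formedness alone is enough and using the JDK's built-in DOM parser. The class name XmlWellFormedness and the method name isWellFormedXml are made up for illustration and do not come from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks well-formedness only, not the molecule structure.
final class XmlWellFormedness {

    static boolean isWellFormedXml(String inXMLStr) {
        if (inXMLStr == null || inXMLStr.isEmpty()) {
            return false;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the input may be untrusted, so enable secure processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from logging to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(inXMLStr)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Any parse failure means the content is not well-formed XML.
            return false;
        }
    }
}
```

Note that this accepts any well-formed document, for example a lone <foo/>, so it does not check that the content follows the molecules format.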

I suggest you start here: XML validation (Oracle Docs).

I guess that is what you want. In Java you can use schema validation to validate XML, which is what you want to do if I read the question correctly.

What you will have to do is write a schema definition instead of a regex. This is not only the correct and straightforward way to go, it will also be much easier to maintain. It is not rocket science either, and your format seems clear enough to be condensed into an XSD fairly easily. There are also tools that can help you do that, though their output might still have to be fine-tuned.

Note: I know that "link-only" answers are discouraged on SO, but the resource is too big to be copied into the answer (at least IMHO). Also, Oracle might hold copyright on it. Since it is official Oracle documentation, it is probably not prone to becoming a broken link either.
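To give a rough idea of what this could look like, here is a sketch using the standard javax.xml.validation API. The XSD is my own guess inferred from the sample in the question (element order, required attributes, positive integer IDs), not an official schema, and the class name MoleculesSchemaValidator is hypothetical. The schema is embedded as a string only to keep the example self-contained; in a real project it would more likely live in its own .xsd file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class MoleculesSchemaValidator {

    // Hypothetical XSD inferred from the sample document; tighten the types as needed.
    // Single quotes are legal XML attribute delimiters and avoid Java escaping.
    private static final String MOLECULES_XSD = String.join("\n",
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>",
        "  <xs:element name='molecules'>",
        "    <xs:complexType><xs:sequence>",
        "      <xs:element name='molecule' type='moleculeType' minOccurs='0' maxOccurs='unbounded'/>",
        "    </xs:sequence></xs:complexType>",
        "  </xs:element>",
        "  <xs:complexType name='moleculeType'>",
        "    <xs:sequence>",
        "      <xs:element name='atoms'>",
        "        <xs:complexType><xs:sequence>",
        "          <xs:element name='atom' type='atomType' maxOccurs='unbounded'/>",
        "        </xs:sequence></xs:complexType>",
        "      </xs:element>",
        "      <xs:element name='bonds'>",
        "        <xs:complexType><xs:sequence>",
        "          <xs:element name='bond' type='bondType' minOccurs='0' maxOccurs='unbounded'/>",
        "        </xs:sequence></xs:complexType>",
        "      </xs:element>",
        "    </xs:sequence>",
        "    <xs:attribute name='id' type='xs:positiveInteger' use='required'/>",
        "  </xs:complexType>",
        "  <xs:complexType name='atomType'>",
        "    <xs:attribute name='id' type='xs:positiveInteger' use='required'/>",
        "    <xs:attribute name='symbol' type='xs:string' use='required'/>",
        "  </xs:complexType>",
        "  <xs:complexType name='bondType'>",
        "    <xs:attribute name='id' type='xs:positiveInteger' use='required'/>",
        "    <xs:attribute name='atomAId' type='xs:positiveInteger' use='required'/>",
        "    <xs:attribute name='atomBId' type='xs:positiveInteger' use='required'/>",
        "    <!-- Could become an enumeration once the allowed bond orders are known. -->",
        "    <xs:attribute name='order' type='xs:string' use='required'/>",
        "  </xs:complexType>",
        "</xs:schema>");

    // A compiled Schema is thread-safe and can be shared; Validator instances are not.
    private static final Schema SCHEMA = compileSchema();

    private static Schema compileSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(MOLECULES_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Embedded XSD is invalid", e);
        }
    }

    static boolean isValidXML(String inXMLStr) {
        if (inXMLStr == null || inXMLStr.isEmpty()) {
            return false;
        }
        try {
            Validator validator = SCHEMA.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXParseException.
            validator.validate(new StreamSource(new StringReader(inXMLStr)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

Unlike a plain well-formedness check, this rejects documents that are valid XML but do not follow the molecules structure, such as a different root element or a molecule without a bonds element.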

