Get all XML from a raw text file?

Question

I have a log file and I need to write a program that extracts all the XML documents from it. The file looks like this:

text
text
xml
text
xml
text 
etc

Can you advise whether it is better to use a regexp or something else? Maybe it's possible to do it with dom4j?
If I try to use a regexp, I see a problem: the text parts also contain <> tags.

Update 1: XML example

  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc
  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc

Thanks.

Answer 1

If your XML is always on one line, you can just iterate over the lines and check whether each one starts with <. If so, try to parse the whole line as DOM.

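import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlLineExtractor {
    public static void main(String[] args) throws Exception {
        String xml = "hello\n"
                + "this is some text\n"
                + "<foo>I am XML</foo>\n"
                + "<bar>me too!</bar>\n"
                + "foo is bar\n"
                + "<this is not valid XML\n"
                + "<foo><bar>so am I</bar></foo>\n";
        List<Document> docs = new ArrayList<Document>(); // the documents we can find
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        for (String line : xml.split("\n")) {
            if (line.startsWith("<")) {
                // Candidate line: keep it only if it parses as a complete document
                try {
                    ByteArrayInputStream bis =
                            new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
                    Document doc = docBuilder.parse(bis);
                    docs.add(doc);
                } catch (Exception e) {
                    System.out.println("Problem parsing line: `" + line + "` as XML");
                }
            } else {
                System.out.println("Discarding line: `" + line + "`");
            }
        }
        System.out.println("\nFound " + docs.size() + " XML documents.");
        // Serialize each document back to text, without the <?xml ...?> declaration
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        for (Document doc : docs) {
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(sw));
            System.out.println(sw.toString());
        }
    }
}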

Output:

Discarding line: `hello`
Discarding line: `this is some text`
Discarding line: `foo is bar`
Problem parsing line: `<this is not valid XML` as XML

Found 3 XML documents.
<foo>I am XML</foo>
<bar>me too!</bar>
<foo><bar>so am I</bar></foo>
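
The envelopes in the question's Update 1 span several lines, so the line-by-line check above would not find them as written. Here is a minimal sketch of one way to adapt the idea. It assumes every message begins on a line starting with `<soapenv:Envelope` and ends on a line containing `</soapenv:Envelope>` (both markers taken from the sample log). The class and method names (`EnvelopeExtractor`, `extractBlocks`, `parseBlocks`) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

public class EnvelopeExtractor {
    // Assumed markers, copied from the sample log in the question.
    static final String START = "<soapenv:Envelope";
    static final String END = "</soapenv:Envelope>";

    /** Collects every START..END run of lines as a candidate XML string. */
    static List<String> extractBlocks(String log) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an envelope
        for (String line : log.split("\\R")) {
            if (current == null && line.trim().startsWith(START)) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (line.contains(END)) {
                    blocks.add(current.toString());
                    current = null;
                }
            }
        }
        return blocks;
    }

    /** Parses the candidates and keeps only those that are well-formed XML. */
    static List<Document> parseBlocks(List<String> blocks) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        List<Document> docs = new ArrayList<>();
        for (String block : blocks) {
            try {
                byte[] bytes = block.getBytes(StandardCharsets.UTF_8);
                docs.add(builder.parse(new ByteArrayInputStream(bytes)));
            } catch (Exception e) {
                System.out.println("Skipping block that is not well-formed XML");
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String log = "  SOAP message:\n"
                + "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
                + "<soapenv:Body>\n"
                + " here is body part of valid xml\n"
                + "</soapenv:Body>\n"
                + "</soapenv:Envelope>\n"
                + "text,text,text,text\n"
                + "symbols etc\n";
        List<Document> docs = parseBlocks(extractBlocks(log));
        System.out.println("Found " + docs.size() + " XML documents.");
    }
}
```

Letting the parser decide what counts as XML means that stray <> tags in the text parts do no harm. A block that fails to parse is simply skipped. If the real log uses other root elements, the two markers would need to change accordingly.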

Answer 2

If each such part is on a separate line, it should be very simple:

s = s.replaceAll("(?m)^\\s*[^<].*\\n?", "");
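
One caveat with the pattern above: because `\\s*` can backtrack, a line that has leading whitespace before its `<` is also removed (the `[^<]` ends up matching one of the spaces). Below is a hedged sketch of a variant that uses a negative lookahead instead. It assumes `\n` line endings; `XmlLineFilter` and `keepXmlLines` are hypothetical names used only for illustration.

```java
public class XmlLineFilter {
    /** Removes every line whose first non-blank character is not '<'. */
    static String keepXmlLines(String s) {
        // (?m)            ^ matches at the start of each line
        // (?![ \\t]*<)    skip lines that start with '<' after optional spaces/tabs
        // .*\\n?          otherwise consume the line and its newline
        return s.replaceAll("(?m)^(?![ \\t]*<).*\\n?", "");
    }

    public static void main(String[] args) {
        String log = "hello\n"
                + "<foo>I am XML</foo>\n"
                + "foo is bar\n"
                + "  <bar>me too!</bar>\n";
        System.out.print(keepXmlLines(log));
    }
}
```

Like the original one-liner, this only filters lines. It does not check that what remains is well-formed XML, and for multi-line messages it would also drop element content that sits on its own line without a leading `<`.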

Solution 1
1 2012-11-26 13:40:30

Solution 2
1 Accepted 2012-11-26 14:01:47
