
Get all XMLs from a raw text file?

I have a log file and I need to write a program that extracts all the XML documents from it. The file looks like this:

text
text
xml
text
xml
text 
etc

Can you advise whether it is better to use a regexp or something else? Maybe it is possible to do this with dom4j?
If I try to use a regexp, the problem I see is that the text parts also contain <> tags.

Update 1: XML example

  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc
  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc

Thanks.

If your XML is always on one line, you can just iterate over the lines and check whether each one starts with <. If it does, try to parse the whole line as a DOM document.

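import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlLines {
    public static void main(String[] args) throws Exception {
        String xml = "hello\n" + //
                "this is some text\n" + //
                "<foo>I am XML</foo>\n" + //
                "<bar>me too!</bar>\n" + //
                "foo is bar\n" + //
                "<this is not valid XML\n" + //
                "<foo><bar>so am I</bar></foo>\n";
        List<Document> docs = new ArrayList<Document>(); // the documents we can find
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        for (String line : xml.split("\n")) {
            if (line.startsWith("<")) {
                try {
                    // Try to parse the whole line as one XML document.
                    ByteArrayInputStream bis =
                            new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
                    Document doc = docBuilder.parse(bis);
                    docs.add(doc);
                } catch (Exception e) {
                    System.out.println("Problem parsing line: `" + line + "` as XML");
                }
            } else {
                System.out.println("Discarding line: `" + line + "`");
            }
        }
        System.out.println("\nFound " + docs.size() + " XML documents.");
        // Serialize each document back to text, without the <?xml ...?> declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        for (Document doc : docs) {
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(sw));
            System.out.println(sw.toString());
        }
    }
}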

Output:

Discarding line: `hello`
Discarding line: `this is some text`
Discarding line: `foo is bar`
Problem parsing line: `<this is not valid XML` as XML

Found 3 XML documents.
<foo>I am XML</foo>
<bar>me too!</bar>
<foo><bar>so am I</bar></foo>
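
The SOAP sample in the question spans several lines, so a line-by-line check would not pick it up as a single document. Below is a minimal sketch of one possible way to handle that case, assuming every XML document in the log is a soapenv:Envelope that is not nested inside another one. The class name EnvelopeExtractor and the method extract are hypothetical names chosen for illustration. A regexp is used only to cut out candidate spans; a real XML parser then decides whether each span is well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EnvelopeExtractor {
    // Assumption: each document starts with <soapenv:Envelope ...> and ends at
    // the next </soapenv:Envelope>, as in the sample from the question.
    private static final Pattern ENVELOPE = Pattern.compile(
            "<soapenv:Envelope[\\s>].*?</soapenv:Envelope>", Pattern.DOTALL);

    public static List<Document> extract(String log) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the sample uses the soapenv prefix
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Cannot create XML parser", e);
        }
        List<Document> docs = new ArrayList<>();
        Matcher m = ENVELOPE.matcher(log);
        while (m.find()) {
            try {
                // Let the parser decide whether the candidate is well-formed.
                docs.add(builder.parse(new InputSource(new StringReader(m.group()))));
            } catch (SAXException | IOException e) {
                // Not well-formed XML: skip this candidate.
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String envelope =
                "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
                + "<soapenv:Body>\n here is body part of valid xml\n</soapenv:Body>\n"
                + "</soapenv:Envelope>\n";
        String log = "  SOAP message:\n" + envelope
                + "text,text <not xml> text\nsymbols etc\n"
                + "  SOAP message:\n" + envelope;
        System.out.println("Found " + extract(log).size() + " envelopes.");
    }
}
```

Because the pattern stops at the first closing tag it finds, an unclosed envelope would swallow the start of the next one and both would be lost; for a real log the root element name would also have to be adjusted.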

If each such part is on its own line, it should be quite simple: remove every line that does not start with <.

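// Remove every line whose first non-blank character is not '<'.
// The lookahead keeps indented XML lines and XML lines that follow a blank line.
s = s.replaceAll("(?m)^(?![ \\t]*<).*(?:\\r?\\n)?", "");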
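
As a rough illustration of this line-filtering idea, here is a small self-contained sketch. The class name NonXmlLineFilter, the method keepXmlLines, and the sample input are hypothetical. The pattern is a variant that uses a negative lookahead, on the assumption that indented XML lines and XML lines preceded by a blank line should be kept.

```java
class NonXmlLineFilter {
    // Drops every line whose first non-blank character is not '<'.
    // (?m) makes ^ match at the start of each line.
    public static String keepXmlLines(String s) {
        return s.replaceAll("(?m)^(?![ \\t]*<).*(?:\\r?\\n)?", "");
    }

    public static void main(String[] args) {
        String log = "hello\n"
                + "<foo>I am XML</foo>\n"
                + "\n"
                + "  <bar>me too!</bar>\n"
                + "foo is bar\n";
        System.out.print(keepXmlLines(log));
    }
}
```

This only looks at the first character of each line, so a line such as `<this is not valid XML` also survives; the remaining lines would still need to go through an XML parser to confirm they are well-formed.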
