Get all XMLs from a raw text file?
I have a log file, and I need to write a program that extracts all the XML from it. The file looks like this:
text
text
xml
text
xml
text
etc
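Would you advise using a regular expression, or is there something better? Could this perhaps be done with dom4j?
If I try a regular expression, I run into the problem that the text parts also contain <> tags.
Update 1: XML example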
SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc
SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc
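Thanks.
If your XML is always on a single line, you can iterate over the lines and check whether each one starts with <. If it does, try to parse the whole line as a DOM.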
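import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// The statements below assume an enclosing method declared with `throws Exception`.
String xml = "hello\n"
        + "this is some text\n"
        + "<foo>I am XML</foo>\n"
        + "<bar>me too!</bar>\n"
        + "foo is bar\n"
        + "<this is not valid XML\n"
        + "<foo><bar>so am I</bar></foo>\n";
List<Document> docs = new ArrayList<Document>(); // the documents we can find
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
for (String line : xml.split("\n")) {
    if (line.startsWith("<")) {
        try {
            // Use an explicit charset instead of the platform default.
            ByteArrayInputStream bis = new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
            Document doc = docBuilder.parse(bis);
            docs.add(doc);
        } catch (Exception e) {
            // Starts with '<' but is not well-formed XML.
            System.out.println("Problem parsing line: `" + line + "` as XML");
        }
    } else {
        System.out.println("Discarding line: `" + line + "`");
    }
}
System.out.println("\nFound " + docs.size() + " XML documents.");
// Serialize each parsed document back to a string, without the XML declaration.
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
for (Document doc : docs) {
    StringWriter sw = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(sw));
    String docAsXml = sw.toString();
    System.out.println(docAsXml);
}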
Output:
Discarding line: `hello`
Discarding line: `this is some text`
Discarding line: `foo is bar`
Problem parsing line: `<this is not valid XML` as XML
Found 3 XML documents.
<foo>I am XML</foo>
<bar>me too!</bar>
<foo><bar>so am I</bar></foo>
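The SOAP messages in the question span several lines, so a line-by-line parse will not pick them up as they stand. One possible adaptation is to collect lines from an opening tag to its closing tag and parse the collected block. This is only a sketch under assumptions: the class name `XmlBlockExtractor` is made up for illustration, and it assumes every message starts on a line beginning with `<soapenv:Envelope` and ends on a line ending with `</soapenv:Envelope>`, as in the sample log.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class XmlBlockExtractor {

    // Assumption: these markers delimit every XML message in the log.
    private static final String OPEN = "<soapenv:Envelope";
    private static final String CLOSE = "</soapenv:Envelope>";

    /** Returns the well-formed envelopes found in the log, in order of appearance. */
    static List<Document> extract(String log) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so the soapenv prefix is resolved
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        builder.setErrorHandler(new DefaultHandler()); // do not print parse errors to stderr

        List<Document> docs = new ArrayList<>();
        StringBuilder current = null; // non-null while we are inside an envelope
        for (String line : log.split("\\R")) {
            String trimmed = line.trim();
            if (current == null && trimmed.startsWith(OPEN)) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (trimmed.endsWith(CLOSE)) {
                    try {
                        docs.add(builder.parse(new InputSource(new StringReader(current.toString()))));
                    } catch (SAXException | IOException e) {
                        // Block was delimited correctly but is not well-formed: skip it.
                    }
                    current = null;
                }
            }
        }
        return docs;
    }
}
```

Calling `XmlBlockExtractor.extract(logText)` on the sample log should give one `Document` per envelope. Text lines between messages are ignored even if they contain `<` or `>`, because only the two marker strings start or end a block.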
If each such part is on its own line, it should be quite simple:
s = s.replaceAll("(?m)^(?![ \\t]*<).*\\R?", "");
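To make the line-filtering idea concrete, here is a small self-contained sketch. It uses a negative lookahead to decide which lines to drop. A pattern built on `[^<]` can also match a newline, so a blank line directly before an XML line could swallow that line; the lookahead avoids this. The class and method names are hypothetical, and treating an indented `<` as the start of an XML line is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class XmlLineFilter {

    // (?m) lets ^ match at the start of every line. The negative lookahead
    // rejects lines whose first non-blank character is '<', so the pattern
    // matches only the other lines, together with their line terminator.
    private static final Pattern NON_XML_LINE =
            Pattern.compile("(?m)^(?![ \\t]*<).*\\R?");

    /** Removes every line that does not start (after optional indentation) with '<'. */
    static String keepXmlLines(String s) {
        return NON_XML_LINE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String log = "hello\n<foo>I am XML</foo>\n\n  <bar/>\nfoo is bar <b>\n<baz/>\n";
        System.out.print(keepXmlLines(log));
    }
}
```

Plain text lines and blank lines are dropped, including text lines that contain `<` somewhere in the middle. This only checks the first character, so a line such as `<this is not valid XML` is kept; parsing each surviving line is still needed to confirm it is well-formed.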