
How to parse plain text file with occasional XML tags using Java and SAX?

I have a rather large log file from a server which contains plain text. The server logs everything it does, and occasionally it prints XML tags which I am interested in parsing. To give you an example:

-----------log file-------------
bla bla bla random text
<logMessage>test Message</logMessage>
some more random server output
<logMessage>some other message</logMessage>
bla bla bla
end of log file

I just want to extract the data from the <logMessage> tags and ignore the rest. I am using Java and SAX, but the SAX parser expects the file to be well-formed XML and cannot handle this type of file. Is there a way to tell SAX to ignore the fact that the file is not well-formed XML? What's the alternative? Read the file line by line and look for the tags? :(

For simplicity's sake I would opt for reading the file line by line and looking for the <logMessage> and </logMessage> tokens. Note that you can make a generic parser of that kind which takes a delegate parser and feeds it SAX-like events. (This may be useful depending on how much work it would otherwise be to rewrite your parsers, now that your SAX-based solution turns out not to work.)
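A minimal sketch of that line-by-line scan, assuming each element opens and closes on the same line and carries no attributes. The `LogElementHandler` interface and the `LogTagScanner` class are hypothetical names invented for this illustration; they are not part of SAX or any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical SAX-like callback interface (an assumption for this sketch).
interface LogElementHandler {
    void element(String name, String text);
}

class LogTagScanner {
    private final String tagName;
    private final String open;
    private final String close;

    LogTagScanner(String tagName) {
        this.tagName = tagName;
        this.open = "<" + tagName + ">";
        this.close = "</" + tagName + ">";
    }

    // Reads the source line by line and reports every open/close pair
    // that sits on a single line; everything else is ignored.
    void scan(Reader source, LogElementHandler handler) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            int from = 0;
            while (true) {
                int start = line.indexOf(open, from);
                if (start < 0) break;            // no (further) tag on this line
                int end = line.indexOf(close, start + open.length());
                if (end < 0) break;              // unterminated on this line: skipped
                handler.element(tagName, line.substring(start + open.length(), end));
                from = end + close.length();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String log = "bla bla bla random text\n"
                + "<logMessage>test Message</logMessage>\n"
                + "some more random server output\n"
                + "<logMessage>some other message</logMessage>\n"
                + "bla bla bla\n";
        List<String> messages = new ArrayList<>();
        new LogTagScanner("logMessage")
                .scan(new StringReader(log), (name, text) -> messages.add(text));
        System.out.println(messages); // prints [test Message, some other message]
    }
}
```

Note that the text is handed to the delegate raw: entity references such as &amp; are not decoded, which a real XML parser would do for you.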

EDIT: The delegate approach is also useful if you are interested in more than one kind of element. If these happen to have complex (embedded) XML hierarchies, you could even collect all the characters between the opening and closing tokens into a buffer, then feed that buffer to a real SAX parser. This would be overkill in most cases, but again, if you have logs which essentially contain XML dumps, it might be more suitable than trying to parse it all yourself.
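The buffering idea could look roughly like the following sketch. It assumes the outer element has no attributes and is never nested inside itself (the first closing token ends the fragment). `EmbeddedXmlFeeder` is a made-up name; the parsing itself uses the standard JAXP classes `SAXParserFactory` and `DefaultHandler`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EmbeddedXmlFeeder {
    // Collects everything from <tagName> to </tagName>, possibly across
    // several lines, and hands each collected fragment to a real SAX parser.
    static void feed(Reader source, String tagName, DefaultHandler handler)
            throws IOException, SAXException, ParserConfigurationException {
        String open = "<" + tagName + ">";
        String close = "</" + tagName + ">";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        BufferedReader reader = new BufferedReader(source);
        StringBuilder fragment = null; // non-null while inside an element
        String line;
        while ((line = reader.readLine()) != null) {
            int pos = 0;
            while (true) {
                if (fragment == null) {
                    int start = line.indexOf(open, pos);
                    if (start < 0) break;        // plain text: ignore rest of line
                    fragment = new StringBuilder();
                    pos = start;
                }
                int end = line.indexOf(close, pos);
                if (end < 0) {                   // element continues on the next line
                    fragment.append(line, pos, line.length()).append('\n');
                    break;
                }
                fragment.append(line, pos, end + close.length());
                // A fresh parser per fragment keeps the sketch simple.
                factory.newSAXParser().parse(
                        new InputSource(new StringReader(fragment.toString())), handler);
                fragment = null;
                pos = end + close.length();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String log = "server started\n"
                + "<logMessage><level>WARN</level>\n"
                + "<text>disk almost full</text></logMessage>\n"
                + "more output\n";
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!qName.equals("logMessage")) seen.add(qName + "=" + text.toString().trim());
            }
        };
        feed(new StringReader(log), "logMessage", handler);
        System.out.println(seen);
    }
}
```

Because every fragment is parsed as its own document, the handler receives a startDocument/endDocument pair per fragment, and a malformed fragment raises a SAXException for that fragment only if you catch it around the parse call.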

I don't think straight XML parsing is appropriate for this sort of file. If every XML snippet is contained in a single line (opening and closing tags on the same line), then the simplest approach is to read the file line by line, check for the presence of tags, and skip the non-XML lines. Once the non-XML lines are skipped, you could pass the remaining stream to a SAX parser, or just use a regexp on a line-by-line basis.

Essentially, the above approach is identical to grepping the file first to keep only the XML tags, then wrapping the result in a root element to make it well-formed XML, and parsing that.
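As a rough illustration of the grep-then-wrap idea, here is a sketch that keeps only the <logMessage> snippets with a regexp, wraps them in a synthetic root element, and runs an ordinary SAX handler over the result. The root name `log` and the class name `GrepThenParse` are assumptions made for the example, and it builds the XML in memory, which may not suit a very large log.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class GrepThenParse {
    // Matches one complete snippet that opens and closes on the same line.
    private static final Pattern SNIPPET =
            Pattern.compile("<logMessage>.*?</logMessage>");

    // Keeps only the matching snippets and wraps them in a synthetic root.
    static String toWellFormedXml(Iterable<String> lines) {
        StringBuilder xml = new StringBuilder("<log>");
        for (String line : lines) {
            Matcher m = SNIPPET.matcher(line);
            while (m.find()) {
                xml.append(m.group());
            }
        }
        return xml.append("</log>").toString();
    }

    // Parses the wrapped XML with SAX and returns the text of each <logMessage>.
    static List<String> messages(String xml)
            throws IOException, SAXException, ParserConfigurationException {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null while inside <logMessage>
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("logMessage")) current = new StringBuilder();
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) current.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("logMessage")) {
                    result.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = List.of(
                "bla bla bla random text",
                "<logMessage>test Message</logMessage>",
                "some more random server output",
                "<logMessage>some other message</logMessage>",
                "bla bla bla");
        System.out.println(messages(toWellFormedXml(lines)));
        // prints [test Message, some other message]
    }
}
```

The benefit over plain string matching is that an existing SAX handler can be reused unchanged and entity references are decoded by the parser; the cost is that a single malformed snippet makes the whole wrapped document fail to parse.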
