
Parsing mixed data with Java

I'm fairly new to Java and have flipped between about six different stream and Scanner combinations without finding one that does everything I need. I am trying to implement an algorithm to parse a file that follows a certain syntax. There are several places where I need to peek at the next character to see whether it is a parenthesis or a comma, and I also need to be able to read strings and decimal values. I had it working with a stream up to the point where I tried to read the double. The double is NOT in a binary format, so DataInputStream is not what I want.

I could use a Scanner for its nextFloat, but the problem with using a Scanner is that there are no real delimiters in the file: (test:1.234,rightTest:5.6789)

If I specify ( , : ) as delimiters for the Scanner, then I lose the ability to test for the existence of a delimiter (I think, because it seems to eat the delimiter). These blocks can be nested in each other in various ways, so I often need to test the next char to see if it's an opening parenthesis and then branch to different pieces of logic. I.e. it forms a tree (but please don't write code to parse a tree, because that is my homework assignment).

I could do away with the scanner and just go back to my original solution with a stream if I could only figure out how to parse the decimal value. I need something that does a "read until you find one of these characters" so that I can say stream.ReadUntil(",)"). The decimals are always followed by a comma or closing paren. As a hack I will probably just read one char at a time. This is the same thing I did to grab the string like "test" and "rightTest", and it felt really awful.

The only other option I know of is something with a string tokenizer, but my feeling from examples is that I'd have to read the entire file into a string to tokenize it, essentially defeating the purpose of using a stream. These files can be really big, and just as an exercise for myself I like to code in a way that doesn't bring everything into memory when that is unnecessary, even though for this assignment it doesn't really matter.

So essentially what I'm looking for is some help with the mechanics of the file IO: being able to peek at the next char so I can check for ( , : ) when necessary, and also being able to read a string up to a : and read a decimal value up to a , or )

Have you looked at PushbackReader from java.io? Peeking is one of its use cases. Below is a sample.

import java.io.PushbackReader;

PushbackReader pusher = new PushbackReader(reader);
int c = pusher.read(); // read() returns an int, or -1 at end of input
if (c != -1) {
    // code to work with the peeked character, e.g. compare (char) c with '('
    pusher.unread(c); // push the character back so the next read() returns it again
}
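
Building on the same idea, here is a minimal sketch of the "read until you find one of these characters" operation the question asks for. The class name PeekDemo and the helpers peek and readUntil are hypothetical names made up for this illustration, not part of java.io, and a StringReader stands in for the real file. It only shows the IO mechanics and deliberately does not parse the nested structure.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class PeekDemo {
    /** Returns the next character without consuming it, or -1 at end of input. */
    public static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // put it back so the next read() sees it again
        }
        return c;
    }

    /** Reads characters up to, but not including, the first one contained in stopChars. */
    public static String readUntil(PushbackReader in, String stopChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (stopChars.indexOf(c) >= 0) {
                in.unread(c); // leave the delimiter for the caller to inspect
                break;
            }
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a StringReader replaces the real file reader for the demo
        try (PushbackReader in = new PushbackReader(new StringReader("test:1.234,"))) {
            String name = readUntil(in, ":");
            in.read(); // consume the ':'
            double value = Double.parseDouble(readUntil(in, ",)"));
            System.out.println(name + " = " + value + ", next char is '" + (char) peek(in) + "'");
        }
    }
}
```

For a real file, wrapping the source as new PushbackReader(new BufferedReader(new FileReader(path))) should keep the one-character-at-a-time reads cheap, since the BufferedReader fetches from disk in larger chunks. The default PushbackReader holds a single pushed-back character, which is enough here because each helper unreads at most one character after reading it.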

Are streams and Scanner the only acceptable options? I would have used Pattern and Matcher from java.util.regex. For example, this snippet determines the charset of a given HTML page and encodes the rest of the content using that charset:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

BufferedReader in = new BufferedReader(new FileReader(new File("index.html")));
String inputLine;
String charset = null; // name of the detected charset, null until the meta tag is found
StringBuilder returnedContent = new StringBuilder();
Pattern charsetPattern = Pattern.compile(".*<meta.*content=\"text/html;.*charset=([A-Za-z0-9\\-]*)\">.*");
while ((inputLine = in.readLine()) != null) {
    if (charset == null) {
        Matcher m = charsetPattern.matcher(inputLine);
        if (m.find()) {
            charset = m.group(1); // the expression inside the () is capture group 1
        }
    }
    returnedContent.append(new String(inputLine.getBytes(), charset != null ? charset : "UTF-8"));
}
in.close();

I know the example does not have much to do with your question; it just shows how handy regex is for this sort of problem: you read the file line by line (so no worries about your buffer) and match the text you need using regular expressions.
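
Applied to the format in the question, the same line-by-line regex idea might look like the sketch below. The class name RegexTokens and the method tokenize are hypothetical, and the pattern assumes that the only special characters are ( ) , : and that no name or number is split across a line break. It only splits a line into tokens and leaves building the tree to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokens {
    // Either a single punctuation character, or a run of anything that is not punctuation
    private static final Pattern TOKEN = Pattern.compile("[(),:]|[^(),:]+");

    /** Splits one line into punctuation tokens and the text between them. */
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each name, number and punctuation character becomes its own token
        for (String token : tokenize("(test:1.234,rightTest:5.6789)")) {
            System.out.println(token);
        }
    }
}
```

Because the delimiters are kept as tokens rather than discarded, the caller can still test whether the next token is an opening parenthesis or a comma, which is the part that gets lost when they are handed to Scanner as delimiters.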

You can also try mapping your file through a MappedByteBuffer to access it (roughly) as if it were a byte array in memory. And if you need to treat it as a character stream, you can wrap it in a CharBuffer. See, for example, here (Mapped Files section).
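
As a rough sketch of the mapped-file suggestion, the following maps a file read-only and decodes it into a CharBuffer. The class name MappedDemo and the method mapAsChars are hypothetical, the file is assumed to be UTF-8, and a temporary file stands in for the real input.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedDemo {
    /** Maps the whole file read-only and decodes it as UTF-8 (assumed encoding). */
    public static CharBuffer mapAsChars(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return StandardCharsets.UTF_8.decode(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a small temporary file replaces the real input file
        Path file = Files.createTempFile("mapped-demo", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "(test:1.234,rightTest:5.6789)".getBytes(StandardCharsets.UTF_8));

        CharBuffer chars = mapAsChars(file);
        char peeked = chars.get(chars.position()); // absolute get: does not advance the position
        char read = chars.get();                   // relative get: advances the position
        System.out.println("peeked '" + peeked + "', then read '" + read + "'");
    }
}
```

Note that decoding the whole mapping produces a CharBuffer on the heap, so this particular sketch does bring the decoded text into memory; a single mapping is also limited to Integer.MAX_VALUE bytes. For very large files, mapping and decoding one region at a time would be needed, which this sketch does not attempt.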
