Java: reading utf-8 file page by page using FileInputStream

I need some code that will allow me to read one page at a time from a UTF-8 file.

I've used this code:

 File fileDir = new File("DIRECTORY OF FILE");
 BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
 String str;
 while ((str = in.readLine()) != null) {
     System.out.println(str);
 }
 in.close();

After surrounding it with a try/catch block it runs, but it outputs the entire file! Is there a way to amend this code to display just ONE PAGE of text at a time? The file is in UTF-8 format, and after viewing it in Notepad++ I can see that it contains FF characters to denote the next page.

You will need to look for the form feed character by comparing to 0x0C.

For example:

int c = in.read();   // read() returns an int so that -1 can signal end of stream
while ( c != -1 ) {
   if ( c == 0x0C ) {
     // form feed
   } else {
     // handle displayable character
   }

   c = in.read();
}
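
The loop above can be wrapped into a small helper that returns one page per call. This is only a sketch: the class name FormFeedPages and the method nextPage are made up for illustration, and a StringReader stands in for the file. For the real file you would pass the BufferedReader from the question instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library.
class FormFeedPages {

    /**
     * Reads characters up to the next form feed (U+000C) or the end of input.
     * Returns null once the input is exhausted.
     */
    static String nextPage(Reader in) {
        try {
            int c = in.read();               // int, so -1 (end of input) is representable
            if (c == -1) {
                return null;                 // nothing left to read
            }
            StringBuilder page = new StringBuilder();
            while (c != -1 && c != 0x0C) {   // stop at form feed or end of input
                page.append((char) c);
                c = in.read();
            }
            return page.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample data standing in for the UTF-8 file.
        Reader in = new StringReader("page one\fpage two\fpage three");
        String page;
        while ((page = nextPage(in)) != null) {
            System.out.println("--- page ---");
            System.out.println(page);
        }
    }
}
```

Each call consumes exactly one page, so the caller decides when to ask for the next one (for example after a key press). A form feed at the very end of the input does not produce an extra empty page.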

EDIT: added an example using a Scanner, as suggested by Boris.

    // Pass the charset explicitly; otherwise Scanner uses the platform default.
    Scanner s = new Scanner(new File("a.txt"), "UTF-8").useDelimiter("\u000C");
    while ( s.hasNext() ) {
        String str = s.next();

        System.out.println( str );
    }
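
If only one particular page is wanted, the Scanner approach can stop as soon as that page has been read. The following is a rough sketch under a few assumptions: ScannerPages and page are hypothetical names, pages are counted from 0, and a StringReader replaces the file so the example runs on its own.

```java
import java.io.StringReader;
import java.util.Scanner;

// Hypothetical helper, not part of any library.
class ScannerPages {

    /**
     * Returns the page with the given index (counted from 0),
     * or null if the source has fewer pages.
     */
    static String page(Readable source, int pageIndex) {
        // The delimiter is a regex; a literal form feed matches itself.
        try (Scanner s = new Scanner(source).useDelimiter("\f")) {
            for (int i = 0; s.hasNext(); i++) {
                String text = s.next();
                if (i == pageIndex) {
                    return text;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // For a real file, a Scanner can be built with an explicit charset:
        //   new Scanner(new File("a.txt"), "UTF-8")
        String sample = "first page\fsecond page\fthird page";
        System.out.println(page(new StringReader(sample), 1));
    }
}
```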

You can use a regex to detect form-feed (page break) characters. Try something like this:

File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
Pattern pageBreak = Pattern.compile("(^.*)(\\f)(.*$)");   // java.util.regex.Pattern
while ((str = in.readLine()) != null) {
    Matcher match = pageBreak.matcher(str);                // java.util.regex.Matcher
    boolean pageBreakFound = match.find();
    if (pageBreakFound) {
        String textBeforePageBreak = match.group(1);
        // group(2) will contain the form feed character
        // group(3) will contain the text after the form feed character
        // Do whatever logic you want now that you know you hit a page boundary
    }
    System.out.println(str);
}

in.close();

The parentheses around portions of the regex denote capture groups, which get recorded in the Matcher object. The \\f matches the form feed character.

Edit: Apologies, I originally read this as a C# question rather than Java; the core concept is the same in both. Here's the regex documentation for Java: http://docs.oracle.com/javase/tutorial/essential/regex/
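
To make the capture-group idea concrete in Java, here is a minimal, self-contained sketch. The class and method names are invented for illustration. It also uses a slightly different pattern from the one above: a reluctant first group, so that a line containing several form feeds is split at the first one rather than the last.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class PageBreakGroups {

    // group(1): text before the first form feed, group(2): text after it.
    private static final Pattern PAGE_BREAK = Pattern.compile("^(.*?)\\f(.*)$");

    /** Splits a line at its first form feed; returns null if the line has none. */
    static String[] splitAtPageBreak(String line) {
        Matcher m = PAGE_BREAK.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtPageBreak("end of page 1\fstart of page 2");
        System.out.println("before: " + parts[0]);
        System.out.println("after:  " + parts[1]);
    }
}
```

The text in the second element may itself contain further form feeds, so a caller would typically loop until splitAtPageBreak returns null.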

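If the file is valid UTF-8 and the pages are separated by the character U+00FF, that is (char) 0xFF, "\u00FF", 'ÿ', then a BufferedReader will do. If the separator is a raw byte 0xFF instead, there is a problem: the byte 0xFF never occurs in valid UTF-8, so the decoder cannot turn it into a character. (If the "FF" shown in Notepad++ is the form feed control character, U+000C, use '\f' in place of '\u00FF' below.)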

int soughtPageno = ...; // Counted from 0
int currentPageno = 0;
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), StandardCharsets.UTF_8))) {
    String str;
    while ((str = in.readLine()) != null && currentPageno <= soughtPageno) {
        // Every U+00FF found in the line ends the current page.
        for (int pos = str.indexOf('\u00FF'); pos >= 0; pos = str.indexOf('\u00FF')) {
            if (currentPageno == soughtPageno) {
                // Print the rest of the sought page and stop.
                System.out.println(str.substring(0, pos));
                ++currentPageno;
                break;
            }
            ++currentPageno;
            str = str.substring(pos + 1);
        }
        if (currentPageno == soughtPageno) {
            System.out.println(str);
        }
    }
}
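
The difference between the character U+00FF and the raw byte 0xFF can be checked directly. The sketch below (class and method names are invented for illustration) encodes the character to UTF-8 and decodes a lone 0xFF byte with the standard String API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf8FfCheck {

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        // Malformed input is replaced with U+FFFD by this constructor.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The character U+00FF is stored as two bytes in UTF-8.
        for (byte b : encode("\u00FF")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();

        // A lone byte 0xFF is not valid UTF-8 and decodes to the replacement character.
        String decoded = decode(new byte[] { (byte) 0xFF });
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));
    }
}
```

So a reader that decodes UTF-8 sees 'ÿ' only if the file contains the two-byte sequence C3 BF. A raw 0xFF separator has to be handled at the byte level, before decoding.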

For a byte 0xFF (wrong, hacked UTF-8) use a wrapping InputStream between FileInputStream and the reader:

class PageInputStream extends InputStream {
   private final InputStream in;
   private int pageno;            // pages still to skip before output starts
   private boolean eof = false;

   PageInputStream(InputStream in, int pageno) {
       this.in = in;
       this.pageno = pageno;
   }

   @Override
   public int read() throws IOException {
       if (eof) {
           return -1;
       }
       // Skip whole pages until the sought page is reached.
       while (pageno > 0) {
           int c = in.read();
           if (c == 0xFF) {
               --pageno;
           } else if (c == -1) {
               eof = true;
               in.close();
               return -1;
           }
       }
       int c = in.read();
       if (c == 0xFF) {
           // The end of the sought page is reported as end of stream.
           c = -1;
           eof = true;
           in.close();
       }
       return c;
   }
}

Take this as an example; a bit more work remains to be done.
