
Reading a file using Java Scanner

One of the lines in a Java file I'm trying to understand is shown below.

return new Scanner(file).useDelimiter("\\Z").next();

The expression is expected to return everything up to "the end of the input but for the final terminator, if any", as per the java.util.regex.Pattern documentation. But what actually happens is that it returns only the first 1024 characters of the file. Is this a limitation imposed by the regex Pattern matcher? Can it be overcome? For now I'm going ahead with a FileReader, but I would like to know the reason for this behaviour.

I couldn't reproduce this myself, but I think I can shed some light on what is going on.

Internally, the Scanner uses a character buffer of 1024 characters. By default it reads up to 1024 characters from your Readable, if possible, and then applies the pattern.

The problem is in your pattern: it will always match the end of the input, but that doesn't mean the end of your input stream/data. When Java applies your pattern to the buffered data, it tries to find the first occurrence of the end of input. Since 1024 characters are in the buffer, the matching engine treats position 1024 as the first match of the delimiter, and everything before it is returned as the first token.

For that reason, I don't think the end-of-input anchor is valid for use in the Scanner. It could be reading from an infinite stream, after all.
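A small experiment along the lines of the reproduction attempt mentioned above can be sketched as follows. It feeds the Scanner more than 1024 characters and reports how long the first token is. The class name `ScannerBufferCheck`, the helper `repeat`, and the use of an in-memory `StringReader` in place of a file are assumptions made for illustration; the result may depend on the JDK in use, so treat it as a check to run rather than a proof.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question.
final class ScannerBufferCheck {

    // Reads the first token using the same "\\Z" delimiter as the question.
    static String readAll(Reader source) {
        Scanner scanner = new Scanner(source).useDelimiter("\\Z");
        // hasNext() is false for empty input, so fall back to "".
        return scanner.hasNext() ? scanner.next() : "";
    }

    // Builds a string consisting of `count` copies of `c`.
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Deliberately longer than the 1024-character buffer discussed above.
        String input = repeat('x', 5000);
        String token = readAll(new StringReader(input));
        System.out.println("input: " + input.length()
                + " chars, first token: " + token.length() + " chars");
    }
}
```

If the first token turns out shorter than the input, the truncation is reproducible on that setup; if it has the same length, the cause is more likely to lie elsewhere, for example in how the file's bytes are decoded.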

Try wrapping the file object in a FileInputStream.
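A minimal sketch of that suggestion, assuming the same "\\Z" delimiter as in the question. The class name `WrappedStreamRead` and its methods are hypothetical names chosen for this example; note that `new Scanner(InputStream)` decodes bytes with the platform default charset.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical helper class for illustration only.
final class WrappedStreamRead {

    // Reads everything up to the end of the stream as one token.
    static String readAll(InputStream in) {
        Scanner scanner = new Scanner(in).useDelimiter("\\Z");
        // An empty stream has no token, so return "" instead of throwing.
        return scanner.hasNext() ? scanner.next() : "";
    }

    // Wraps the File in a FileInputStream before handing it to the Scanner.
    static String readAll(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return readAll(in);
        }
    }
}
```

Calling `WrappedStreamRead.readAll(file)` then takes the place of `new Scanner(file).useDelimiter("\\Z").next()`.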

Scanner is intended to read multiple primitives from a file. It really isn't intended to read an entire file.

If you don't want to include third-party libraries, you're better off looping over a BufferedReader that wraps a FileReader / InputStreamReader for text, or looping over a FileInputStream for binary data.
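The text case can be sketched roughly as below. The class name `BufferedRead`, the 8192-character chunk size, and the choice to rethrow `IOException` as `UncheckedIOException` are assumptions made for this example, not something the answer prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper class for illustration only.
final class BufferedRead {

    // Reads all characters from the given Reader by looping over a BufferedReader.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        char[] chunk = new char[8192]; // arbitrary chunk size
        try (BufferedReader reader = new BufferedReader(source)) {
            int count;
            // read() returns -1 once the end of the stream is reached.
            while ((count = reader.read(chunk)) != -1) {
                text.append(chunk, 0, count);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the call site short in this sketch.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For a file, the Reader could be built as `new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)`, which also makes the charset explicit.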

If you're OK using a third-party library, Apache commons-io has a FileUtils class that contains the static methods readFileToString and readLines for text, and readFileToByteArray for binary data.

You can use the Scanner class; just specify a charset when opening the scanner, i.e.:

Scanner sc = new Scanner(file, "ISO-8859-1");

Java converts the bytes read from the file into characters using the specified charset, or the default one (taken from the underlying OS) if none is given. It is still not clear to me why Scanner reads only 1024 bytes with the default charset, whilst with another one it reaches the end of the file. Anyway, it works fine!
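Put together with the delimiter from the question, the charset approach might look like the sketch below. The class name `CharsetRead` and its methods are hypothetical. Scanner does not propagate an IOException raised by its underlying source; `ioException()` returns the last one, so checking it after the read may help diagnose a token that ends early.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical helper class for illustration only.
final class CharsetRead {

    // Reads the whole stream as one token, decoding with the named charset.
    static String readAll(InputStream in, String charsetName) {
        Scanner sc = new Scanner(in, charsetName).useDelimiter("\\Z");
        String text = sc.hasNext() ? sc.next() : "";
        // Surface any I/O error the Scanner recorded instead of reporting.
        IOException recorded = sc.ioException();
        if (recorded != null) {
            throw new UncheckedIOException(recorded);
        }
        return text;
    }

    // Convenience overload for files; the stream is closed when done.
    static String readAll(File file, String charsetName) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return readAll(in, charsetName);
        }
    }
}
```

For example, `CharsetRead.readAll(file, "ISO-8859-1")` mirrors the constructor call shown above.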
