
Java: scanning a file, but starting at a specific line index?

Original post 2014-10-14 20:47:41 4 3 java/ parsing/ text-files/ java.util.scanner

I need to scan through newline-delimited text files with potentially over a million lines apiece. Due to webserver limitations, the only way to do this reasonably is to break the process up into smaller scanning chunks.

One way to do this that I've been able to find is using a Scanner and skipping lines until you reach your desired line index... but this has a less than desirable amount of overhead for numerous scanning visits to files with hundreds of thousands of lines.

RandomAccessFile.skipBytes() and InputStream.skip() both allow seeking, but the distance is measured in bytes, and I cannot guarantee that every line will have the same number of bytes. Is there any way to skip several lines based on a delimiter rather than by bytes?

Or is there any other way to pull this off?

3 answers

If you want to start at a particular line, you have to count newline characters. There's no way to do this other than some form of linear scan. A newline character is not "special" from a file system point of view.

I've had poor experience with the performance of Scanner though. I think your best bet is to use a BufferedReader with a large buffer.
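For what it's worth, here is a minimal sketch of that BufferedReader approach. The class name LineChunkReader, the method readChunk, and the 1 MiB buffer size are made up for illustration, and UTF-8 is assumed as the file encoding. It is still a linear scan; it just avoids Scanner's overhead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class LineChunkReader {
    // Hypothetical buffer size (1 MiB); tune it for your server.
    private static final int BUFFER_SIZE = 1 << 20;

    /** Returns up to maxLines lines, starting at the zero-based line index startLine. */
    static List<String> readChunk(Path file, long startLine, int maxLines) throws IOException {
        List<String> chunk = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8),
                BUFFER_SIZE)) {
            // Skip the lines before startLine; this part is unavoidably linear.
            for (long i = 0; i < startLine; i++) {
                if (reader.readLine() == null) {
                    return chunk; // the file has fewer than startLine lines
                }
            }
            // Collect the requested chunk, stopping early at end of file.
            String line;
            while (chunk.size() < maxLines && (line = reader.readLine()) != null) {
                chunk.add(line);
            }
        }
        return chunk;
    }
}
```

Calling readChunk(path, 500000, 1000) would return lines 500000 through 500999 if the file is long enough, and a shorter (possibly empty) list otherwise.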

If you're using the same file over and over again, you should create an index of line offsets so you can quickly seek to a given line.
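A rough sketch of such an index, in case it helps. The names LineOffsetIndex, buildIndex and readLines are invented here, and the code assumes lines end in "\n" (or "\r\n") and that the file is UTF-8, where the byte 0x0A never occurs inside a multi-byte character. The index costs one full pass over the file; after that, each visit is a single seek.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineOffsetIndex {
    /** One linear pass: element i of the result is the byte offset where line i starts. */
    static long[] buildIndex(Path file) throws IOException {
        long[] offsets = new long[1024];
        int count = 0;
        long position = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), 1 << 16)) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (count == offsets.length) {
                        offsets = Arrays.copyOf(offsets, count * 2); // grow the array
                    }
                    offsets[count++] = position;
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true; // the next byte, if any, starts a new line
                }
                position++;
            }
        }
        return Arrays.copyOf(offsets, count);
    }

    /** Seeks straight to line startLine (zero-based) and reads up to maxLines lines. */
    static List<String> readLines(Path file, long[] offsets, int startLine, int maxLines)
            throws IOException {
        List<String> lines = new ArrayList<>();
        if (startLine < 0 || startLine >= offsets.length) {
            return lines;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offsets[startLine]);
            // The channel shares the file pointer, so reading starts at the seek position.
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    Channels.newInputStream(raf.getChannel()), StandardCharsets.UTF_8));
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

The long[] could be cached in memory or written next to the file so that later requests skip the scan entirely; it has to be rebuilt whenever the file changes.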

No. If your lines are variable-length, such that you need to analyze whitespace to determine where they end, then there is no alternative to scanning sequentially through the file. You can write your code in a way that disguises the fact that you're doing so, but that doesn't change the performance characteristics.

Why do you need to seek by lines? Grab a chunk of N bytes, do whatever processing you need up to the last newline. There will be some bytes left unprocessed, possibly zero. Use that number to step back, grab another chunk of N bytes, and so forth. (This is probably easier than gluing sections together.)

(I'm assuming that you're looking to do some sort of processing on the whole file. If you're trying to seek to some line k, let your processing step be just "count newlines".)
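Something like the following sketch, where the processing step is just "count newlines". The names ChunkedLineCounter, Result and processChunk are placeholders, and it assumes "\n" line endings and a chunk size larger than the longest line. Each call handles one chunk and returns the byte offset to resume from, which is all the state you need to carry between visits.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

final class ChunkedLineCounter {
    /** Outcome of one visit: lines handled, and the byte offset where the next visit starts. */
    static final class Result {
        final long lines;
        final long nextOffset;

        Result(long lines, long nextOffset) {
            this.lines = lines;
            this.nextOffset = nextOffset;
        }
    }

    static Result processChunk(Path file, long offset, int chunkSize) throws IOException {
        byte[] buf = new byte[chunkSize];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < chunkSize) {
                int r = raf.read(buf, n, chunkSize - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            boolean atEof = offset + n >= raf.length();
            int end = n;
            if (!atEof) {
                // Step back to just after the last newline; the rest goes to the next chunk.
                while (end > 0 && buf[end - 1] != '\n') {
                    end--;
                }
                if (end == 0) {
                    throw new IOException("No newline within " + chunkSize
                            + " bytes at offset " + offset);
                }
            }
            long lines = 0;
            for (int i = 0; i < end; i++) {
                if (buf[i] == '\n') {
                    lines++; // replace this with real per-line processing
                }
            }
            if (atEof && end > 0 && buf[end - 1] != '\n') {
                lines++; // final line without a trailing newline
            }
            return new Result(lines, offset + end);
        }
    }
}
```

Start with offset 0, pass each result's nextOffset into the next call, and stop once nextOffset reaches the file length.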