How can I efficiently index a file?
I am dealing with an application that needs to read an entire line of text at random from a series of potentially large text files (~3+ GB).
The lines can be of different lengths.
In order to reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store its position in a map in one pass, therefore producing an index of lineNo => position, i.e.:
// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
Whenever a new line is detected, the code increments lineCount and adds the fileStream.Position to the _lineNumberToFileStreamPositionMapping.
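The one-pass scan described above might look roughly like the following. This is only a sketch under stated assumptions, not the code from the linked answer: the names LineIndexBuilder and Build are hypothetical, lines are assumed to end in '\n', and the file is assumed to be UTF-8 or another encoding in which the byte 0x0A only ever means a line feed. Offsets are stored as long because positions in a file larger than 2 GB do not fit in an int.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, for illustration only.
static class LineIndexBuilder
{
    // Returns the byte offset at which each line starts (entry 0 is always 0).
    // Assumes '\n' terminates a line; a preceding '\r' stays part of the line.
    // An empty stream yields a single entry, i.e. one empty line.
    public static List<long> Build(Stream stream, int bufferSize = 64 * 1024)
    {
        var lineStarts = new List<long> { 0L };
        var buffer = new byte[bufferSize];
        long bufferStart = 0;   // offset in the stream of buffer[0]
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // The next line starts one byte after each '\n'.
                if (buffer[i] == (byte)'\n')
                    lineStarts.Add(bufferStart + i + 1);
            }
            bufferStart += read;
        }

        // If the data ends with '\n', the last recorded start equals the total
        // length and does not correspond to a real line, so drop it.
        if (lineStarts.Count > 1 && lineStarts[lineStarts.Count - 1] == bufferStart)
            lineStarts.RemoveAt(lineStarts.Count - 1);

        return lineStarts;
    }
}
```

Scanning raw bytes rather than decoding text means no strings are created during indexing, which is the point of the approach described in the question.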
We then use an API similar to:
public void ReadLine(int lineNumber)
{
var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
//... set the stream position, read the byte array, convert to string etc.
}
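To make the elided part of ReadLine concrete, here is one possible way to fill it in. It is a sketch, not the asker's actual implementation: the class name IndexedLineReader, the lineStarts parameter (a list of line-start byte offsets held as long), and the choice of UTF-8 are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
static class IndexedLineReader
{
    // lineStarts[i] is the byte offset where line i begins (0-based).
    // The end of the last line is taken from the stream length.
    public static string ReadLine(Stream stream, IReadOnlyList<long> lineStarts, int lineNumber)
    {
        long start = lineStarts[lineNumber];
        long end = lineNumber + 1 < lineStarts.Count
            ? lineStarts[lineNumber + 1]
            : stream.Length;

        // A single line is assumed to fit in an int-sized array.
        int length = checked((int)(end - start));
        var bytes = new byte[length];

        stream.Seek(start, SeekOrigin.Begin);
        int filled = 0;
        while (filled < length)
        {
            int n = stream.Read(bytes, filled, length - filled);
            if (n == 0) break;   // stream ended earlier than the index implied
            filled += n;
        }

        // Decode (UTF-8 assumed) and drop the trailing "\n" or "\r\n".
        return Encoding.UTF8.GetString(bytes, 0, filled).TrimEnd('\r', '\n');
    }
}
```

Because the next line's start offset doubles as this line's end, the index needs no separate length column.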
This solution currently performs well; however, there are two things I do not like:
I cannot use a plain array, therefore I have to use a List<int>, which has the potential inefficiency of resizing to double what I actually need;
Any ideas are very much appreciated.
Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.
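A minimal sketch of what that could look like follows. The helper name AddWithFixedGrowth and the default step of 1000 are assumptions for illustration; List<T>.Capacity itself is a real settable property.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
static class ChunkedGrowth
{
    // Adds a value, growing the backing array by a fixed step when it is full,
    // so that List<T> never gets the chance to double it on its own.
    public static void AddWithFixedGrowth(List<long> list, long value, int step = 1000)
    {
        if (list.Count == list.Capacity)
            list.Capacity = list.Count + step;   // reallocates and copies existing items

        list.Add(value);
    }
}
```

Each capacity change copies the existing elements, so a small step on a file with very many lines can cost noticeably more time than doubling would; the step size is a trade between wasted memory and copying. Calling TrimExcess() once the index is complete is another way to release unused capacity.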
If you want to trade performance for memory, you can do this: instead of storing the position of every line, store only the position of every 100th line (or some other interval). Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
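The sparse index described above could be sketched as follows. The names SparseLineIndex, Build, ReadLine, and stride are hypothetical, line numbers are 0-based, lines are assumed to end in '\n', and UTF-8 is assumed for decoding. The sketch reads byte by byte for clarity, so it assumes a stream with its own buffering (such as FileStream), and it does no bounds checking on the requested line number.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
static class SparseLineIndex
{
    // Records the byte offset of every stride-th line (lines 0, stride, 2*stride, ...).
    public static List<long> Build(Stream stream, int stride)
    {
        var checkpoints = new List<long> { 0L };
        stream.Seek(0, SeekOrigin.Begin);
        long position = 0;
        long linesSeen = 0;
        int b;

        while ((b = stream.ReadByte()) != -1)
        {
            position++;
            // After the k-th '\n', line k starts at the current position.
            if (b == '\n' && ++linesSeen % stride == 0)
                checkpoints.Add(position);
        }
        return checkpoints;
    }

    // Seeks to the nearest earlier checkpoint, skips lineNumber % stride lines,
    // then reads up to the next '\n' or the end of the stream.
    public static string ReadLine(Stream stream, List<long> checkpoints, int stride, int lineNumber)
    {
        stream.Seek(checkpoints[lineNumber / stride], SeekOrigin.Begin);

        int toSkip = lineNumber % stride;
        int b;
        while (toSkip > 0 && (b = stream.ReadByte()) != -1)
        {
            if (b == '\n') toSkip--;
        }

        var bytes = new List<byte>();
        while ((b = stream.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);

        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}
```

With a stride of 100, a request for line 253 seeks to checkpoint 2 (line 200) and skips 53 lines, matching the example in the answer. The index shrinks by a factor of the stride, at the cost of scanning up to stride - 1 extra lines per read.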