How can I efficiently index a file?

I am working on an application that needs to read arbitrary whole lines of text, by line number, from a series of potentially large text files (~3+ GB).

The lines can be of different lengths.

To reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at "Is there a better way to determine the number of lines in a large txt file (1-2 GB)?" to detect each new line and store its position in a map in a single pass, producing an index of lineNo => position, i.e.:

// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();

  1. Go through the entire file.
  2. When a new line is detected, increment lineCount and add the fileStream.Position to _lineNumberToFileStreamPositionMapping.
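
The two steps above can be sketched roughly as follows. This is only an illustration, not the code from the linked question: `LineIndexBuilder` and `Build` are hypothetical names, and it assumes an encoding in which the byte 0x0A always means a line feed (ASCII or UTF-8). It stores offsets as `long` rather than `int`, because `Stream.Position` is a `long` and offsets in a 3+ GB file exceed `int.MaxValue`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineIndexBuilder
{
    // Returns the byte offset at which each line starts;
    // element i is the start of line i (0-based).
    // Assumption: '\n' (0x0A) never occurs inside a multi-byte character,
    // which holds for ASCII and UTF-8 but not for UTF-16.
    public static List<long> Build(Stream stream, int bufferSize = 1 << 16)
    {
        var lineStarts = new List<long>();
        var buffer = new byte[bufferSize];
        long position = 0;        // absolute offset of buffer[0] in the stream
        bool atLineStart = true;  // true when the next byte begins a new line
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    lineStarts.Add(position + i);
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n')
                {
                    atLineStart = true;
                }
            }
            position += read;
        }
        return lineStarts;
    }
}
```

Scanning raw bytes in a reusable buffer avoids allocating a string per line, which is the GC concern mentioned above. A trailing newline at the end of the file does not produce an extra empty line in this sketch.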

We then use an API similar to:

public void ReadLine(int lineNumber)
{
     var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
     //... set the stream position, read the byte array, convert to string etc.
}
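
For illustration, the elided part (set the stream position, read the byte array, convert to string) might look something like the sketch below. The names `IndexedLineReader` and `lineStarts` are hypothetical, UTF-8 is an assumed encoding, and the index is assumed to hold the starting byte offset of every line as a `long`. The length of a line is taken from the start of the next line, or from the end of the stream for the last line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class IndexedLineReader
{
    // Reads line `lineNumber` (0-based) using a precomputed list of line-start offsets.
    public static string ReadLine(Stream stream, IReadOnlyList<long> lineStarts, int lineNumber)
    {
        long start = lineStarts[lineNumber];
        long end = lineNumber + 1 < lineStarts.Count
            ? lineStarts[lineNumber + 1]
            : stream.Length;

        // Assumption: a single line fits in an int-sized byte array.
        var buffer = new byte[checked((int)(end - start))];
        stream.Seek(start, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until done.
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }

        // Drop the line terminator ("\n" or "\r\n") before decoding.
        int length = total;
        if (length > 0 && buffer[length - 1] == (byte)'\n') length--;
        if (length > 0 && buffer[length - 1] == (byte)'\r') length--;

        return Encoding.UTF8.GetString(buffer, 0, length);
    }
}
```

Only one string is allocated per requested line, and nothing is read beyond the bytes of that line.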

This solution currently performs well; however, there are two things I do not like:

  1. Since I do not know the total number of lines in the file, I cannot preallocate an array, so I have to use a List<int>, which has the potential inefficiency of resizing to double what I actually need;
  2. Memory usage: for example, for a text file of ~1 GB with ~5 million lines of text, the index occupies ~150 MB. I would really like to decrease this as much as possible.

Any ideas are very much appreciated.

  1. Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.

  2. If you want to trade performance for memory, you can do this: instead of storing the position of every line, store only the position of every 100th (or so) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
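
A rough sketch combining both suggestions, under stated assumptions: `SparseLineIndex`, `stride`, and the growth step of 1024 are hypothetical choices for illustration, the encoding is assumed to be UTF-8 (or ASCII), and line numbers are 0-based. Only every `stride`-th line start is stored, and the list's `Capacity` is grown in fixed steps instead of doubling.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class SparseLineIndex
{
    private const int GrowthStep = 1024;  // arbitrary fixed capacity increment
    private readonly List<long> _checkpoints = new List<long>();
    private readonly int _stride;

    public int LineCount { get; private set; }

    public SparseLineIndex(Stream stream, int stride = 100)
    {
        _stride = stride;
        var buffer = new byte[1 << 16];
        long position = 0;        // absolute offset of buffer[0]
        bool atLineStart = true;  // true when the next byte begins a new line
        int read;

        stream.Seek(0, SeekOrigin.Begin);
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    if (LineCount % _stride == 0)
                    {
                        // Grow by a fixed step rather than letting List<T> double.
                        if (_checkpoints.Count == _checkpoints.Capacity)
                            _checkpoints.Capacity += GrowthStep;
                        _checkpoints.Add(position + i);
                    }
                    LineCount++;
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n') atLineStart = true;
            }
            position += read;
        }
        _checkpoints.Capacity = _checkpoints.Count;  // release unused slots
    }

    public string ReadLine(Stream stream, int lineNumber)
    {
        if (lineNumber < 0 || lineNumber >= LineCount)
            throw new ArgumentOutOfRangeException(nameof(lineNumber));

        // Jump to the nearest checkpoint at or before the line, then count forward.
        stream.Seek(_checkpoints[lineNumber / _stride], SeekOrigin.Begin);
        int toSkip = lineNumber % _stride;
        var line = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                if (toSkip == 0) break;  // end of the requested line
                toSkip--;                // end of a line being skipped
            }
            else if (toSkip == 0)
            {
                line.WriteByte((byte)b);
            }
        }

        byte[] bytes = line.ToArray();
        int length = bytes.Length;
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--;  // handle "\r\n"
        return Encoding.UTF8.GetString(bytes, 0, length);
    }
}
```

With a stride of 100 the index holds about 1% as many entries as a full index, at the cost of scanning up to 99 extra lines per lookup. Fixed-step growth wastes less memory than doubling but copies the list more often, so the step size is a trade-off to tune. `ReadByte` is used here for brevity; it relies on the stream's own buffering (as `FileStream` provides), and a block-based scan would likely be faster.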
