
C# code to perform Binary search in a very big text file

Is there a library that I can use to perform a binary search in a very big text file (which can be 10GB)?

The file is a sort of log file - every row starts with a date and time. Therefore the rows are ordered.

As the lines are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter, e.g. carriage return or line feed.

The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.

When you identify the start index of a record, check whether it is the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. E.g. if you have adjacent records of 100 bytes and 50 bytes respectively, jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.

You should find that you will reach your target pretty quickly.
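A minimal sketch of that approach, assuming '\n'-delimited, ASCII-compatible lines that each start with a fixed "yyyy-MM-dd HH:mm:ss" timestamp (the class and helper names are made up, not part of the question):

using System;
using System.Globalization;
using System.IO;
using System.Text;

static class LogBinarySearch
{
    // Returns the byte offset of the first line whose timestamp is >= target.
    public static long FindFirstLineAtOrAfter(string path, DateTime target)
    {
        using var stream = File.OpenRead(path);
        long lo = 0, hi = stream.Length;              // lo always sits on a line start
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;
            long lineStart = StartOfLineContaining(stream, mid);
            string line = ReadLineAt(stream, lineStart, out long nextLine);
            if (ParseTimestamp(line) < target)
                lo = nextLine;                        // answer lies after this line
            else
                hi = lineStart;                       // answer is this line or earlier
        }
        return lo;
    }

    // Seek backwards byte by byte to the character just after the previous '\n'.
    static long StartOfLineContaining(FileStream s, long pos)
    {
        while (pos > 0)
        {
            s.Seek(pos - 1, SeekOrigin.Begin);
            if (s.ReadByte() == '\n') return pos;
            pos--;
        }
        return 0;
    }

    // Read one line (ASCII assumed) and report where the next line begins.
    static string ReadLineAt(FileStream s, long start, out long nextLineStart)
    {
        s.Seek(start, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n') sb.Append((char)b);
        nextLineStart = s.Position;                   // byte after the '\n', or EOF
        return sb.ToString().TrimEnd('\r');
    }

    // Assumption: every line starts with a "yyyy-MM-dd HH:mm:ss" timestamp.
    static DateTime ParseTimestamp(string line) =>
        DateTime.ParseExact(line.Substring(0, 19), "yyyy-MM-dd HH:mm:ss",
                            CultureInfo.InvariantCulture);
}

Note that advancing lo to the start of the next line is exactly the "read on to the next record" trick described above; it guarantees the search makes progress even when halving keeps landing in the same record.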

I started to write the pseudo-code for how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search; it's really not complicated.

You won't find it in a library, for two reasons:

  1. It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
  2. Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message; a parsing sketch follows this list).
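For point 2, parsing that prefix is the first step of every comparison. A hedged sketch, assuming a hypothetical "[dd/MM/yyyy HH:mm:ss] message" layout (adjust the format string to whatever your log actually uses):

using System;
using System.Globalization;

static class TimestampParsing
{
    // Assumed layout: "[10/02/2001 10:35:02] My message".
    public static DateTime ParseLogTimestamp(string line)
    {
        int close = line.IndexOf(']');                // end of the bracketed prefix
        string stamp = line.Substring(1, close - 1);  // drop the '[' and ']'
        return DateTime.ParseExact(stamp, "dd/MM/yyyy HH:mm:ss",
                                   CultureInfo.InvariantCulture);
    }
}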

In summary: I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)

You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream for the file and use the Seek method to move around in it, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are all the same number of bytes; otherwise, you would jump in and out of the middle of lines.

Something like:

byte[] buffer = new byte[lineLength];
// Cast to long so the offset can't overflow Int32 on a multi-gigabyte file.
stream.Seek((long)lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);

If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:

  1. Scan the initial file and take the datetime part of each line plus its position in the original file (this is why the file has to be pretty static). Encode them somehow, for example as Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent, fixed-length "lines".

  2. Perform a binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.

  3. Read directly from the original file starting from the given location / read the given range.

You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality).

Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on querying.

Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you can speed the whole thing up by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
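A sketch of that index-file variant, under the same assumption of a fixed "yyyy-MM-dd HH:mm:ss" prefix on each data line (names are illustrative). Each index record is the zero-filled 10-digit offset just after an EOL mark, i.e. the offset where a line starts, so records are fixed-width and directly addressable:

using System;
using System.Globalization;
using System.IO;
using System.Text;

static class EolIndex
{
    const int RecordWidth = 11;   // 10 zero-filled digits + '\n' (covers ~9.3 GB)

    // One pass over the data file: record where each line starts.
    public static void Build(string dataPath, string indexPath)
    {
        using var data = File.OpenRead(dataPath);
        using var index = File.CreateText(indexPath);
        index.Write("0000000000\n");                  // line 0 starts at offset 0
        int b;
        while ((b = data.ReadByte()) != -1)
            if (b == '\n' && data.Position < data.Length)
                index.Write(data.Position.ToString("D10") + "\n");
    }

    // Binary search over fixed-width index records; each probe dereferences
    // into the data file and compares the timestamp found there.
    public static long FindFirstLineAtOrAfter(string dataPath, string indexPath,
                                              DateTime target)
    {
        using var data = File.OpenRead(dataPath);
        using var index = File.OpenRead(indexPath);
        long lo = 0, hi = index.Length / RecordWidth; // record numbers, range [lo, hi)
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;
            if (TimestampAt(data, ReadOffset(index, mid)) < target) lo = mid + 1;
            else hi = mid;
        }
        return lo < index.Length / RecordWidth ? ReadOffset(index, lo) : -1;
    }

    static long ReadOffset(FileStream index, long record)
    {
        var buf = new byte[10];
        index.Seek(record * RecordWidth, SeekOrigin.Begin);
        index.Read(buf, 0, buf.Length);
        return long.Parse(Encoding.ASCII.GetString(buf));
    }

    // Assumption: each data line starts with "yyyy-MM-dd HH:mm:ss".
    static DateTime TimestampAt(FileStream data, long offset)
    {
        var buf = new byte[19];
        data.Seek(offset, SeekOrigin.Begin);
        data.Read(buf, 0, buf.Length);
        return DateTime.ParseExact(Encoding.ASCII.GetString(buf),
            "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    }
}

Because the index records are append-only too, new log lines only require appending their offsets to the index.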

This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. How feasible that is really depends upon how long the lines of text are on average: given 1000 bytes per line, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.

So try this:

  1. Scan the file and store the ordinal offset of each line-feed in a List.
  2. Binary search the List with a custom comparer that seeks to the file offset and reads the data, as sketched below.

The List object has a BinarySearch method:

http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
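A sketch of that comparer trick, again assuming each line starts with a fixed 19-character "yyyy-MM-dd HH:mm:ss" timestamp (the format and all names below are illustrative):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

// Step 2's custom comparer: List<long>.BinarySearch calls Compare(list[mid], item),
// so we read the line at the offset held in the list and compare its timestamp
// against a target stored in the comparer; the searched "item" is a dummy.
class TimestampAtOffsetComparer : IComparer<long>
{
    readonly FileStream _data;
    readonly DateTime _target;

    public TimestampAtOffsetComparer(FileStream data, DateTime target)
    {
        _data = data;
        _target = target;
    }

    public int Compare(long lineOffset, long unusedItem)
    {
        var buf = new byte[19];                       // "yyyy-MM-dd HH:mm:ss" assumed
        _data.Seek(lineOffset, SeekOrigin.Begin);
        _data.Read(buf, 0, buf.Length);
        DateTime stamp = DateTime.ParseExact(Encoding.ASCII.GetString(buf),
            "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
        return stamp.CompareTo(_target);
    }
}

Usage, with step 1 building the offsets first:

using var stream = File.OpenRead(path);
var offsets = new List<long> { 0 };                  // line 0 starts at offset 0
int b;
while ((b = stream.ReadByte()) != -1)
    if (b == '\n' && stream.Position < stream.Length)
        offsets.Add(stream.Position);

int i = offsets.BinarySearch(0L, new TimestampAtOffsetComparer(stream, target));
if (i < 0) i = ~i;   // ~i is the index of the first line with a later timestamp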
