
Reading a large file into a Dictionary

I have a 1GB file containing pairs of string and long. What's the best way of reading it into a Dictionary, and how much memory would you say it requires?

The file has 62 million rows. I've managed to read it using 5.5GB of RAM.

Say 22 bytes of overhead per Dictionary entry, that's 1.5GB. A long is 8 bytes, that's 500MB. The average string length is 15 chars, each char 2 bytes, that's 2GB. The total is about 4GB, so where does the extra 1.5GB go?
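For reference, the same back-of-the-envelope arithmetic as a quick sketch (the constants are just the figures quoted above):

// Rough estimate from the figures above: 62 million rows, ~22 bytes of
// Dictionary overhead per entry, 8-byte longs, ~15-char strings at 2 bytes/char.
const long rows = 62_000_000;
long overhead = rows * 22;      // per-entry Dictionary overhead
long longs    = rows * 8;       // the long values
long chars    = rows * 15 * 2;  // the characters themselves
Console.WriteLine($"{(overhead + longs + chars) / 1e9:F1} GB"); // ~3.7 GB, roughly the 4GB estimate above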

The initial Dictionary allocation takes 256MB. I've noticed that every 10 million rows I read consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB; that's my missing 1.5GB, where does it go?

Thanks.

It's important to understand what's happening when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)

When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.

At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.

So there are two things that come into play in terms of memory utilization.

The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.

The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
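A small probe makes both effects visible; this is only a sketch (it assumes a 64-bit process with a few GB of headroom) that prints the managed heap size as entries are added, so the jumps caused by the periodic rebuilds stand out from the steady per-entry growth:

var d = new Dictionary<string, long>();
for (long i = 0; i < 20_000_000; i++)
{
    d[i.ToString()] = i;
    if (i % 1_000_000 == 0)
    {
        // GC.GetTotalMemory gives a rough view of the managed heap; the
        // occasional large jumps correspond to the bucket array being rebuilt.
        long mb = GC.GetTotalMemory(false) / (1024 * 1024);
        Console.WriteLine($"{i:N0} entries, ~{mb} MB");
    }
}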

You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to be adding, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
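For the first problem, a minimal sketch of that constructor overload (the 62,000,000 here is just the row count mentioned in the question; use whatever your real count is):

// Pre-sizing the Dictionary avoids all of the intermediate rebuild/copy steps
// while the file is being loaded.
var dictionary = new Dictionary<string, long>(62_000_000);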

Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.

There is a simple way to decide which parts are best kept in memory:

Put the data into a database.

A real one, like MSSQL Express, MySQL, or Oracle XE (all are free).

Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.

Maybe you can convert that 1GB file into a SQLite database with two columns, key and value. Then create an index on the key column. After that you can query that database to get the values of the keys you provide.
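A minimal sketch of that conversion, assuming the Microsoft.Data.Sqlite package and a plain text file of key=value lines (the file names and the line format are assumptions):

using System;
using System.IO;
using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=pairs.db");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText =
        "CREATE TABLE IF NOT EXISTS pairs (key TEXT, value INTEGER);" +
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_pairs_key ON pairs(key);";
    create.ExecuteNonQuery();
}

// Load everything inside one transaction; per-row transactions would be very slow.
using (var transaction = connection.BeginTransaction())
using (var insert = connection.CreateCommand())
{
    insert.Transaction = transaction;
    insert.CommandText = "INSERT INTO pairs (key, value) VALUES ($k, $v)";
    var k = insert.Parameters.Add("$k", SqliteType.Text);
    var v = insert.Parameters.Add("$v", SqliteType.Integer);

    foreach (var line in File.ReadLines("data.txt"))   // assumed "key=value" lines
    {
        var bits = line.Split('=');
        k.Value = bits[0];
        v.Value = long.Parse(bits[1]);
        insert.ExecuteNonQuery();
    }
    transaction.Commit();
}

// Point lookups afterwards only touch the index, not the whole file.
using (var lookup = connection.CreateCommand())
{
    lookup.CommandText = "SELECT value FROM pairs WHERE key = $k";
    lookup.Parameters.AddWithValue("$k", "someKey");   // hypothetical key
    Console.WriteLine((long)lookup.ExecuteScalar());   // assumes the key exists
}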

Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)

The main problem is that there is a huge amount of data that presumably needs to be accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?

In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much as possible on my expected access pattern) and then keep track of accesses to elements by the time they were last accessed. If I hit something that wasn't in the cache, then it would be loaded and would replace the oldest item in the cache.

This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
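A minimal sketch of such a sliding (LRU-style) cache; the capacity and the load delegate (whatever fetches a missing entry from disk or a database) are assumptions for illustration:

using System;
using System.Collections.Generic;

class SlidingCache<TKey, TValue>
{
    private readonly int capacity;
    private readonly Func<TKey, TValue> load;   // fetches an entry that isn't cached
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> map;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order;   // most recently used first

    public SlidingCache(int capacity, Func<TKey, TValue> load)
    {
        this.capacity = capacity;
        this.load = load;
        this.map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>(capacity);
        this.order = new LinkedList<KeyValuePair<TKey, TValue>>();
    }

    public TValue Get(TKey key)
    {
        if (map.TryGetValue(key, out var node))
        {
            order.Remove(node);       // cache hit: move to the front
            order.AddFirst(node);
            return node.Value.Value;
        }

        TValue value = load(key);     // cache miss: hit the slow store
        if (map.Count >= capacity)
        {
            var oldest = order.Last;  // evict the least recently used entry
            order.RemoveLast();
            map.Remove(oldest.Value.Key);
        }
        map[key] = order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        return value;
    }
}

Usage would be something like new SlidingCache<string, long>(1_000_000, key => LookupOnDisk(key)), where LookupOnDisk is whatever slow path (file seek, database query) you end up with.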

In any case, without knowing a little more about the problem, this is merely a 'general solution'.

It may be that just keeping it in a local instance of a SQL DB would be sufficient :)

You'll need to specify the file format, but if it's just something like name=value, I'd do:

Dictionary<string,long> dictionary = new Dictionary<string,long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}

Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc.?

Are you using 64-bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)

The amount of memory required will depend on the length of the strings, the number of entries, etc.

I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.

Since you want to store it in a dict, I assume you need it for fast lookup? You have not clarified which one should be the key, though.

Let's hope you want to use the long values as keys. Then try this:

Allocate a buffer that's as big as the file. Read the file into that buffer.

Then create a dictionary with the long values (32-bit values, I guess?) as keys, with their values being a 32-bit value as well.

Now browse the data in the buffer like this: find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.

That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.

At least with C++, this would be a rather memory-efficient way, I think.
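In C#, the same idea might look roughly like this; it's a sketch that assumes a plain text file of key=value lines where the key is the long, and it keeps only byte offsets in the dictionary while the text itself stays in the single buffer:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

byte[] buffer = File.ReadAllBytes("data.txt");   // assumed "key=value" lines
var index = new Dictionary<long, (int Offset, int Length)>();

int lineStart = 0;
for (int i = 0; i <= buffer.Length; i++)
{
    if (i == buffer.Length || buffer[i] == (byte)'\n')
    {
        int end = i;
        if (end > lineStart && buffer[end - 1] == (byte)'\r') end--;   // tolerate CRLF
        if (end > lineStart)
        {
            int eq = Array.IndexOf(buffer, (byte)'=', lineStart, end - lineStart);
            long key = long.Parse(Encoding.ASCII.GetString(buffer, lineStart, eq - lineStart));
            index[key] = (eq + 1, end - eq - 1);   // where the value's bytes live in the buffer
        }
        lineStart = i + 1;
    }
}

// Decode a value only when it is actually requested.
string GetValue(long key)
{
    var (offset, length) = index[key];
    return Encoding.UTF8.GetString(buffer, offset, length);
}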

Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.

Perhaps you can memory-map the contents of this (more efficient format) file, then have minimal RAM usage and demand-loading, which may be a good trade-off between accessing the file directly on disk all the time and loading the whole thing into a big byte array.
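As a rough sketch of the memory-mapping idea, assuming the data has already been rewritten as a binary file of fixed-size records (the file name and the record layout are assumptions):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// The OS pages parts of the file in on demand; nothing is loaded up front.
using var mmf = MemoryMappedFile.CreateFromFile("pairs.bin", FileMode.Open);
using var accessor = mmf.CreateViewAccessor();

const int recordSize = 16;   // assumed layout: 8-byte key followed by 8-byte value

// Read record n without pulling the rest of the file into memory.
long ReadKey(long n) => accessor.ReadInt64(n * recordSize);
long ReadValue(long n) => accessor.ReadInt64(n * recordSize + 8);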

Loading a 1 GB file into memory at once doesn't sound like a good idea to me. I'd virtualize access to the file by loading it in smaller chunks, only when a specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...

Don't read a 1GB file into memory. Even though you've got 8GB of physical RAM, you can still run into plenty of problems. (Based on personal experience.)

I don't know what you need to do, but find a workaround: read the file partially and process it in pieces. If that doesn't work for you, then consider using a database.

If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.

Alternatively, you may roll your own solution using some database techniques.

Suppose your original data file looks like this (dots indicate that string lengths vary):

[key2][value2...][key1][value1..][key3][value3....]

Split it into an index file and a values file.

Values file:

[value1..][value2...][value3....]

Index file:

[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]

Records in the index file are fixed-size key->value-offset pairs and are ordered by key. Strings in the values file are also ordered by key.

To get the value for key(N) you would binary-search for the key(N) record in the index file, then read the string from the values file starting at value(N)-offset and ending before value(N+1)-offset.

The index file can be read into an in-memory array of structs (less overhead and much more predictable memory consumption than a Dictionary), or you can do the search directly on disk.
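A minimal sketch of the on-disk binary search, assuming fixed-size index records of an 8-byte key followed by an 8-byte value offset (the file name and the record layout are assumptions; reading the string itself would then use the returned offset and the next record's offset, as described above):

using System;
using System.IO;

// Binary search over the sorted, fixed-size records of the index file.
// Returns the value's offset in the values file, or null if the key is absent.
long? FindValueOffset(string indexPath, long key)
{
    const int recordSize = 16;   // 8-byte key + 8-byte value offset
    using var indexFile = new FileStream(indexPath, FileMode.Open, FileAccess.Read);
    using var reader = new BinaryReader(indexFile);

    long lo = 0, hi = indexFile.Length / recordSize - 1;
    while (lo <= hi)
    {
        long mid = lo + (hi - lo) / 2;
        indexFile.Position = mid * recordSize;
        long candidate = reader.ReadInt64();
        if (candidate == key)
            return reader.ReadInt64();   // the value offset follows the key
        if (candidate < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return null;
}

With fixed-length string keys the comparison would simply be done on the key bytes instead of on a long.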
