
Out of memory exception while loading a huge DBpedia dump

I am trying to load a large dump of DBpedia data into my C# application, but I get an out of memory exception every time I try to load it.

The files are very large text files holding millions of records, each more than 250 MB in size (one of them is actually 7 GB!). When I try to load the 250 MB file into my application, it waits for about 10 seconds, during which my RAM usage (6 GB total, initially at about 2 GB used) climbs to around 5 GB, and then the program throws an out of memory exception.

I understand that an out of memory exception is about the lack of a contiguous free block of memory. I want to know how I can manage to load such a file into my program.

Here's the code I use to load the files; I'm using the dotNetRDF library.

TripleStore temp = new TripleStore();
//adding Uris to the store
temp.LoadFromFile(@"C:\MyTripleStore\pnd_en.nt");

dotNetRDF is simply not designed to handle this amount of data in its in-memory store. All of its parsing is streaming, but you have to build in-memory structures to hold the data, and that is what uses up all the memory and leads to the OOM exception.
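To illustrate the streaming point: if you only need to process the triples rather than hold them all in memory, dotNetRDF lets you pass an `IRdfHandler` to a parser so triples are consumed as they are read. This is a minimal sketch (not from the original answer) using the library's built-in `CountHandler`; the file path is the one from the question.

```csharp
using System;
using VDS.RDF.Parsing;
using VDS.RDF.Parsing.Handlers;

class StreamingCount
{
    static void Main()
    {
        // A CountHandler consumes each triple as it is parsed without
        // materialising it, so memory stays roughly flat regardless of
        // how large the input file is.
        var handler = new CountHandler();
        var parser = new NTriplesParser();
        parser.Load(handler, @"C:\MyTripleStore\pnd_en.nt");
        Console.WriteLine("Parsed {0} triples", handler.Count);
    }
}
```

You can write your own handler (deriving from `BaseRdfHandler`) to filter or transform triples on the fly, but a handler cannot give you random access or SPARQL querying; for that you still need a proper store.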

By default triples are fully indexed so that they can be queried efficiently with SPARQL, and with the current release of the library that requires approximately 1.7 KB per triple, so at most the library will let you work with 2-3 million triples in memory, depending on your available RAM (e.g. 3 million triples at 1.7 KB each is roughly 5 GB). As a related point, the SPARQL engine in the current release performs terribly at that scale, so even if you could load your data into memory you would not be able to query it effectively.

While the next release of the library does reduce memory usage and vastly improve SPARQL performance, it was still never designed for that volume of data.

However, dotNetRDF does support a wide variety of native triple stores out of the box (see the IQueryableGenericIOManager interface and its implementations), so you should load the DBpedia dump into an appropriate store using that store's native loading mechanism (which will be faster than loading via dotNetRDF), and then use dotNetRDF simply as a client through which to make your queries.
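As a sketch of the client-only approach: once the dump has been bulk-loaded into a store that exposes a SPARQL endpoint, dotNetRDF's `SparqlRemoteEndpoint` can run queries against it without holding any of the data locally. The endpoint URL below is a placeholder assumption, not something from the original answer.

```csharp
using System;
using VDS.RDF.Query;

class RemoteQueryClient
{
    static void Main()
    {
        // Hypothetical endpoint URL: substitute the SPARQL endpoint of
        // whichever native store you loaded the DBpedia dump into.
        var endpoint = new SparqlRemoteEndpoint(
            new Uri("http://localhost:3030/dbpedia/sparql"));

        // Only the query results cross the wire; the millions of triples
        // stay inside the store's own storage engine.
        SparqlResultSet results = endpoint.QueryWithResultSet(
            "SELECT * WHERE { ?s ?p ?o } LIMIT 10");

        foreach (SparqlResult result in results)
        {
            Console.WriteLine(result.ToString());
        }
    }
}
```

The same pattern works through the store-specific connectors (the IQueryableGenericIOManager implementations mentioned above) if you prefer a typed connection over a raw SPARQL endpoint.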

