
Out of memory exception while loading a huge DBpedia dump

I am trying to load a large dump of DBpedia data into my C# application, but I get an out of memory exception every time I try to load it.

The files are very large text files holding millions of records, and each is more than 250 MB (one of them is actually 7 GB!). When I try to load the 250 MB file into my application, it waits for about 10 seconds, during which my RAM usage (6 GB total, with about 2 GB in use initially) climbs to about 5 GB, and then the program throws an out of memory exception.

I understand that an out of memory exception is about failing to find a large enough contiguous block of free memory. How can I load such a file into my program?

Here's the code I use to load the files. I'm using the dotNetRDF library.

using VDS.RDF;

TripleStore temp = new TripleStore();
// Load every triple from the N-Triples dump into the in-memory store
temp.LoadFromFile(@"C:\MyTripleStore\pnd_en.nt");

dotNetRDF is simply not designed to handle this amount of data in its in-memory store. All of its parsing is streaming, but the in-memory structures that have to be built to store the data are what take up all the memory and lead to the OOM exception.
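To illustrate the difference between streaming parsing and in-memory storage, here is a minimal sketch that pushes the file through the parser without keeping any triples. It assumes a dotNetRDF version that ships the handlers API (`CountHandler` in `VDS.RDF.Parsing.Handlers` and the `Load(IRdfHandler, string)` overload), which may be newer than the release discussed here; the `StreamingCount` class and `CountTriples` method are illustrative names, not part of the library.

```csharp
using System;
using VDS.RDF.Parsing;
using VDS.RDF.Parsing.Handlers;

static class StreamingCount
{
    // Streams the N-Triples file through a handler that only counts triples,
    // so no graph or triple store is built up in memory.
    public static long CountTriples(string path)
    {
        var handler = new CountHandler();
        var parser = new NTriplesParser();
        parser.Load(handler, path);
        return handler.Count;
    }

    static void Main()
    {
        // Path taken from the question above.
        Console.WriteLine(CountTriples(@"C:\MyTripleStore\pnd_en.nt"));
    }
}
```

Because nothing is retained per triple, memory use should stay roughly flat regardless of file size. This only helps if you need to scan or filter the data, not if you need to query it afterwards.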

By default, triples are fully indexed so that they can be queried efficiently with SPARQL. With the current release of the library, that requires approximately 1.7 KB per triple, so at most the library will let you work with 2-3 million triples in memory, depending on your available RAM. As a related point, the SPARQL algorithm in the current release performs terribly at that scale, so even if you can load your data into memory, you won't be able to query it effectively.
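As a rough back-of-the-envelope check on that figure, the sketch below multiplies the quoted 1.7 KB per triple by a few triple counts. The helper name `EstimateGigabytes` and the sample counts are illustrative assumptions, and the result ignores everything else the process allocates.

```csharp
using System;

static class MemoryEstimate
{
    // Approximate per-triple cost quoted above for a fully indexed in-memory store.
    const double KilobytesPerTriple = 1.7;

    // Hypothetical helper: rough footprint in GB, using decimal units (1 GB = 1,000,000 KB).
    public static double EstimateGigabytes(long tripleCount)
    {
        return tripleCount * KilobytesPerTriple / 1_000_000.0;
    }

    static void Main()
    {
        foreach (long triples in new long[] { 2_000_000, 3_000_000, 10_000_000 })
        {
            Console.WriteLine("{0} triples: about {1:F1} GB", triples, EstimateGigabytes(triples));
        }
    }
}
```

At this rate, 2 million triples come to about 3.4 GB and 3 million to about 5.1 GB, which is consistent with a 6 GB machine running out of memory partway through a file that holds millions of records.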

While the next release of the library does reduce memory usage and vastly improves SPARQL performance, it was still never designed for that volume of data.

However, dotNetRDF does support a wide variety of native triple stores out of the box (see the IQueryableGenericIOManager interface and its implementations), so you should load the DBpedia dump into an appropriate store using that store's native loading mechanism (which will be faster than loading via dotNetRDF) and then use dotNetRDF simply as a client through which to make your queries.
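As one possible shape of the "dotNetRDF as a client" approach, the sketch below queries a store over its SPARQL endpoint. This is a swapped-in technique rather than the IQueryableGenericIOManager connectors named above: it assumes the chosen store exposes an HTTP SPARQL endpoint and that your dotNetRDF version includes `SparqlRemoteEndpoint` in `VDS.RDF.Query`. The endpoint URL is hypothetical.

```csharp
using System;
using VDS.RDF.Query;

static class RemoteQuery
{
    static void Main()
    {
        // Hypothetical endpoint URL: replace with the SPARQL endpoint your store exposes.
        var endpoint = new SparqlRemoteEndpoint(new Uri("http://localhost:8890/sparql"));

        // The store evaluates the query; only the result rows travel back to the client.
        SparqlResultSet results = endpoint.QueryWithResultSet(
            "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");

        foreach (SparqlResult row in results)
        {
            Console.WriteLine("{0} {1} {2}", row["s"], row["p"], row["o"]);
        }
    }
}
```

With this arrangement the application only ever holds the rows a query returns, so its memory use depends on result size rather than on the size of the dump.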

