
Data structure for indexing a big file

I need to build an index for a very big (50GB+) ASCII text file that will give me fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where map[i][j] is the position in the file of the jth word of the ith line.

I will build the index sequentially, i.e. read the whole file and populate the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].

The only problem I see is that I can't predict the total count of lines/words, so I will hit an O(n) copy on every List reallocation, and I have no idea how to avoid this.

Are there any other problems with the data structure I chose for the task? Which structure could be better?

UPD: The file will not be altered during runtime. There are no other ways to retrieve content except the ones I've listed.

  1. Growing a large list is a very expensive operation, so it's better to reserve the list capacity up front.
  2. I'd suggest using two lists. The first contains the file offsets of all words; the second contains, for each line, the index in the first list of that line's first word.
  3. You are very likely to exceed all available RAM, and once the system starts paging GC-managed memory in and out, the program's performance will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
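
Points 1 and 2 could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name FlatWordIndex and its members are hypothetical, a "word" is assumed to be a run of bytes other than space, tab, CR or LF, and lines are assumed to end with LF. Note that List<T> is indexed by int, so it cannot hold more than about two billion entries; a 50GB file may exceed that, which is another reason to look at point 3.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: two flat lists instead of List<List<long>>.
public sealed class FlatWordIndex
{
    // File offset of every word, in file order.
    private readonly List<long> wordOffsets;
    // lineStarts[i] = index in wordOffsets of the first word of line i.
    // The final entry is a sentinel equal to wordOffsets.Count.
    private readonly List<int> lineStarts;

    private FlatWordIndex(int expectedWords, int expectedLines)
    {
        // Reserving capacity up front avoids repeated reallocation (point 1).
        wordOffsets = new List<long>(expectedWords);
        lineStarts = new List<int>(expectedLines + 1);
    }

    public int LineCount => lineStarts.Count - 1;

    public int WordCount(int line) => lineStarts[line + 1] - lineStarts[line];

    public long GetWordOffset(int line, int word)
    {
        int first = lineStarts[line];
        int next = lineStarts[line + 1];
        if (word < 0 || word >= next - first)
            throw new ArgumentOutOfRangeException(nameof(word));
        return wordOffsets[first + word];
    }

    // Single sequential pass over the input. The expected counts are only
    // capacity hints (for example, estimated from the file size).
    public static FlatWordIndex Build(Stream input, int expectedWords = 0, int expectedLines = 0)
    {
        var index = new FlatWordIndex(expectedWords, expectedLines);
        index.lineStarts.Add(0);
        long pos = 0;
        bool inWord = false;
        int last = -1;
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                inWord = false;
                // The next line (if any) starts at the next word to be added.
                index.lineStarts.Add(index.wordOffsets.Count);
            }
            else if (b == ' ' || b == '\t' || b == '\r')
            {
                inWord = false;
            }
            else if (!inWord)
            {
                inWord = true;
                index.wordOffsets.Add(pos);
            }
            last = b;
            pos++;
        }
        // If the file does not end with a newline, close the last line.
        if (pos > 0 && last != '\n')
            index.lineStarts.Add(index.wordOffsets.Count);
        return index;
    }
}
```

Compared with a list of lists, this layout has one allocation per list rather than one per line, and the per-line cost is a single int instead of a whole List object.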

UPD: Memory-mapped files are effective when you need to work with huge amounts of data that do not fit in RAM. Basically, it's your only choice if your index becomes bigger than available RAM.
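
A minimal sketch of what a memory-mapped index might look like with System.IO.MemoryMappedFiles. The file layout is an assumption made for illustration (two flat files of 8-byte integers: one with the offset of every word, one with the index of each line's first word plus a trailing sentinel), and the names IndexFile, MappedLongArray and MappedWordIndex are hypothetical. It also assumes a 64-bit process, so that a multi-gigabyte view can be mapped, and non-empty index files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class IndexFile
{
    // Writes values as consecutive 8-byte integers (little-endian via
    // BinaryWriter; assumed to match the machine's native byte order).
    public static void Write(string path, IEnumerable<long> values)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
            foreach (long v in values)
                writer.Write(v);
    }
}

// Read-only view of a file of 8-byte integers, addressable by a long index.
public sealed class MappedLongArray : IDisposable
{
    private readonly MemoryMappedFile file;
    private readonly MemoryMappedViewAccessor view;

    public long Length { get; }

    public MappedLongArray(string path)
    {
        // Take the length from the file: a view's Capacity may be rounded up.
        Length = new FileInfo(path).Length / sizeof(long);
        file = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        view = file.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
    }

    public long this[long i]
    {
        get
        {
            if (i < 0 || i >= Length) throw new IndexOutOfRangeException();
            return view.ReadInt64(i * sizeof(long));
        }
    }

    public void Dispose()
    {
        view.Dispose();
        file.Dispose();
    }
}

public sealed class MappedWordIndex : IDisposable
{
    private readonly MappedLongArray words;  // file offset of every word
    private readonly MappedLongArray lines;  // first-word index per line + sentinel

    public MappedWordIndex(string wordsPath, string linesPath)
    {
        words = new MappedLongArray(wordsPath);
        lines = new MappedLongArray(linesPath);
    }

    public long LineCount => lines.Length - 1;

    public long GetWordOffset(long line, long word)
    {
        long first = lines[line];
        long next = lines[line + 1];
        if (word < 0 || word >= next - first)
            throw new ArgumentOutOfRangeException(nameof(word));
        return words[first + word];
    }

    public void Dispose()
    {
        words.Dispose();
        lines.Dispose();
    }
}
```

With this layout the operating system pages index data in and out on demand, and the index lives outside the GC heap, so its size is limited by disk space and address space rather than by managed memory.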

I don't think it's a good idea to keep such a huge amount of data in memory. std::map or std::list are not going to work for such a scenario, neither memory-wise nor performance-wise. You will need secondary storage. Search for external sorting and searching techniques. Look at B-tree and trie/B-trie if you want to implement something yourself. But most probably you would not want to reinvent the wheel, so look for libraries that can do it for you. For example, LevelDB is a file-based key-value store: like std::map, but it keeps the data in files on disk. There are more like it.

