
Search 1GB CSV file

I have a CSV file. Each line is made up of the same format, e.g.:

I,h,q,q,3,A,5,Q,3,[,5,Q,8,c,3,N,3,E,4,F,4,g,4,I,V,9000,0000001-100,G9999999990001800000000000001,G9999999990000001100PDNELKKMMCNELRQNWJ010, , , , , , ,D,Z,

I have a Dictionary<string, List<char>>.

It is populated by opening the file, reading each line, taking elements from the line and adding them to the dictionary; then the file is closed.

The dictionary is used elsewhere in the program: input data comes into the program, the key is looked up in the dictionary, and the 24 stored elements are compared against the input data.

using (StreamReader s = File.OpenText(file))
{
    string lineData;
    while ((lineData = s.ReadLine()) != null)
    {
        var elements = lineData.Split(',');
        //Do stuff with elements
        // Keep the first character of each of the first 24 fields for comparison
        var compareElements = elements.Take(24).Select(x => x[0]);
        FileData.Add(elements[27], new List<char>(compareElements));
    }
}

I have just been told that the CSV file will now be 800 MB and have roughly 8 million records in it. I have just tried to load this up on my dual-core Win 32-bit laptop with 4 GB of RAM in debug and it threw an OutOfMemoryException.

I am now thinking that not loading the file into memory will be the best bet, but I need to find a way to search the file quickly: check whether the input data has a matching item equal to element[27], then take the first 24 elements from that CSV line and compare them to the input data.
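Something along these lines is what I mean (a minimal sketch; FindMatch is just an illustrative name):

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch: scan the file line by line instead of holding it all in memory.
// Same layout as above: the key is in field 27, the first 24 fields are compared.
static List<char> FindMatch(string file, string key)
{
    using (var reader = File.OpenText(file))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var elements = line.Split(',');
            if (elements[27] == key)
                return new List<char>(elements.Take(24).Select(x => x[0]));
        }
    }
    return null; // no matching record found
}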

a) Even if I stuck with this approach and used 16GB RAM and Windows 64-bit, would having that many items in a dictionary be OK?

b) Could you provide some code/links to ways to search a CSV file quickly if you don't think using a dictionary is a good plan?

UPDATE: Although I have accepted an answer, I just wondered what people's thoughts were on using FileStream to do a lookup and then extract data.
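To make the idea concrete: one pass could record the byte offset of each line keyed by field 27, and lookups could then seek straight to the line. This is only a sketch; it assumes ASCII encoding, unique keys, and that every line ends with Environment.NewLine (BuildIndex/ReadLineAt are made-up names):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch: index key -> byte offset in one pass, then seek on lookup.
static Dictionary<string, long> BuildIndex(string file)
{
    var index = new Dictionary<string, long>();
    using (var reader = new StreamReader(file, Encoding.ASCII))
    {
        long offset = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            index[line.Split(',')[27]] = offset;
            // ASCII: one byte per char; assumes the file's newline matches Environment.NewLine
            offset += line.Length + Environment.NewLine.Length;
        }
    }
    return index;
}

static string ReadLineAt(string file, long offset)
{
    // In practice you would keep one FileStream open rather than reopening per lookup
    using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        using (var reader = new StreamReader(fs, Encoding.ASCII))
            return reader.ReadLine();
    }
}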

If you are going to be searching that many records, I would suggest bulk-inserting the file into a DBMS such as SQL Server, giving the appropriate fields indexes, and then using a SQL query to check whether a record exists.
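The load step might look something like this (only a sketch; the Records table, its columns, and BulkLoad are illustrative, and for 8 million rows you would insert in batches rather than build one giant DataTable):

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

// Sketch: stream the CSV into a DataTable and bulk-insert it into SQL Server.
// Assumes a pre-created table "Records" with an indexed "RecordKey" column.
static void BulkLoad(string file, string connectionString)
{
    var table = new DataTable();
    table.Columns.Add("RecordKey", typeof(string));
    table.Columns.Add("CompareData", typeof(string));

    foreach (var line in File.ReadLines(file))
    {
        var elements = line.Split(',');
        table.Rows.Add(elements[27], new string(elements.Take(24).Select(x => x[0]).ToArray()));
    }

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "Records";
        bulk.WriteToServer(table);
    }
}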

  • Forget MS Access. Really.
  • Try SQLite; it will be more than adequate for a few million rows (see the sketch after this list).
  • If you can't index your data, then don't use a database; use an external utility like egrep with an appropriate regex to search for specific fields. It will be much faster.
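A rough sketch of the SQLite route, using the System.Data.SQLite provider (the records table and its columns are illustrative):

using System.Data.SQLite; // System.Data.SQLite provider

// Sketch: load the CSV once into an indexed SQLite table, then look keys up with SQL.
static string LookupKey(string key)
{
    using (var conn = new SQLiteConnection("Data Source=records.db"))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "CREATE TABLE IF NOT EXISTS records (key TEXT, compare TEXT);" +
                "CREATE INDEX IF NOT EXISTS idx_key ON records(key);";
            cmd.ExecuteNonQuery();
        }
        // ... bulk-insert the CSV rows inside a single transaction, then:
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT compare FROM records WHERE key = @key";
            cmd.Parameters.AddWithValue("@key", key);
            return cmd.ExecuteScalar() as string;
        }
    }
}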

We had a similar problem with importing a large CSV file containing data that needed to be aggregated. In the end we did a bulk insert into a SQL Server table and used SQL to perform the aggregation. It was pretty quick in the end (a couple of minutes end-to-end).

There are several options available to you but yes, I would agree that loading this data into memory is not the best option.

a) You could load the data into a relational database, although this may be overkill for this type of data.

b) You could use a NoSQL solution like RavenDB. I think this may be a good option for you.

c) You could use a more efficient physical storage option like Lucene.

d) You could use a more efficient in-memory/caching option like Redis.

One solution is to break the file up into several smaller files and run a parallel search across each of them; the total search work is then less than or equal to n (reading the whole file).
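A sketch of that idea, assuming the file has already been split into chunk files with the same layout (ParallelFind is an illustrative name):

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Sketch: search pre-split chunk files in parallel and stop at the first hit.
// Each chunk is assumed to have the original layout (key in field 27).
static string ParallelFind(IEnumerable<string> chunkFiles, string key)
{
    string found = null;
    using (var cts = new CancellationTokenSource())
    {
        try
        {
            Parallel.ForEach(
                chunkFiles,
                new ParallelOptions { CancellationToken = cts.Token },
                chunk =>
                {
                    foreach (var line in File.ReadLines(chunk))
                    {
                        if (line.Split(',')[27] == key)
                        {
                            Interlocked.CompareExchange(ref found, line, null);
                            cts.Cancel(); // tell the other workers to stop
                            return;
                        }
                    }
                });
        }
        catch (OperationCanceledException)
        {
            // expected when a match cancels the loop
        }
    }
    return found;
}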

As the rest of your program uses the StringDictionary entries, you still ideally need to store your results in memory; you don't really want to be querying out to a DB thousands of times. (This could depend on whether your program lives on the DB server!)

I'd look into the memory usage of StringDictionary for your structure, see what your theoretical maximums are, and see if you can cover this with a caveat in the functional requirements. Otherwise look for a more efficient way to store the data; streaming your results out to an XML file, for example, will be quicker to access than a DB.
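As a rough back-of-envelope check on those theoretical maximums (the per-object overheads below are approximations for 32-bit .NET, not exact figures):

using System;

// key string (~29 chars): ~16 B object header + 2 B per char  ≈ 74 B
// List<char> of 24 items: list object + backing char[]        ≈ 90 B
// dictionary entry (hash code, key ref, value ref, next)      ≈ 16 B
const long records = 8000000;
const long bytesPerRecord = 74 + 90 + 16; // ≈ 180 B per record
long total = records * bytesPerRecord;    // ≈ 1.3 GB before heap fragmentation
Console.WriteLine("{0:F1} GB", total / (1024.0 * 1024 * 1024));
// A 32-bit process has roughly 2 GB of usable address space,
// which is consistent with the OutOfMemoryException reported above.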
