
Searching large Excel files and handling large amounts of data efficiently

I've started on a project; here's what I have:

An Excel file (exl1) containing 450K records, with 50K more added each month.

exl1 format:

invoice#|Box#|Serial-#1|Serial-#2|5-val-enum#1|5-val-enum#2|10-val-enum|4-val-enum|timestamp

ex1: abc123|box1|0987654321|A123456789|Plant|Tree|PersonName1|North|DateTime.Now

ex2: qwe345|box9|12345678901234567890|#NA|Animal|Cat|PersonName1|South|DT.Now

An Excel file (exl2) containing roughly 50K records (named searchVal for the purpose of explanation). exl2 format: Serial1

ex1a: A123456789

ex1b: 0987654321

ex2a: 12345678901234567890

Here's what I have to do:

  1. Compare each value in exl2 (searchVal) to either Serial#1 or Serial#2, depending on the value of 5-val-enum#1 in exl1 (example1 = Plant, example2 = Animal)

  2. If searchVal starts with [a-z], search Serial#2; otherwise search Serial#1. So with searchVal ex1a search col3, and with searchVal ex1b search col2:

      if (exl1.Rows[columnHeader][col4].ToString() == "Plant")
      {
          string rowVal = exl2.Rows[rowIterator][col0].ToString();
          if (regex.IsMatch(rowVal[0].ToString())) // checks to see if serial1 or serial2
          {
              if (rowVal == exl1.Rows[rowIterator][col3].ToString())
              {
                  // add matched row to ResultsDT
              }
              else
              {
                  // next row
              }
          }
          else
          {
              // search col2 with same procedure
          }
      }
      else
      {
          // search col2
      }
  3. For the sake of explanation, let's say Person1 matched 400 Plants, of which 100 were trees, 100 bushes, 100 grasses and 100 flowers, and that he matched 400 Animals, of which 100 each were cats, dogs, snakes and birds. From these matches I'd like to produce the output SUMMARY1: PersonName|Plants|Animals|Category3|Category4|Category5, with a more detailed one for each of the categories, like SUMMARY2: PersonName|Trees|Bushes|Grasses|Flowers, leading to SUMM1: Person1|400|400|x|n|y and SUMM2 (plants only): Person1|100|100|100|100

  4. Most importantly: do all this without killing the PC it's running on for 3 hours while it computes (a hash-based matching sketch follows this list)
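Not from the original post, but here is a minimal sketch of one way to avoid a row-by-row nested scan: assuming the exl2 serials fit comfortably in memory, load them into a HashSet<string> once and test each exl1 row against it in roughly constant time. The column names and method are illustrative, not the poster's actual code.

    using System;
    using System.Collections.Generic;
    using System.Data;

    static class Matcher
    {
        // Sketch only: column names ("Serial-#1", "Serial-#2") follow the
        // format described above but are otherwise assumptions.
        public static DataTable MatchRows(DataTable exl1, IEnumerable<string> searchVals)
        {
            // One-time O(m) build; each Contains() afterwards is ~O(1),
            // so the 450K rows are matched in a single pass.
            var serials = new HashSet<string>(searchVals, StringComparer.OrdinalIgnoreCase);

            DataTable results = exl1.Clone(); // same columns, no rows
            foreach (DataRow row in exl1.Rows)
            {
                // Cheap set lookups make it practical to simply test both
                // serial columns instead of branching on the enum column.
                if (serials.Contains(row["Serial-#1"].ToString()) ||
                    serials.Contains(row["Serial-#2"].ToString()))
                {
                    results.ImportRow(row);
                }
            }
            return results;
        }
    }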

There are at least two options:

  1. Treat Excel as the database and check its performance. Here is how to do this: http://www.beansoftware.com/NET-Tutorials/Excel-ADO.NET-Database.aspx (a short sketch follows this list).
  2. If option 1 is too slow, import this data into a real database (MS SQL, MySQL, PostgreSQL, etc.), add appropriate indexes, and perform your searches in the DB. The Excel file would be treated just as a data source for the initial import.
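As a hedged illustration of option 1 (the approach the linked tutorial covers), this is roughly what querying an .xlsx file through ADO.NET looks like; it assumes the Microsoft ACE OLE DB provider is installed, and the file path and sheet name are placeholders.

    using System.Data;
    using System.Data.OleDb;

    // Sketch: treat the workbook as a database table via the ACE provider.
    // Path and sheet name are placeholders.
    string connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;" +
                     @"Data Source=C:\data\exl1.xlsx;" +
                     @"Extended Properties='Excel 12.0 Xml;HDR=YES'";

    var exl1 = new DataTable();
    using (var conn = new OleDbConnection(connStr))
    using (var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            exl1.Load(reader); // pulls the whole sheet into memory
        }
    }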

Depending on the ratio of Excel updates to queries run, it might be a good idea to simply read the values into a SQL Server database and query/process the data there. I would imagine it takes some time to read the values into SQL Server, but the queries should take almost no time...
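For the import step, something along these lines would work. This is a sketch: it assumes the rows are already in a DataTable (e.g. loaded as above) and that a destination table with compatible columns exists; the connection string and table name are placeholders.

    using System.Data;
    using System.Data.SqlClient;

    static class Importer
    {
        // Sketch: bulk-load a DataTable into SQL Server. The destination
        // table (dbo.Exl1Records) must already exist with compatible columns.
        public static void BulkImport(DataTable rows, string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "dbo.Exl1Records";
                    bulk.BatchSize = 10000; // commit in chunks, not all 450K at once
                    bulk.WriteToServer(rows);
                }
            }
        }
    }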

I'm assuming the question here is "how can I perform these tasks efficiently?"

The answer is, you shouldn't. It sounds like you are trying to do OLAP on the cheap (except that, well, it may not be happening strictly online), and there are a lot of solutions already available for this.

Since you already have an established procedure of using an Excel spreadsheet, PALO may serve your needs (edit: it's free).

Alternatively, what you have there is a denormalized set of records; if you normalize it into several sets and enter it into a database (using a script, obviously), you can let your database take care of the intensive computations. Edit: There are a lot of free databases you can use (SQL is a language, not a brand), e.g. PostgreSQL or MySQL.
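To make "let the database do the work" concrete: once the matched records sit in a normalized table, the SUMMARY1-style rollup from the question collapses into a single grouped query. This is only a sketch; the table and column names (matched_records, enum1, person_name) and the connection string are assumptions, not from the original post.

    using System;
    using System.Data.SqlClient;

    // Sketch: SUMMARY1 (PersonName|Plants|Animals|...) as one GROUP BY.
    // Table and column names are illustrative.
    const string summarySql = @"
        SELECT person_name,
               SUM(CASE WHEN enum1 = 'Plant'  THEN 1 ELSE 0 END) AS plants,
               SUM(CASE WHEN enum1 = 'Animal' THEN 1 ELSE 0 END) AS animals
        FROM matched_records
        GROUP BY person_name;";

    using (var conn = new SqlConnection(@"Server=.;Database=Invoices;Integrated Security=true"))
    using (var cmd = new SqlCommand(summarySql, conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine($"{reader["person_name"]}|{reader["plants"]}|{reader["animals"]}");
        }
    }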

If you insist on parsing and analyzing the files yourself, then I suggest you modify your algorithm to do two things:

Firstly, get your 50k set of records to fit into as little memory as is reasonably possible. Obviously, you don't want to store your records as 50k strings: parse them, and build up a memory structure which lets you access only the information you need. Edit: Never mind, misunderstood your input data.

Secondly, modify your algorithm so that it can be run piecemeal. Currently you have one set of 50k records and another set of 450k records, and it sounds like you expect to run your program each month (or more frequently) on the full set of records plus whatever records have been added to the 450k set. If you start storing incremental results, you can structure your script so that it processes (for example) up to 10k records at a time from your 450k record set, and run several instances of your script in sequence. That way you avoid re-analyzing the whole 450k records every month, and you also have a handy way to stop and start the process midway (using some kind of parent script). A batching sketch follows.
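A minimal sketch of that piecemeal structure, assuming the 450k-record set has been exported to a text file and that a checkpoint of the last processed row is persisted between runs; the file names, batch size, and ProcessBatch body are all hypothetical.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    const int BatchSize = 10000;
    const string CheckpointFile = "checkpoint.txt";

    // Resume from wherever the previous run (or a parent script) stopped.
    int start = File.Exists(CheckpointFile)
        ? int.Parse(File.ReadAllText(CheckpointFile))
        : 0;

    List<string> allRows = File.ReadAllLines("exl1.csv").ToList();

    for (int i = start; i < allRows.Count; i += BatchSize)
    {
        var batch = allRows.Skip(i).Take(BatchSize).ToList();
        ProcessBatch(batch); // match + store incremental results for this chunk

        // Record progress so a later invocation can pick up from here.
        File.WriteAllText(CheckpointFile, (i + batch.Count).ToString());
    }

    static void ProcessBatch(IReadOnlyList<string> rows)
    {
        // Placeholder for the per-chunk matching/summarizing work.
        Console.WriteLine($"processed {rows.Count} rows");
    }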

For a more complex approach, look at Divide and Conquer as it applies to algorithms.
