How can I more efficiently search a large list in Python?

Problem: I am working with a very large data set that I need to iterate over. Every five minutes my program adds about 1300 rows of information, each with 4 columns. Over the course of one day it therefore gathers about 374,400 rows, or 1,497,600 cells. There are 1300 rows per interval because the program tracks 1300 items every five minutes. For example:

Item_Name        Price  Quantity_in_Stock  Maximum_Stock_Level
---------------  -----  -----------------  -------------------
Soap              1.00                 10                   10
Frogs             1.25                 12                   16
Pickled Yogurt    1.35                  7                    8
Malodorous Ooze   6.66                  6                   66

I'm trying to tally the changes in each unique item's stock level over the course of the day. My current technique pulls the entire data set from a MySQL server. I rely on the item name, the stock level, the maximum stock, and the observation date:

q = """SELECT Item_Name,Item_In_Stock,Item_Max,Observation_Date
    FROM DB WHERE
    Observation_Date>DATE_ADD(curdate(),INTERVAL -1 DAY) """ 


try:
    x.execute(q)
    conn.commit()
    valueValue= x.fetchall() # The entire data set
except:
    conn.rollback()

Then I iterate through each Item_Name, and for each item I find all matching rows:

for item in ItemNames:
     matching = [s for s in valueValue if item[0] in s] # item[0] is an item name, i.e. Soap, Frogs, Pickled Yogurt, etc.

After that, I want to find the number of items purchased that day. This is tricky because items are restocked, so I have to compare each time interval against the previous one to see whether the stock level changed (I can't just compare the beginning and the end):

for item in matching:
    if not tempValue:
        tempValue = item[1] #for first row, set value equal to first row

    if tempValue > item[1]: #if last row greater than current row
        buyCount = buyCount + (item[1]-tempValue) # Add the difference to the buyCount (volume sold)
    tempValue = item[1] #set tempValue for next row comparison

This method works, but it is fairly slow. I've timed the tallying iteration at about 2.2 seconds per unique item (out of the 1300). This means that the entire day takes about 50 minutes to calculate. I'd like to cut down on this time if possible. What can I do to improve this searching and tallying function?

EDIT: I've tried letting MySQL do the work with the following code, but it is actually slower than using Python to sort through it all:

for item in getnameValues: # for each item name execute the following query
    q = """SELECT Item_Name,Item_In_Stock,Item_Max,Observation_Date
       FROM DB WHERE
       Item_Name=%s and Observation_Date>DATE_ADD(curdate(),INTERVAL -1 DAY) """
    try:
        x.execute(q,item[0]) # executes the query for the current item
        conn.commit()
        valueValue= x.fetchall()
    except:
        conn.rollback()

I'm assuming I need a way to loop through all the items within MySQL and have it send a list of lists back to Python. Right?

I'm sorry, but in its current form this all looks very scary.

First, the results of the computation seem to depend on the time at which you run it. You compute over everything from the start of yesterday up to the present moment, not just over yesterday. That is, today's records (those inserted before you run the script) are processed today and again tomorrow.
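One way to make the result independent of the run time is to query a fixed, half-open window covering only the previous calendar day. The following is a minimal sketch under that assumption; the helper name `yesterday_window` is hypothetical, and the query reuses the table and column names from the question without having been run against the real schema.

```python
from datetime import date, datetime, time, timedelta

def yesterday_window(today=None):
    """Return (start, end) datetimes spanning the previous calendar day.

    The window is half-open: start is inclusive, end is exclusive.
    """
    today = today or date.today()
    start = datetime.combine(today - timedelta(days=1), time.min)
    end = datetime.combine(today, time.min)
    return start, end

start, end = yesterday_window()

# Hypothetical parameterized query; the bounds are passed as parameters
# instead of being derived from curdate() inside the SQL.
query = """SELECT Item_Name, Item_In_Stock, Item_Max, Observation_Date
           FROM DB
           WHERE Observation_Date >= %s AND Observation_Date < %s"""
params = (start, end)
```

With bounds like these, running the script at 09:00 or at 23:00 would select the same rows, and no row would be counted on two different days.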

Second, you seem to iterate over the whole dataset len(item_names) times; that is, you scan all of the day's rows (about 374,400 by your figures, or roughly 1.5 million cells) 1300 times! Why not do the processing in a single iteration using defaultdict or Counter?

Third, you would do better to operate on integer values instead of comparing item-name strings.
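The single-iteration idea could look roughly like the sketch below. It assumes each row is a tuple of (item name, stock, maximum stock, observation time), as returned by the query in the question. The function name `tally_sold` and the sample rows are made up for illustration. It also counts sold units as positive numbers, which is an assumption about the intended sign.

```python
from collections import defaultdict

def tally_sold(rows):
    """Return {item_name: units_sold} from one pass over all rows.

    Each row is (item_name, stock, max_stock, observed_at). Only drops in
    stock between consecutive observations of the same item are counted,
    so restocking does not cancel out earlier sales.
    """
    last_stock = {}           # most recent stock level seen per item
    sold = defaultdict(int)   # running total of units sold per item
    # Sort by observation time so each row follows its predecessor.
    for name, stock, _max_stock, _observed_at in sorted(rows, key=lambda r: r[3]):
        previous = last_stock.get(name, stock)  # first row compares with itself
        sold[name] += max(previous - stock, 0)
        last_stock[name] = stock
    return dict(sold)

# Made-up sample data; observation times are simplified to integers.
rows = [
    ("Soap", 10, 10, 1),
    ("Frogs", 12, 16, 1),
    ("Soap", 7, 10, 2),
    ("Frogs", 12, 16, 2),
    ("Soap", 9, 10, 3),   # restocked between observations
    ("Soap", 4, 10, 4),
]
print(tally_sold(rows))
```

Each row is touched once after the sort, and the per-item lookup is a dictionary access instead of a scan of the full list, so the work no longer grows with the number of items times the number of rows.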

For better performance, you should do the work in MySQL instead of in Python.

If you want control over each insertion into your table, it's better to use a trigger in MySQL. And if you want to run a search (or whatever else you need) at the end of, for example, each day, you'd better use a cursor.

You can find plenty of material on both cursors and triggers with a simple internet search. By the way, tutsplus.com has some neat, clean tutorials about them.
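To illustrate the trigger idea, here is a rough sketch that keeps a running tally up to date on every insert. So that it runs without a server, it uses SQLite through Python's standard sqlite3 module as a stand-in for MySQL. MySQL's trigger syntax differs (for example, it requires FOR EACH ROW and would use GREATEST in place of SQLite's two-argument MAX). The table, column, and trigger names are hypothetical, and rows are assumed to arrive in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observations (
    item_name TEXT, stock INTEGER, max_stock INTEGER, observed_at TEXT
);
CREATE TABLE tally (
    item_name TEXT PRIMARY KEY, last_stock INTEGER, sold INTEGER
);
-- On every insert, add any drop in stock to the item's running total.
CREATE TRIGGER tally_on_insert AFTER INSERT ON observations
BEGIN
    INSERT OR IGNORE INTO tally (item_name, last_stock, sold)
    VALUES (NEW.item_name, NEW.stock, 0);
    UPDATE tally
       SET sold = sold + MAX(last_stock - NEW.stock, 0),
           last_stock = NEW.stock
     WHERE item_name = NEW.item_name;
END;
""")

# Made-up observations for one item, five minutes apart.
observations = [
    ("Soap", 10, 10, "2024-01-01 00:00"),
    ("Soap", 7, 10, "2024-01-01 00:05"),
    ("Soap", 9, 10, "2024-01-01 00:10"),  # restocked
    ("Soap", 4, 10, "2024-01-01 00:15"),
]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", observations)

result = conn.execute("SELECT item_name, sold FROM tally").fetchall()
print(result)
```

With this design the daily total is a single lookup in the tally table, at the cost of a little extra work on each insert. The tally would need to be reset or keyed by date to report per-day figures.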
