Fastest way to iterate a function over a pandas DataFrame

Question

I have a function which operates over the lines of a CSV file, adding the values of different cells to dictionaries depending on whether certain conditions are met:

import pandas as pd

df = pd.concat([pd.read_csv(filename) for filename in args.csv], ignore_index=True)

ID_Use_Totals = {}
ID_Order_Dates = {}
ID_Received_Dates = {}
ID_Refs = {}
IDs = args.ID

def TSQs(row):

    global ID_Use_Totals, ID_Order_Dates, ID_Received_Dates

    if row['Stock Item'] not in IDs:
        pass
    else:
        if row['Action'] in ['Order/Resupply', 'Cons. Purchase']:
            if row['Stock Item'] not in ID_Order_Dates:
                ID_Order_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
            else:
                ID_Order_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
        
        elif row['Action'] == 'Received':
            if row['Stock Item'] not in ID_Received_Dates:
                ID_Received_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
            else:
                ID_Received_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
                                    
        elif row['Action'] == 'Use':
            if row['Stock Item'] in ID_Use_Totals: 
                ID_Use_Totals[row['Stock Item']].append(row['Qty'])
            else:
                ID_Use_Totals[row['Stock Item']] = [row['Qty']]
                                       
        else:
            pass

Currently, I am doing:

for index, row in df.iterrows():
    TSQs(row)

But timer() reports between 70 and 90 seconds for a 40,000-line CSV file.

I want to know the fastest way of applying this over the entire dataframe (which could potentially be hundreds of thousands of rows).

Answer 1

I'd wager that not using Pandas for this could be faster.

Additionally, you can use defaultdicts to avoid having to check whether you've seen a given product yet:

import csv
import collections
import datetime

ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}

for filename in args.csv:
    with open(filename, newline="") as f:  # newline="" is what the csv docs recommend
        for row in csv.DictReader(f):
            item = row["Stock Item"]
            if item not in IDs:
                continue
            ref = row["Ref"]
            action = row["Action"]
            if action in order_actions:
                date = datetime.datetime.fromisoformat(row["TransDate"])
                ID_Order_Dates[item].append({ref: date})
            elif action == "Received":
                date = datetime.datetime.fromisoformat(row["TransDate"])
                ID_Received_Dates[item].append({ref: date})
            elif action == "Use":
                ID_Use_Totals[item].append(row["Qty"])

EDIT: If the CSV is really of the form

"Employee", "Stock Location", "Stock Item"
"Ordered", "16", "32142"

the standard-library csv module can't quite parse it with its default settings.
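For what it's worth, it is the space after each comma that trips up the default dialect. The following is a minimal sketch, assuming the file really looks like the sample above, that uses the csv module's skipinitialspace option; the in-memory sample string is made up for illustration.

```python
import csv
import io

# Made-up sample in the same shape as the excerpt above: a space follows each comma.
sample = '"Employee", "Stock Location", "Stock Item"\n"Ordered", "16", "32142"\n'

# skipinitialspace=True makes the reader ignore whitespace right after a delimiter,
# so the quote that follows is treated as real quoting rather than field content.
rows = list(csv.DictReader(io.StringIO(sample), skipinitialspace=True))
print(rows[0]["Stock Item"])
```

If this works on the real files, the csv.DictReader loop above could stay as it is, with only the extra keyword argument added.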

You could use Pandas to parse the file, then iterate over rows, though I'm not sure whether this will end up being much faster:

import collections
import datetime
import pandas as pd

ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}

for filename in args.csv:
    for index, row in pd.read_csv(filename).iterrows():
        item = row["Stock Item"]
        if item not in IDs:
            continue
        ref = row["Ref"]
        action = row["Action"]
        if action in order_actions:
            date = datetime.datetime.fromisoformat(row["TransDate"])
            ID_Order_Dates[item].append({ref: date})
        elif action == "Received":
            date = datetime.datetime.fromisoformat(row["TransDate"])
            ID_Received_Dates[item].append({ref: date})
        elif action == "Use":
            ID_Use_Totals[item].append(row["Qty"])
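If the files do have to go through Pandas, one possible way to reduce the per-row cost is to convert the frame to plain dicts before looping, since iterrows builds a Series for every row. This is only a sketch: the sample data is made up, IDs stands in for set(args.ID), and the dates are parsed once per column with pd.to_datetime rather than once per row.

```python
import collections
import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "Stock Item": ["A1", "A1", "B2", "C3"],
    "Action": ["Use", "Order/Resupply", "Received", "Use"],
    "Ref": ["r1", "r2", "r3", "r4"],
    "TransDate": ["2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04"],
    "Qty": [5, 10, 10, 2],
})
IDs = {"A1", "B2"}  # stands in for set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}

# Parse all dates in one column-wise call instead of once per row.
df["TransDate"] = pd.to_datetime(df["TransDate"])

ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)

# to_dict("records") yields one plain dict per row, so no Series is built per row.
for row in df.to_dict("records"):
    item = row["Stock Item"]
    if item not in IDs:
        continue
    action = row["Action"]
    if action in order_actions:
        ID_Order_Dates[item].append({row["Ref"]: row["TransDate"]})
    elif action == "Received":
        ID_Received_Dates[item].append({row["Ref"]: row["TransDate"]})
    elif action == "Use":
        ID_Use_Totals[item].append(row["Qty"])

print(dict(ID_Use_Totals))
```

Whether this beats the csv.DictReader version on real data would need to be timed.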

Answer 2

You can use the apply function. The code will look like this:

df.apply(TSQs, axis=1)

Here, when axis=1, each row is passed to the function TSQs as a pd.Series, which you can index like row["Ref"] to get that row's value. Note that this is not a truly vectorized operation: apply with axis=1 still calls TSQs once per row from Python, so any speedup over an explicit for loop is likely to be modest.
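To make the per-row behaviour of apply concrete, here is a small sketch on made-up data. The helper record_and_double is hypothetical; it simply records each row label it is handed so the number of Python-level calls can be inspected and compared with a single column-wise operation.

```python
import pandas as pd

df = pd.DataFrame({"Qty": [1, 2, 3, 4]})  # made-up data

seen = []  # row labels the helper has been called with

def record_and_double(row):
    # Hypothetical helper: remembers every row it receives, then doubles Qty.
    seen.append(row.name)
    return row["Qty"] * 2

via_apply = df.apply(record_and_double, axis=1)  # Python call for each row
vectorized = df["Qty"] * 2                       # one column-wise operation

print(len(seen), "calls for", len(df), "rows")
```

Both expressions produce the same values, but only the second avoids calling back into Python for each row.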

Answer 3

It's probably fastest not to iterate at all:

# Build boolean masks for your various conditions
idx_stock_item = df["Stock Item"].isin(IDs)
idx_purchases = df["Action"].isin(['Order/Resupply', 'Cons. Purchase'])

# Combine the masks to act on all matching rows at once
idx_combined = idx_stock_item & idx_purchases
# It looks like you were putting a single-entry dictionary in each row. Wouldn't it
# make more sense to use two columns, i.e. take advantage of the DataFrame structure?
# Here ID_Order_Dates is assumed to become a DataFrame rather than a dict.
ID_Order_Dates = df.loc[idx_combined, ["Stock Item", "Ref", "TransDate"]].rename(columns={"TransDate": "Date"})
ID_Order_Dates["Date"] = pd.to_datetime(ID_Order_Dates["Date"])

# repeat for your other cases
# ...
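As one way the "other cases" might be filled in, the sketch below applies the same masking idea to all three branches of the original function and uses groupby to collect the quantities per item. The sample data is made up, IDs stands in for args.ID, and keeping Ref and TransDate as columns instead of single-entry dicts is an assumption about what the downstream code can accept.

```python
import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "Stock Item": ["A1", "A1", "B2", "A1", "C3"],
    "Action": ["Use", "Order/Resupply", "Received", "Use", "Use"],
    "Ref": ["r1", "r2", "r3", "r4", "r5"],
    "TransDate": ["2020-07-01", "2020-07-02", "2020-07-03",
                  "2020-07-04", "2020-07-05"],
    "Qty": [5, 10, 10, 2, 7],
})
IDs = ["A1", "B2"]  # stands in for args.ID

# Parse dates once, then keep only the stock items of interest.
df["TransDate"] = pd.to_datetime(df["TransDate"])
wanted = df[df["Stock Item"].isin(IDs)]

# One boolean mask per branch of the original function.
orders = wanted[wanted["Action"].isin(["Order/Resupply", "Cons. Purchase"])]
received = wanted[wanted["Action"] == "Received"]
used = wanted[wanted["Action"] == "Use"]

# Quantities collected per item; Ref/date pairs kept as ordinary columns.
ID_Use_Totals = used.groupby("Stock Item")["Qty"].apply(list).to_dict()
ID_Order_Dates = orders[["Stock Item", "Ref", "TransDate"]]
ID_Received_Dates = received[["Stock Item", "Ref", "TransDate"]]

print(ID_Use_Totals)
```

If the original list-of-dicts shape is required, it can be rebuilt from these frames afterwards, at which point the work is proportional to the number of matching rows rather than to the whole file.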

3 answers

Answer 1: 1, accepted, 2020-07-30 12:40:57
Answer 2: 1, 2020-07-30 12:44:30
Answer 3: 1, 2020-07-30 12:58:15
