
Why is my python DataFrame performing so slowly

I'm building an application that provides some very simple analysis on large datasets. These datasets are delivered in CSV files of 10 million+ rows with about 30 columns. (I don't need many of the columns.)

Logic tells me that loading the entire file into a DataFrame should make it faster to access. But my computer says no.

I've tried loading in batches, as well as loading the entire file and then performing the functions in batches.

But the end result is that the same process takes more than 10 times as long as a simple file-read approach.

Here is the DataFrame version:

def runProcess():
    global batchSize
    batchCount = 10
    if rowLimit < 0:
        with open(df_srcString) as f:
            rowCount = sum(1 for line in f)
        if batchSize < 0:
            batchSize = batchSize * -1
            runProc = readFileDf
        else:
            runProc = readFileDfBatch
        batchCount = int(rowCount / batchSize) + 1
    else:
        batchCount = int(rowLimit / batchSize) + 1
    for i in range(batchCount):
        result = runProc(batchSize, i)
        print(result)

def readFileDfBatch(batch, batchNo):
    sCount = 0
    lCount = 0
    jobStartTime = datetime.datetime.now()
    eof = False
    totalRowCount = 0

    startRow = batch * batchNo
    df_wf = pd.read_csv(df_srcString, sep='|', header=None,
                        names=df_fldHeads.split(','), usecols=df_cols,
                        dtype=str, nrows=batch, skiprows=startRow)
    for index, row in df_wf.iterrows():
        result = parseDfRow(row)
        totalRowCount = totalRowCount + 1
        if result == 1:
            sCount = sCount + 1
        elif result == 2:
            lCount = lCount + 1
    eof = batch > len(df_wf)
    if rowLimit >= 0:
        eof = (batch * batchNo >= rowLimit)
    jobEndTime = datetime.datetime.now()
    runTime = jobEndTime - jobStartTime
    return [batchNo, sCount, lCount, totalRowCount, runTime]

def parseDfRow(row):
#df_cols = ['ColumnA','ColumnB','ColumnC','ColumnD','ColumnE','ColumnF']
    status = 0
    s2 = getDate(row['ColumnB'])
    l2 = getDate(row['ColumnD'])
    gDate = datetime.date(1970,1,1)
    r1 = datetime.date(int(row['ColumnE'][1:5]),12,31)
    r2 = row['ColumnF']
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    status = False
    if s2 > lastSeen:
        status = 1
    elif l2 > lastSeen:
        status = 2
    return status

And here is the simple file reader version:

def readFileStd(rows, batch):
    print("Starting read: ")
    batchNo = 1
    global targetFile
    global totalCount
    global sCount
    global lCount
    targetFile = open(df_srcString, "r")
    eof = False
    while not eof:
        batchStartTime = datetime.datetime.now()
        eof = readBatch(batch)
        batchEndTime = datetime.datetime.now()
        runTime = batchEndTime - batchStartTime
        if rows > 0 and totalCount >= rows: break
        batchNo = batchNo + 1
    targetFile.close()
    return [batchNo, sCount, lCount, totalCount, runTime]

def readBatch(batch):
    global targetFile
    global totalCount
    rowNo = 1
    rowStr = targetFile.readline()
    while rowStr:
        parseRow(rowStr)
        totalCount = totalCount + 1
        if rowNo == batch: 
            return False
        rowStr = targetFile.readline()
        rowNo = rowNo + 1
    return True

def parseRow(rowData):
    rd = rowData.split('|')
    s2 = getDate(rd[3])
    l2 = getDate(rd[5])
    gDate = datetime.date(1970,1,1)
    r1 = datetime.date(int(rd[23][1:5]),12,31)
    r2 = rd[24]
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    status = False
    if s2 > lastSeen:
        global sCount
        sCount = sCount + 1
        status = True
        gDate = s2
    elif l2 > lastSeen:
        global lCount
        lCount = lCount + 1
        gDate = s2

Am I doing something wrong?

iterrows doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.

Replace for index, row in df_wf.iterrows(): with df_wf.apply(something, axis=1), where something is a function that encapsulates the logic you needed from iterrows and uses numpy vectorized operations.
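
Better still, drop the per-row function entirely. Here is a minimal sketch of one fully vectorized version, assuming the question's column names and substituting pd.to_datetime for the unshown getDate/getLastDate helpers:

import numpy as np
import pandas as pd

def classify(df_wf):
    # Assumption: ColumnB/ColumnD/ColumnF hold date strings that
    # pd.to_datetime can parse; the real getDate/getLastDate formats
    # are not shown in the question.
    s2 = pd.to_datetime(df_wf['ColumnB'], errors='coerce')
    l2 = pd.to_datetime(df_wf['ColumnD'], errors='coerce')
    # Fallback date built from ColumnE, mirroring parseDfRow's r1.
    r1 = pd.to_datetime(df_wf['ColumnE'].str[1:5] + '-12-31', errors='coerce')
    last_seen = pd.to_datetime(df_wf['ColumnF'], errors='coerce').fillna(r1)
    # np.select classifies every row in one pass instead of one
    # Python-level iteration per row.
    status = np.select([s2 > last_seen, l2 > last_seen], [1, 2], default=0)
    return int((status == 1).sum()), int((status == 2).sum()), len(df_wf)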

Also, if your df doesn't fit in memory so that you need to read in batches, consider using dask or spark instead of pandas; a sketch with dask follows below.
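
A minimal dask sketch, assuming the same pipe-delimited layout as the question (df_srcString, df_fldHeads and df_cols are the poster's globals; the blocksize value is illustrative):

import dask.dataframe as dd

ddf = dd.read_csv(df_srcString, sep='|', header=None,
                  names=df_fldHeads.split(','), usecols=df_cols,
                  dtype=str, blocksize='64MB')

# dask evaluates lazily, partition by partition, so the whole file
# never has to fit in memory at once.
print(len(ddf))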

Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html

a few comments about your code:

  • all those global variables are scaring me! what's wrong with passing parameters and returning state?
  • you're not using any functionality from Pandas; creating a dataframe just to do a dumb iteration over its rows causes lots of unnecessary work
  • the standard csv module (which can be used with delimiter='|') provides a much closer fit if this really is the best way to do it; see the sketch after this list
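
a minimal sketch of that suggestion (parse_row is a stand-in for the poster's parseRow, receiving a list of already-split fields):

import csv

def read_file_csv(path, parse_row):
    # csv.reader handles splitting on '|' and streams the file,
    # so there is no manual readline()/split bookkeeping.
    total = 0
    with open(path, newline='') as f:
        for fields in csv.reader(f, delimiter='|'):
            parse_row(fields)
            total += 1
    return total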

this might be a better question for https://codereview.stackexchange.com/

just playing with the performance of alternative ways of working row-wise. the take-home from the below seems to be that working "row wise" is basically always slow with Pandas.

start by creating a dataframe to test this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)
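
(the timing harness isn't shown; numbers like the ones below could be reproduced with IPython's %timeit or the stdlib timeit, e.g. this illustrative sketch:)

import timeit

import numpy as np
import pandas as pd

def make_df():
    # Same construction step as above, wrapped so timeit can call it.
    df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
    df[1] = df[1].apply(str)
    return df

# repeat/number values are illustrative, not the poster's setup.
best = min(timeit.repeat(make_df, number=10, repeat=3)) / 10
print(f"{best * 1e3:.2f} ms per call")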

this takes 3.65 ms to create a dataframe with int and str columns. next I try the iterrows approach:

tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])

the aggregation is pretty dumb, I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:

tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])

which reduces this down to 408 ms. next I try apply:

def fn(row):
    return row[0] / 1e5 < len(row[1])

sum(df.apply(fn, axis=1))

which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:

sum(df[0] / 1e5 < df[1].apply(len))

which takes 4.15 ms. and another approach that occurred to me:

tot = 0
for a, b in zip(df[0], df[1]):
    tot += a / 1e5 < len(b)

which takes 2.78 ms. while another variant:

tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
    tot += a < len(b)

takes 2.29 ms.
