Why is my Python DataFrame performing so slowly?

I'm building an application that provides some very simple analysis on large datasets. These datasets are delivered as CSV files of 10+ million rows with about 30 columns (I don't need many of the columns).

Logic tells me that loading the entire file into a DataFrame should make it faster to access. But my computer says no.

I've tried loading the data in batches, as well as loading the entire file and then performing the functions in batches.

But the end result is the same: it takes more than 10 times as long to perform the same process as a simple file-read approach.

Here is the DataFrame version:

import datetime
import pandas as pd

def runProcess():
    global batchSize
    batchCount = 10
    if rowLimit < 0:
        with open(df_srcString) as f:
            rowCount = sum(1 for line in f)
        if batchSize < 0:  # a negative batchSize signals a single whole-file read
            batchSize = batchSize * -1
            runProc = readFileDf
        else:
            runProc = readFileDfBatch
        batchCount = int(rowCount / batchSize) + 1
    else:
        runProc = readFileDfBatch  # default reader for the row-limited case
        batchCount = int(rowLimit / batchSize) + 1
    for i in range(batchCount):
        result = runProc(batchSize, i)
        print(result)

def readFileDfBatch(batch, batchNo):
    sCount = 0
    lCount = 0
    jobStartTime = datetime.datetime.now()
    eof = False
    totalRowCount = 0

    startRow = batch * batchNo
    df_wf = pd.read_csv(df_srcString, sep='|', header=None,
                        names=df_fldHeads.split(','), usecols=df_cols,
                        dtype=str, nrows=batch, skiprows=startRow)
    for index, row in df_wf.iterrows():
        result = parseDfRow(row)
        totalRowCount = totalRowCount + 1
        if result == 1:
            sCount = sCount + 1
        elif result == 2:
            lCount = lCount + 1
    eof = batch > len(df_wf)
    if rowLimit >= 0:
        eof = (batch * batchNo >= rowLimit)
    jobEndTime = datetime.datetime.now()
    runTime = jobEndTime - jobStartTime
    return [batchNo, sCount, lCount, totalRowCount, runTime]

def parseDfRow(row):
    # df_cols = ['ColumnA','ColumnB','ColumnC','ColumnD','ColumnE','ColumnF']
    status = 0
    s2 = getDate(row['ColumnB'])
    l2 = getDate(row['ColumnD'])
    gDate = datetime.date(1970,1,1)
    r1 = datetime.date(int(row['ColumnE'][1:5]),12,31)
    r2 = row['ColumnF']
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    if s2 > lastSeen:
        status = 1
    elif l2 > lastSeen:
        status = 2
    return status

And here is the simple file reader version:

def readFileStd(rows, batch):
    print("Starting read: ")
    batchNo = 1
    global targetFile
    global totalCount
    global sCount
    global lCount
    targetFile = open(df_srcString, "r")
    eof = False
    while not eof:
        batchStartTime = datetime.datetime.now()
        eof = readBatch(batch)
        batchEndTime = datetime.datetime.now()
        runTime = batchEndTime - batchStartTime
        if rows > 0 and totalCount >= rows: break
        batchNo = batchNo + 1
    targetFile.close()
    return [batchNo, sCount, lCount, totalCount, runTime]

def readBatch(batch):
    global targetFile
    global totalCount
    rowNo = 1
    rowStr = targetFile.readline()
    while rowStr:
        parseRow(rowStr)
        totalCount = totalCount + 1
        if rowNo == batch: 
            return False
        rowStr = targetFile.readline()
        rowNo = rowNo + 1
    return True

def parseRow(rowData):
    global sCount
    global lCount
    rd = rowData.split('|')
    s2 = getDate(rd[3])
    l2 = getDate(rd[5])
    gDate = datetime.date(1970,1,1)
    r1 = datetime.date(int(rd[23][1:5]),12,31)
    r2 = rd[24]
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    status = False
    if s2 > lastSeen:
        sCount = sCount + 1
        status = True
        gDate = s2
    elif l2 > lastSeen:
        lCount = lCount + 1
        gDate = s2

Am I doing something wrong?

iterrows doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.

Replace for index, row in df_wf.iterrows(): with df_wf.apply(something, axis=1), where something is a function that encapsulates the logic you needed from iterrows and uses numpy vectorized operations.
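
Better still, the whole per-row comparison can be expressed column-wise. Below is a minimal sketch of what a vectorized parseDfRow could look like; it assumes getDate and getLastDate (not shown in the question) essentially parse date strings, so pd.to_datetime stands in for both, and it reuses the question's column names:

import pandas as pd

# vectorized stand-in for the per-row parseDfRow logic; pd.to_datetime is
# an assumed replacement for the question's getDate/getLastDate helpers
s2 = pd.to_datetime(df_wf['ColumnB'], errors='coerce')
l2 = pd.to_datetime(df_wf['ColumnD'], errors='coerce')
# year is characters 1-4 of ColumnE, as in the original row['ColumnE'][1:5]
r1 = pd.to_datetime(df_wf['ColumnE'].str[1:5] + '-12-31', errors='coerce')
r2 = pd.to_datetime(df_wf['ColumnF'], errors='coerce')
# use ColumnF when it holds a value, otherwise fall back to the year-end date
lastSeen = r2.where(df_wf['ColumnF'].str.len() > 1, r1)

sCount = int((s2 > lastSeen).sum())
lCount = int((~(s2 > lastSeen) & (l2 > lastSeen)).sum())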

Also, if your df doesn't fit in memory and you need to read it in batches, consider using dask or spark instead of pandas.
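
For example, pandas can stream the file in fixed-size chunks via read_csv's chunksize parameter, and dask offers a near-identical API that partitions and parallelizes the work. A sketch reusing the question's variable names (process_chunk is a hypothetical per-chunk function):

import pandas as pd
import dask.dataframe as dd

# pandas: iterate over the file in chunks instead of loading it whole
for chunk in pd.read_csv(df_srcString, sep='|', header=None,
                         names=df_fldHeads.split(','), usecols=df_cols,
                         dtype=str, chunksize=100_000):
    process_chunk(chunk)  # hypothetical per-chunk function

# dask: same read_csv signature, but the work is split across partitions
ddf = dd.read_csv(df_srcString, sep='|', header=None,
                  names=df_fldHeads.split(','), usecols=df_cols, dtype=str)
ddf.map_partitions(process_chunk).compute()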

Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html

a few comments about your code:

  • all those global variables are scaring me! what's wrong with passing parameters and returning state?
  • you're not using any functionality from Pandas; creating a dataframe just to do a dumb iteration over the rows forces it to do lots of unnecessary work
  • the standard csv module (which can be used with delimiter='|') provides a much closer interface if row-by-row processing really is the best way to do this (see the sketch after this list)
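
A minimal sketch of that csv-module approach, reusing the question's getDate helper and field positions:

import csv

sCount = lCount = 0
with open(df_srcString, newline='') as f:
    for rd in csv.reader(f, delimiter='|'):
        # same field positions as the question's parseRow
        s2 = getDate(rd[3])
        l2 = getDate(rd[5])
        # ... remainder of the per-row comparison logic goes here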

this might be a better question for https://codereview.stackexchange.com/

just playing with the performance of alternative ways of working row-wise. the take-home from the below seems to be that working "row wise" is basically always slow with Pandas

start by creating a dataframe to test this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)

this takes 3.65 ms to create a dataframe with int and str columns. next I try the iterrows approach:

tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])

the aggregation is pretty dumb; I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:

tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])

which reduces this down to 408 ms. next I try apply:

def fn(row):
    return row[0] / 1e5 < len(row[1])

sum(df.apply(fn, axis=1))

which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:

sum(df[0] / 1e5 < df[1].apply(len))

which takes 4.15 ms. and another approach that occurred to me:

tot = 0
for a, b in zip(df[0], df[1]):
    tot += a / 1e5 < len(b)

which takes 2.78 ms. while another variant:

tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
    tot += a < len(b)

takes 2.29 ms.
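
for completeness, the aggregation can also stay entirely inside pandas by swapping the apply(len) for the vectorized .str.len() accessor. a sketch (I haven't timed this variant, but it removes the Python-level call per row):

tot = int((df[0] / 1e5 < df[1].str.len()).sum())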
