
Why is the pandas.read_fwf function so slow compared to some python alternatives?

I have an .rpt file, i.e. a fixed-width file, and therefore I thought: great, I can use pandas and its built-in function to read it. The file itself is quite large, several GB with around 50 million rows, so efficiency is important here.

So I started like this:

import pandas as pd
import time

t=time.time()
cnt=0
# chunksize=1 makes read_fwf yield a one-row DataFrame on every iteration
for line in pd.read_fwf("test.rpt", skiprows=[1], encoding="utf-8-sig", chunksize=1):
    cnt=cnt+1
    if cnt>100000:
        break
print(time.time()-t)

So running through the first 100,000 lines took 113 seconds on my computer; extrapolated to all 50 million rows that is 500 × 113 s ≈ 16 hours, which is far too long.

So I looked for an alternative and thought maybe I could just open the file normally in Python and cut each line into the fixed-length pieces myself, so that is what I did:

import time

t=time.time()    
with open('test.rpt', 'r', encoding="utf-8-sig") as testfile:
    for i, line in enumerate(testfile):
        # slice each line at the fixed column boundaries and strip the padding
        a=line[:20].strip()
        b=line[20:40].strip()
        c=line[40:].strip()
        if i>100000:           
            break
print(time.time()-t)

The time it took: 0.093 s, so around 1200x faster. Now I was really wondering how this is possible, and I thought maybe it is because read_fwf needs to figure out the columns itself, so I added colspecs=[(0,20),(20,40),(40,60)], but it did not make it much faster.
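
For reference, the colspecs attempt was the same loop as above, roughly:

import pandas as pd
import time

t=time.time()
cnt=0
# same loop as before, but with the column boundaries given explicitly
for line in pd.read_fwf("test.rpt", skiprows=[1], encoding="utf-8-sig",
                        colspecs=[(0,20),(20,40),(40,60)], chunksize=1):
    cnt=cnt+1
    if cnt>100000:
        break
print(time.time()-t)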

Now I am of course wondering why the difference is so big. Would it be the same if I used a pandas DataFrame instead of a numpy array? I.e., is a numpy array also much faster than a pandas DataFrame for things like searching and assigning values?

Many thanks


So I did some further testing on (1) assigning values and (2) searching for values.

First assigning values:

import time
import pandas as pd
import numpy as np

n=10000
d=100
rep=100000

#generating random indices
ind1=np.random.randint(0, n, rep)
ind2=np.random.randint(0, d, rep)

# plain numpy array
df=np.zeros((n,d))

t=time.time()
for i in range(rep):
    df[ind1[i],ind2[i]]=i
print(time.time()-t)


# the same data as a pandas DataFrame, assigned via chained indexing below
df = pd.DataFrame(np.zeros((n, d)))

t=time.time()
for i in range(rep):
    df.iloc[ind1[i]][ind2[i]]=i   # chained indexing: builds an intermediate row Series each time
print(time.time()-t)

With a numpy array, assigning the values takes only 0.04 s, whereas pandas needs 8.34 s, so numpy is again around 200x faster. I also tried df.iat[ind1[i],ind2[i]]=i, which was a drastic improvement (0.9 s) but still around 20 times slower than numpy.
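
The .iat variant was the same timing loop (reusing n, d, rep, ind1 and ind2 from above) with the chained indexing replaced by a single scalar indexer:

df = pd.DataFrame(np.zeros((n, d)))

t=time.time()
for i in range(rep):
    # .iat looks up one scalar directly instead of first building a row Series
    df.iat[ind1[i],ind2[i]]=i
print(time.time()-t)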

Now to the second point, searching: since in my case I usually know the column and need to look up a value within that specific column, I created a single column with many rows.

import time
import pandas as pd
import numpy as np

n=10000000
rep=1000
a=np.random.uniform(0, 1, n)
ind=np.random.randint(0,n,rep)


# numpy: scalar lookup followed by a full-array scan with np.where
df=np.array(a)

t=time.time()
for i in range(rep):
    v=df[ind[i]]
    vv=np.where(df == v)
print(time.time()-t)

# pandas: .iat scalar lookup followed by a boolean-mask scan
df=pd.DataFrame(a)

t=time.time()
for i in range(rep):
    v=df.iat[ind[i],0]
    vv=df.loc[df[0] == v]
print(time.time()-t)

Now they are both more or less equally fast.

In the first example, pandas is doing type inference on every row, whereas reading the file with open doesn't incur that cost. You would likely see faster results if you specify the dtypes when you call read_fwf.
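
A minimal sketch of what that could look like, reusing the colspecs from the question; the chunksize is also raised here, since building a one-row DataFrame per line adds overhead of its own:

import pandas as pd

reader = pd.read_fwf(
    "test.rpt",
    colspecs=[(0, 20), (20, 40), (40, 60)],  # column boundaries from the question
    dtype=str,               # treat every column as a string, so no type inference per chunk
    skiprows=[1],
    encoding="utf-8-sig",
    chunksize=100000,        # read in large chunks instead of one row at a time
)

for chunk in reader:
    pass                     # process each 100,000-row chunk here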

You can also speed up read operations by installing modin and importing it with import modin.pandas as pd.
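
A minimal sketch of the modin variant, assuming modin and one of its execution engines (e.g. Ray) are installed; how much read_fwf itself benefits depends on the modin version and engine:

# the only change compared to plain pandas is the import
import modin.pandas as pd

df = pd.read_fwf("test.rpt", colspecs=[(0, 20), (20, 40), (40, 60)],
                 skiprows=[1], encoding="utf-8-sig")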
