
Speed of pandas df.loc[x, 'column']

I have a pandas DataFrame of about 100 rows, from which I need to select values from a column for a given index in an efficient way. At the moment I am using df.loc[index, 'col'] for this, but this seems to be relatively slow:

df = pd.DataFrame({'col': range(100)}, index=range(100))    
%timeit df.loc[random.randint(0, 99), 'col']
#100000 loops, best of 3: 19.3 µs per loop

What seems to be much faster (by a factor of about 8) is to turn the data frame into a dictionary and then query that:

d = df.to_dict()    
%timeit d['col'][random.randint(0, 99)]
#100000 loops, best of 3: 2.5 µs per loop
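
For reference, here is a small sketch of what df.to_dict() returns, which explains the d['col'][...] access pattern. With the default orientation the result is a dict of columns, each mapping index label to value. The names first_three and col_lookup are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': range(100)}, index=range(100))

# Default orientation: {column name -> {index label -> value}}
d = df.to_dict()
first_three = {k: d['col'][k] for k in (0, 1, 2)}

# A single column can also be converted on its own, which saves one
# dictionary lookup per access: {index label -> value}
col_lookup = df['col'].to_dict()
```

Either form is a snapshot: later changes to df are not reflected in the dict, so it has to be rebuilt after the data frame is modified.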

Is there a way to get similar performance using normal data frame methods, without explicitly creating the dict? Should I be using something other than .loc?

Or is this just a situation where I am better off using this workaround?

If efficiency is a concern, NumPy arrays may be a better choice than a pandas DataFrame. I tried to reproduce your example to compare the two:

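import random
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(100)}, index=range(100))
# Label-based scalar lookup through pandas
print(timeit.timeit('df.loc[random.randint(0, 99), "col"]', number=10000, globals=globals()))

# np.array(df) is a 2-D array of shape (100, 1). Note that the expression below
# indexes twice, so it does more work than a single lookup such as ds_numpy[i, 0].
ds_numpy = np.array(df)
print(timeit.timeit('ds_numpy[ds_numpy[random.randint(0, 99)]]', number=10000, globals=globals()))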

Results (total seconds for 10,000 lookups; pandas first, then NumPy):

$ python test_pandas_vs_numpy.py 
0.1583892970229499
0.05918855100753717

In this scenario, using a NumPy array instead of a pandas DataFrame looks like an advantage in terms of performance: it is roughly 2.7 times faster here.
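
One thing worth noting about the NumPy timing: np.array(df) is two-dimensional, and ds_numpy[ds_numpy[i]] indexes twice. Below is a minimal sketch of the difference, assuming the goal is a single scalar lookup by position. The names col_values, double_indexed, scalar and the position i are illustrative and not part of the original benchmark.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(100)}, index=range(100))

ds_numpy = np.array(df)          # shape (100, 1): one row per index label
i = 42                           # illustrative position

# Double indexing: ds_numpy[i] is a length-1 array, and using that array as
# an index triggers fancy indexing, so the result is a (1, 1) array.
double_indexed = ds_numpy[ds_numpy[i]]

# Single positional lookup on a 1-D array of the column returns a scalar.
col_values = df['col'].to_numpy()
scalar = col_values[i]
```

Positional indexing only matches the label-based df.loc lookup here because the index is 0..99. With any other index, a separate label-to-position mapping would be needed.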


A dict does indeed seem to be the fastest option:

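# Assumes df, np, timeit and random are already defined as in the earlier snippets
df_dict = df.to_dict()
df_numpy = np.array(df)
print(timeit.timeit("df.loc[random.randint(0, 99), 'col']", number=100000, globals=globals()))
# df.get_value was deprecated in pandas 0.21 and removed in 1.0; on newer versions use df.at[row, 'col']
print(timeit.timeit("df.get_value(random.randint(0, 99), 'col')", number=100000, globals=globals()))
# Note: this indexes the (100, 1) array twice rather than doing a single lookup
print(timeit.timeit('df_numpy[df_numpy[random.randint(0, 99)]]', number=100000, globals=globals()))
print(timeit.timeit("df_dict['col'][random.randint(0, 99)]", number=100000, globals=globals()))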

Result (total seconds for 100,000 lookups, in the order printed above):

4.859706375747919
1.8850274719297886
1.4855970665812492
0.6550335008651018
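
On current pandas, where get_value is no longer available, a comparison along the same lines could be sketched as follows. It uses df.at, the label-based scalar accessor, alongside .loc, a 1-D NumPy array and a per-column dict. The helper name time_lookups and its parameters are hypothetical, and no timings are quoted because they depend on the machine and library versions.

```python
import random
import timeit

import pandas as pd


def time_lookups(n_rows=100, number=10_000):
    """Time several ways of reading one value; returns total seconds per method."""
    df = pd.DataFrame({'col': range(n_rows)}, index=range(n_rows))
    col_values = df['col'].to_numpy()   # 1-D array, positional access
    col_dict = df['col'].to_dict()      # {index label -> value}

    def key():
        # Random label; it doubles as a position because the index is 0..n_rows-1
        return random.randint(0, n_rows - 1)

    candidates = {
        'loc': lambda: df.loc[key(), 'col'],
        'at': lambda: df.at[key(), 'col'],
        'numpy': lambda: col_values[key()],
        'dict': lambda: col_dict[key()],
    }
    return {name: timeit.timeit(fn, number=number) for name, fn in candidates.items()}


if __name__ == '__main__':
    for name, seconds in time_lookups().items():
        print(f'{name:>5}: {seconds:.4f} s')
```

Each timing includes the cost of random.randint and the lambda call, so the numbers are only meaningful relative to one another.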
