
Does indexing make slicing a pandas DataFrame faster?

I have a pandas dataframe holding more than a million records. One of its columns is a datetime. A sample of my data looks like this:

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...

I need to efficiently retrieve the records that fall within a specific time period. The following naive approach is very time-consuming.

new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]

I know that in a DBMS such as MySQL, indexing the time field makes it efficient to retrieve records for a given time period.

My questions are:

  1. Does indexing in pandas, such as df.index = df.time, make the slicing process faster?
  2. If the answer to Q1 is 'No', what is the common efficient way to get the records within a specific time period in pandas?

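Let's create a dataframe with 1 million rows and time the performance of several approaches. The index is a pandas DatetimeIndex, built here with pd.date_range because recent pandas releases no longer accept start, freq and periods in the DatetimeIndex constructor.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 3),
                  columns=list('ABC'),
                  index=pd.date_range(start='2015-1-1', freq='10s', periods=1000000))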

Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):

%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)

%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)

%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)

%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)

Here are the timings with the Datetime index as a column:

df.reset_index(inplace=True)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)

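The label-slice techniques above produce the same dataframe (the boolean masks compare against '2015-3-1' as midnight, so they stop at 2015-03-01 00:00:00 and return fewer rows):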

>>> df2.shape
(250560, 3)
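
One caveat when comparing the label slices with the boolean masks: as far as I understand pandas' partial string indexing, a date string such as '2015-3-1' used as a slice bound covers the whole day, while the same string in a comparison is parsed as the single instant 2015-03-01 00:00:00. The sketch below rebuilds the benchmark frame and compares the two; the variable names (by_slice, by_mask, by_mask_full_day) are mine, and the exclusive '2015-3-2' bound is just one assumed way to make the mask cover the same rows.

```python
import numpy as np
import pandas as pd

# Same layout as the benchmark frame: 1,000,000 rows at 10-second spacing.
idx = pd.date_range(start='2015-1-1', freq='10s', periods=1000000)
df = pd.DataFrame(np.random.randn(1000000, 3), columns=list('ABC'), index=idx)

# Label slice: the end bound '2015-3-1' covers all of March 1st.
by_slice = df.loc['2015-2-1':'2015-3-1']

# Boolean mask: '2015-3-1' is compared as the instant 2015-03-01 00:00:00.
by_mask = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1')]

# Mask with an exclusive upper bound on the next day, intended to match the slice.
by_mask_full_day = df.loc[(df.index >= '2015-2-1') & (df.index < '2015-3-2')]

print(by_slice.shape, by_mask.shape, by_mask_full_day.shape)
print(by_slice.index[-1], by_mask.index[-1])
```

If the two selections need to agree, it is probably safest to spell out full timestamps for both bounds rather than rely on partial date strings.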

It appears that either of the first two methods is the best choice in this situation, and the fourth method works just as well with the latest version of Pandas.
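
To repeat the comparison on whatever pandas version you have installed, a rough sketch using the standard-library timeit module (instead of IPython's %timeit) could look like the following. The labels, the candidates dict and the repeat counts are my own choices, and .ix is left out because current pandas releases no longer provide it. Absolute numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000000
df = pd.DataFrame(np.random.randn(n, 3), columns=list('ABC'),
                  index=pd.date_range(start='2015-1-1', freq='10s', periods=n))
df_col = df.reset_index()  # timestamps become an ordinary column named 'index'

# Selections to time, keyed by a short descriptive label.
candidates = {
    "df[start:end]": lambda: df['2015-2-1':'2015-3-1'],
    "df.loc[start:end]": lambda: df.loc['2015-2-1':'2015-3-1', :],
    "mask on index": lambda: df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :],
    "mask on column": lambda: df_col.loc[(df_col['index'] >= '2015-2-1')
                                         & (df_col['index'] <= '2015-3-1')],
}

results = {}
for label, fn in candidates.items():
    # Best of 3 repeats of 20 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    results[label] = best * 1e3
    print(f"{label:20s} {results[label]:8.3f} ms per call")
```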

I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this.

timedata.txt (extended from your example):

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338

import pandas as pd

df = pd.read_csv('timedata.txt')
df['time'] = pd.to_datetime(df['time'])  # parse the strings into datetimes
df = df.set_index('time')                # use the datetimes as the index
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])

                       x    y    z
time                              
2015-05-01 10:00:03  112  223  334
2015-05-01 10:00:05  112  223  335
2015-05-01 10:00:08  112  223  336
2015-05-01 10:00:13  112  223  337

Note that in the example the slice bounds (10:00:02 and 10:00:14) do not appear in the time column, so this works even when you only know the time interval and not the exact timestamps.
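
For anyone who wants to try this without creating timedata.txt, here is a self-contained sketch that reads the same rows from a string. The io.StringIO wrapper, the parse_dates/index_col arguments and the sort check are my additions. As far as I know, slicing with bounds that are absent from the index relies on the index being sorted, hence the check.

```python
import io

import pandas as pd

# The sample rows from timedata.txt, inlined so the snippet runs without a file.
csv_text = """time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
"""

# Parse the time column and make it the index in one step.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['time'], index_col='time')

# Sort only if needed; the sample data is already in time order.
if not df.index.is_monotonic_increasing:
    df = df.sort_index()

start = pd.Timestamp('2015-05-01 10:00:02')
end = pd.Timestamp('2015-05-01 10:00:14')
subset = df.loc[start:end]  # label slices include both bounds
print(subset)
```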

If your data is sampled at a fixed interval, you can generate the datetime index directly, which may give you more options. I didn't want to assume a fixed interval, so I constructed this for the more general case.
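
As an illustration of the fixed-interval case, here is a small sketch. It assumes, purely hypothetically, that a row arrives exactly every 3 seconds, which is not true of the sample above. With regular spacing the same label slice still works, and the row positions can also be computed arithmetically from the spacing.

```python
import numpy as np
import pandas as pd

# Hypothetical data sampled exactly every 3 seconds (an assumption for illustration).
index = pd.date_range(start='2015-05-01 10:00:00', periods=8, freq='3s')
df = pd.DataFrame({'x': np.arange(8)}, index=index)

# The usual label slice works on the generated index.
subset = df.loc['2015-05-01 10:00:02':'2015-05-01 10:00:14']

# Because the spacing is known, the positions can be derived by arithmetic:
# round the start up and the end down to the nearest sample.
step = pd.Timedelta(seconds=3)
first = int(np.ceil((pd.Timestamp('2015-05-01 10:00:02') - index[0]) / step))
last = int(np.floor((pd.Timestamp('2015-05-01 10:00:14') - index[0]) / step))
by_position = df.iloc[first:last + 1]

print(subset)
print(by_position)
```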
