Fast pandas.DataFrame initialization

Question

What is an efficient way to get the following pandas DataFrame? (Update: numbers change each time)

   alpha  beta  gamma
0    1.5   2.5    3.5

[1 rows x 3 columns]

Motivation

I added a pandas.DataFrame API to some of my methods to be able to do calculations in batches.

When replicating some of my test cases for the new API, the execution time of my test benches rose from 200 ms to over 8 seconds. Doing a profile run, I noticed that the main cause is the creation of 20k pandas.DataFrame objects.
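For reference, a profile run like that can be reproduced with a sketch along these lines (the run_batch loop below is a hypothetical stand-in for the real test bench, which is not shown here):

import cProfile
import pandas as pd

def run_batch(n=20_000):
    # Stand-in for the real test bench: build one small DataFrame per iteration.
    for _ in range(n):
        pd.DataFrame({'alpha': 1.5, 'beta': 2.5, 'gamma': 3.5}, [0])

# Sort by cumulative time; DataFrame construction dominates the output.
cProfile.run('run_batch()', sort='cumulative')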

See the comparison:

In [1]: import pandas as pd

In [2]: timeit pd.DataFrame({'alpha': 1.5, 'beta': 2.5, 'gamma': 3.5}, [0])
1000 loops, best of 3: 405 us per loop

In [3]: timeit {'alpha': 1.5, 'beta': 2.5, 'gamma': 3.5}
1000000 loops, best of 3: 200 ns per loop

It seems that creating a DataFrame object is about 2000 times slower than a lower-level structure. I tried to optimize it, but this is as fast as I got:

In [4]: import numpy as np

In [5]: timeit pd.DataFrame(np.array([[1.5, 2.5, 3.5]]), columns=['alpha', 'beta', 'gamma'])
1000 loops, best of 3: 144 us per loop

This is still about 720 times slower than the dict. Is it possible to go faster? Creating a NumPy array, for example, is only about 10 times slower:

In [6]: timeit np.array([[1.5, 2.5, 3.5]])
100000 loops, best of 3: 1.99 us per loop

Answer

You could keep a global DataFrame for your tests and just do df = global_df.copy(). For example:

In [1]: global_df = pd.DataFrame({'alpha': 1.5, 'beta': 2.5, 'gamma': 3.5}, [0])

In [2]: timeit global_df.copy()
10000 loops, best of 3: 20.2 us per loop
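In a test helper this can be wrapped so the numbers still change on every call, which is what the question's update asks for. A minimal sketch, assuming hypothetical names _TEMPLATE and make_frame that are not part of the original code:

import pandas as pd

# Template frame built once at module import time.
_TEMPLATE = pd.DataFrame({'alpha': 0.0, 'beta': 0.0, 'gamma': 0.0}, [0])

def make_frame(alpha, beta, gamma):
    # Copy the pre-built frame and overwrite its single row,
    # so the full DataFrame constructor is not paid on every call.
    df = _TEMPLATE.copy()
    df.iloc[0] = [alpha, beta, gamma]
    return df

make_frame(1.5, 2.5, 3.5)   # same shape and columns as the frame in the question

Per the timings above, the copy takes roughly 20 us versus roughly 400 us for building the frame from a dict each time.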
