I have a function that, for every row, fetches all the previous rows whose values in three columns match those of the current row. I use two ways to get the rows I need:
import pandas as pd
df = pd.read_csv("data.csv")
# Way 1
rows = df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]
# Way 2
cols = ["colA", "colB", "colC"]
group_by_cols = df.groupby(cols)
rows = group_by_cols.get_group((1.2, 5, 2.5))
Using %timeit in an IPython Notebook:
# Way 1
100 loops, best of 3: 16.6 ms per loop
# Way 2
100 loops, best of 3: 3.42 ms per loop
I am trying to find a way to improve the time it takes. I have read about using Cython to enhance the performance, but I have never used it.
The values in the columns I use are floats, if that helps.
Update:
A comment suggested using HDF instead of CSV.
I am not familiar with it, so I would like to ask: if I created an HDF file with a table called "data" containing all my data, plus tables containing the rows that match each combination of the parameters I want, and then loaded the table needed for each row, would that be faster than Way 2?
I tried using HDF with pandas, but there is Unicode text in my data, so that's a problem.
Both of those methods are already pretty well optimized; I'd be surprised if you gained much by going to Cython.
But there is a .query method that should help performance, assuming your frame is somewhat large. See the docs for more, or below for an example.
df = pd.DataFrame({'A':[1.0, 1.2, 1.5] * 250000, 'B':[1.0, 5.0, 1.5] * 250000, 'C':[1.0, 2.5, 99.0] * 250000})
In [5]: %timeit rows = df[(df["A"] == 1.2) & (df["B"] == 5) & (df["C"] == 2.5)]
10 loops, best of 3: 33.4 ms per loop
In [6]: %%timeit
...: cols = ["A", "B", "C"]
...: group_by_cols = df.groupby(cols)
...: rows = group_by_cols.get_group((1.2, 5, 2.5))
...:
10 loops, best of 3: 140 ms per loop
In [8]: %timeit rows = df.query('A == 1.2 and B == 5 and C == 2.5')
100 loops, best of 3: 14.8 ms per loop
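Since the question needs this lookup for the values of each current row rather than for hard-coded constants, here is a minimal sketch of passing variables into .query with the @ prefix. The helper name rows_matching and the smaller frame are assumptions made for illustration, not part of the original code, and no timing is implied.

```python
import pandas as pd

# Small stand-in for the frame used in the timings above (same columns, fewer rows).
df = pd.DataFrame({
    "A": [1.0, 1.2, 1.5] * 1000,
    "B": [1.0, 5.0, 1.5] * 1000,
    "C": [1.0, 2.5, 99.0] * 1000,
})


def rows_matching(frame, a, b, c):
    """Hypothetical helper: select the rows whose A, B and C equal the given values."""
    # Inside the query string, '@name' refers to a local Python variable.
    return frame.query("A == @a and B == @b and C == @c")


# Look up the rows that share the three values of one particular row.
current = df.iloc[1]
via_query = rows_matching(df, current["A"], current["B"], current["C"])

# The plain boolean mask should select the same rows.
via_mask = df[
    (df["A"] == current["A"])
    & (df["B"] == current["B"])
    & (df["C"] == current["C"])
]
```

Whether this beats the boolean mask on your data is worth measuring with %timeit, since the benefit of .query depends on the size of the frame.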