
How do I select rows based on column values with Pandas?

I have a function that, for every row, gets all the previous rows based on the values of three columns of the current row. I use two ways of getting the rows I need:

import pandas as pd

df = pd.read_csv("data.csv")

# Way 1
rows = df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]

# Way 2
cols = ["colA", "colB", "colC"]
group_by_cols = df.groupby(cols)
rows = group_by_cols.get_group((1.2, 5, 2.5))

Using %timeit in an IPython Notebook:

# Way 1
100 loops, best of 3: 16.6 ms per loop

# Way 2
100 loops, best of 3: 3.42 ms per loop
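For readers who want to reproduce this kind of comparison outside a notebook, here is a minimal sketch using the standard-library timeit module. The synthetic frame stands in for data.csv, and the function names (way1, way2_full, way2_lookup_only) are hypothetical. It times Way 2 both with and without the cost of building the groupby object, since it is not clear from the figures above which of the two was measured.

```python
import timeit

import pandas as pd

# Synthetic stand-in for data.csv (assumed values, chosen so the key exists).
df = pd.DataFrame({
    "colA": [1.0, 1.2, 1.5] * 10_000,
    "colB": [1.0, 5.0, 1.5] * 10_000,
    "colC": [1.0, 2.5, 99.0] * 10_000,
})
cols = ["colA", "colB", "colC"]
key = (1.2, 5.0, 2.5)


def way1():
    # Boolean mask over the three columns.
    return df[(df["colA"] == key[0]) & (df["colB"] == key[1]) & (df["colC"] == key[2])]


def way2_full():
    # Builds the groupby object on every call, then looks up one group.
    return df.groupby(cols).get_group(key)


# Build the groupby object once so that only the lookup is timed below.
grouped = df.groupby(cols)


def way2_lookup_only():
    return grouped.get_group(key)


for fn in (way1, way2_full, way2_lookup_only):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

The absolute numbers depend on the machine, the pandas version, and the data, so only the relative ordering is worth reading from the output.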

I am trying to find a way to reduce the time this takes. I have read about using Cython to improve performance, but I have never used it.

The values in the columns I use are floats, if that helps.

Update:

The comments suggested using HDF instead of CSV.

I am not familiar with it, so I would like to ask: if I created an HDF file with a table called "data" containing all my data, plus tables containing the rows that match each combination of the parameters I want, and then loaded the table needed for each row, would that be faster than Way 2?

I tried using HDF with pandas, but my data contains Unicode text, so that's a problem.

Both of those methods are already pretty well optimized; I'd be surprised if you gained much by going to Cython.

But there is a .query method that should help performance, assuming your frame is somewhat large. See the docs for more, or below for an example.

df = pd.DataFrame({'A':[1.0, 1.2, 1.5] * 250000, 'B':[1.0, 5.0, 1.5] * 250000, 'C':[1.0, 2.5, 99.0] * 250000})

In [5]: %timeit rows = df[(df["A"] == 1.2) & (df["B"] == 5) & (df["C"] == 2.5)]
10 loops, best of 3: 33.4 ms per loop

In [6]: %%timeit
   ...: cols = ["A", "B", "C"]
   ...: group_by_cols = df.groupby(cols)
   ...: rows = group_by_cols.get_group((1.2, 5, 2.5))
   ...: 
10 loops, best of 3: 140 ms per loop


In [8]: %timeit rows = df.query('A == 1.2 and B == 5 and C == 2.5')
100 loops, best of 3: 14.8 ms per loop
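Since the question calls this selection once per row, the literal values in the query string would normally come from the current row. A minimal sketch of that, assuming the same column names as the example above: pandas lets a query expression refer to local Python variables with the @ prefix. The helper name matching_rows is hypothetical, and the frame is a smaller version of the one used in the timings.

```python
import pandas as pd

# Smaller version of the example frame above.
df = pd.DataFrame({
    "A": [1.0, 1.2, 1.5] * 1000,
    "B": [1.0, 5.0, 1.5] * 1000,
    "C": [1.0, 2.5, 99.0] * 1000,
})


def matching_rows(frame, a, b, c):
    """Hypothetical helper: rows whose A, B and C equal the given values."""
    # '@a' refers to the local variable a, so no string formatting is needed.
    return frame.query("A == @a and B == @b and C == @c")


# Take the key from one row, as the per-row function in the question would.
current = df.iloc[1]
rows_query = matching_rows(df, current["A"], current["B"], current["C"])

# The same selection written as a boolean mask, for comparison.
rows_mask = df[(df["A"] == 1.2) & (df["B"] == 5) & (df["C"] == 2.5)]

print(len(rows_query), len(rows_mask))
```

Both selections should pick out the same rows; whether .query is actually faster for a given frame is worth measuring, because its advantage depends on the frame size and on whether numexpr is installed.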

