
Getting rows from a data frame which satisfy a condition in pandas

I have a data frame and a range of numbers. I want to find the rows where the values in a particular column lie in that range.

This seems like a trivial job. I tried the techniques given here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-boolean

I took a simple example:

In [6]: df_s
Out[6]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2
7     8      3
8     9      3

In [7]: df_s[df_s.time.isin(range(1,8))]
Out[7]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2

Then I tried a sample from the data set I am working with, which has timestamp and value as columns:

In [8]: df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

In [9]: df_s
Out[9]: 
            time  value
0  1379945743841      3
1  1379945743842      4
2  1379945743843      3
3  1379945743844      4
4  1379945743845      3
5  1379945743846      2
6  1379945743847      2
7  1379945743848      3
8  1379945743849      3

In [10]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[10]: 
Empty DataFrame
Columns: [time, value]
Index: []

Why doesn't the same technique work in this case? What am I doing wrong?

I tried another approach:

In [11]: df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-45c44def41b4> in <module>()
----> 1 df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I tried a slightly more complex approach:

In [13]: df_s.ix[[idx for idx in df_s.index if df_s.ix[idx]['time'] in range(1379945743843, 1379945743845)]]
Out[13]: 
            time  value
2  1379945743843      3
3  1379945743844      4

This gives the desired result, but it takes far too long on my original data set. That data set has 209920 rows, and I expect the number of rows to grow when I actually put my code to the test.

Can anyone point me towards the right approach?

I am using Python 2.7.3 and pandas 0.12.0.

Update:

Jeff's answer worked.

But I find the isin approach simpler, more intuitive and less cluttered. Please comment if anyone has any idea why it failed.

Thanks!

Try it this way:

In [7]:  df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

Convert your millisecond epoch timestamps to actual times:

In [8]: df_s['time'] = pd.to_datetime(df_s['time'],unit='ms')

In [9]: df_s
Out[9]: 
                        time  value
0 2013-09-23 14:15:43.841000      3
1 2013-09-23 14:15:43.842000      4
2 2013-09-23 14:15:43.843000      3
3 2013-09-23 14:15:43.844000      4
4 2013-09-23 14:15:43.845000      3
5 2013-09-23 14:15:43.846000      2
6 2013-09-23 14:15:43.847000      2
7 2013-09-23 14:15:43.848000      3
8 2013-09-23 14:15:43.849000      3

These are your converted endpoints:

In [10]: pd.to_datetime(1379945743843,unit='ms')
Out[10]: Timestamp('2013-09-23 14:15:43.843000', tz=None)

In [11]: pd.to_datetime(1379945743845,unit='ms')
Out[11]: Timestamp('2013-09-23 14:15:43.845000', tz=None)

In [12]: df = df_s.set_index('time')
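The answer sets time as the index but does not use the result afterwards. One thing a sorted datetime index allows, sketched here as a guess at the intent rather than something the answer states, is label-based slicing with .loc, which includes both endpoints. The names start, end and sliced are illustrative only.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})
df_s['time'] = pd.to_datetime(df_s['time'], unit='ms')
df = df_s.set_index('time')

# Hypothetical endpoints; .loc slicing on a sorted index keeps both ends.
start = pd.to_datetime(1379945743843, unit='ms')
end = pd.to_datetime(1379945743844, unit='ms')
sliced = df.loc[start:end]
print(sliced)
```

Because both ends are included, the upper endpoint here is ...844 rather than ...845, which matches what range(1379945743843, 1379945743845) selects.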

You must use & rather than and, and wrap each comparison in parentheses (& binds more tightly than the comparison operators):

In [13]: df_s[(df_s.time>pd.to_datetime(1379945743843,unit='ms')) & (df_s.time<pd.to_datetime(1379945743845,unit='ms'))]
Out[13]: 
                    time  value
3 2013-09-23 14:15:43.844000      4
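The same mask also works on the raw integer column, without converting to datetimes. The sketch below assumes the goal is to reproduce range(1379945743843, 1379945743845), so the lower bound uses >= instead of the strict > shown above; lo, hi, mask and result are illustrative names. It also explains the earlier ValueError: Python's and asks each Series for a single truth value, whereas & combines the two boolean Series element by element.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})
lo, hi = 1379945743843, 1379945743845

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & has higher precedence than >= and <.
mask = (df_s['time'] >= lo) & (df_s['time'] < hi)
result = df_s[mask]

# Series.between includes both ends by default, so stop at hi - 1.
result_between = df_s[df_s['time'].between(lo, hi - 1)]
print(result)
```

Both forms are vectorized, so they avoid the per-row Python loop that was slow on the 209920-row data set.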

In 0.13 (coming soon), you will be able to do this:

In [7]: df_s.query('"2013-09-23 14:15:43.843" < time < "2013-09-23 14:15:43.845"')
Out[7]: 
                    time  value
3 2013-09-23 14:15:43.844000      4
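On pandas 0.13 or later, query can also be applied to the unconverted integer column. This is a minimal sketch, assuming the half-open interval from the question is what is wanted; the name result is illustrative.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Chained comparisons are allowed inside the query string.
result = df_s.query('1379945743843 <= time < 1379945743845')
print(result)
```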

Your isin approach DOES work. Not sure why it's not working for you.

In [11]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[11]: 
            time  value
2  1379945743843      3
3  1379945743844      4
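The thread does not establish why isin returned an empty frame for the asker. One way to narrow it down, offered as a diagnostic sketch rather than an explanation, is to check the column dtype and pass the candidate values as an explicitly typed int64 array, which rules out a type mismatch between the column and the values. The name wanted is illustrative.

```python
import numpy as np
import pandas as pd

df_s = pd.DataFrame({
    'time': np.arange(1379945743841, 1379945743850, dtype=np.int64),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Inspect the dtype first; an object column would hint at a type mismatch.
print(df_s['time'].dtype)

# Candidate values with an explicit 64-bit integer dtype.
wanted = np.arange(1379945743843, 1379945743845, dtype=np.int64)
result = df_s[df_s['time'].isin(wanted)]
print(result)
```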
