How to filter a DataFrame based on a criterion using .shift()
I am trying to remove any rows in a dataframe from the first non-sequential 'Period' onwards within each groupby group. I would rather avoid looping if possible.
import pandas as pd
data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}
df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)
OUTPUT:
Country Product Period
0 DE Blue 1
1 DE Blue 2
2 DE Blue 3
3 DE Blue 5
4 DE Blue 6
5 US Green 1
6 US Green 2
7 US Green 4
8 US Green 5
9 US Green 6
So, for example, the final output I would like is below:
Country Product Period
0 DE Blue 1
1 DE Blue 2
2 DE Blue 3
5 US Green 1
6 US Green 2
The way I was attempting to do this is below, to give you an idea, but it has many mistakes. You can probably still see the logic of what I am trying to do.
df = df.groupby(['Country','Product']).apply(lambda x: x[x.Period.shift(x.Period - 1) == 1]).reset_index(drop=True)
The tricky part is that, rather than just using .shift(1) or similar, I am trying to pass a value into .shift(), i.e. if that row's Period is 5 then I want to say .shift(5-1) so it shifts up 4 places and checks the value of that Period. If it equals 1 then it is still sequential. In this case it would go into NaN territory, I guess.
Instead of using shift(), you could use diff() and cumsum():
result = grouped['Period'].apply(
lambda x: x.loc[(x.diff() > 1).cumsum() == 0])
import pandas as pd
data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}
df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)
grouped = df.groupby(['Country','Product'])
result = grouped['Period'].apply(
lambda x: x.loc[(x.diff() > 1).cumsum() == 0])
result.name = 'Period'
result = result.reset_index(['Country', 'Product'])
print(result)
yields
Country Product Period
0 DE Blue 1
1 DE Blue 2
2 DE Blue 3
5 US Green 1
6 US Green 2
Explanation:
A sequential run of numbers has adjacent diffs of 1. For example, if we treat df['Period'] for the moment as all one group,
In [41]: df['Period'].diff()
Out[41]:
0 NaN
1 1
2 1
3 2
4 1
5 -5
6 1
7 2
8 1
9 1
Name: Period, dtype: float64
In [42]: df['Period'].diff() > 1
Out[42]:
0 False
1 False
2 False
3 True <--- We want to cut off before here
4 False
5 False
6 False
7 True
8 False
9 False
Name: Period, dtype: bool
To find the cutoff location (the first True in df['Period'].diff() > 1) we can use cumsum() and select the rows where the cumulative sum equals 0:
In [43]: (df['Period'].diff() > 1).cumsum()
Out[43]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: Period, dtype: int64
In [44]: (df['Period'].diff() > 1).cumsum() == 0
Out[44]:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: Period, dtype: bool
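The walkthrough above treats the whole column as one group. As a rough sketch of the same diff/cumsum idea applied per group, one could build a boolean mask aligned with df and filter with it, which keeps every column of the original frame. The names keys, gap, gaps_so_far and filtered are only illustrative, and this assumes a reasonably recent pandas on Python 3.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['DE'] * 5 + ['US'] * 5,
    'Product': ['Blue'] * 5 + ['Green'] * 5,
    'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6],
})
keys = [df['Country'], df['Product']]

# True where a row's Period jumps by more than 1 within its own group.
# The first row of each group has a NaN diff, which compares as False.
gap = df.groupby(['Country', 'Product'])['Period'].diff() > 1

# Running count of gaps per group; 0 means no gap has been seen yet.
gaps_so_far = gap.astype(int).groupby(keys).cumsum()

# Keep only the rows that come before the first gap of their group.
filtered = df[gaps_so_far == 0]
print(filtered)
```

Because the mask shares df's index, no reset_index step is needed afterwards.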
Taking diff() and cumsum() might seem wasteful because these operations may compute a lot of values that are not needed, especially if x is very large and the first sequential run is very short.
Despite the wastefulness, the speed gained by calling NumPy or Pandas methods (implemented in C/Cython/C++ or Fortran) usually beats a less wasteful algorithm coded in pure Python.
You could, however, replace the call to cumsum with a call to argmax:
result = grouped['Period'].apply(
lambda x: x.loc[:(x.diff() > 1).argmax()].iloc[:-1])
For very large x this might be somewhat quicker:
x = df['Period']
x = pd.concat([x]*1000)
x = x.reset_index(drop=True)
In [68]: %timeit x.loc[:(x.diff() > 1).argmax()].iloc[:-1]
1000 loops, best of 3: 884 µs per loop
In [69]: %timeit x.loc[(x.diff() > 1).cumsum() == 0]
1000 loops, best of 3: 1.12 ms per loop
Note, however, that argmax here returns an index label, not an ordinal index location. Therefore, using argmax will not work if x.index contains duplicate values. (That's why I had to set x = x.reset_index(drop=True).) In current pandas releases this label-returning behavior belongs to Series.idxmax, while Series.argmax returns the integer position, so there idxmax is the method to pair with .loc.
So while using argmax is a bit faster in some situations, this alternative is not quite as robust.
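If robustness matters, a position-based variant is one possible workaround. The sketch below is not the method shown above: it swaps in NumPy's np.diff and ndarray.argmax (which always returns a position) together with .iloc, so duplicate index values should not matter. It also handles a group with no gap at all, a case where the label-based slice would come back empty because an all-False mask puts the maximum at the first element. The helper name leading_run is hypothetical.

```python
import numpy as np
import pandas as pd

def leading_run(x):
    """Return the leading part of Series x, up to the first jump larger than 1."""
    # gaps[i] is True when x[i + 1] - x[i] > 1 (purely positional).
    gaps = np.diff(x.to_numpy()) > 1
    if gaps.any():
        # argmax gives the position of the first True; the run ends one past it.
        stop = int(gaps.argmax()) + 1
    else:
        # No gap at all: keep the whole series.
        stop = len(x)
    return x.iloc[:stop]

df = pd.DataFrame({
    'Country': ['DE'] * 5 + ['US'] * 5,
    'Product': ['Blue'] * 5 + ['Green'] * 5,
    'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6],
})
result = df.groupby(['Country', 'Product'])['Period'].apply(leading_run)
print(result)
```

Whether this beats the cumsum version will depend on the data, so it is worth timing on your own frame before switching.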
Sorry, I am not familiar with pandas, but generally this can be done straightforwardly in plain Python.
list(zip(data['Country'], data['Product'], data['Period']))
and the result will be a list:
[('DE', 'Blue', 1), ('DE', 'Blue', 2), ('DE', 'Blue', 3), ('DE', 'Blue', 5),
('DE', 'Blue', 6), ('US', 'Green', 1), ('US', 'Green', 2), ('US', 'Green', 4),
('US', 'Green', 5), ('US', 'Green', 6)]
After this, the result can easily be fed to your function.
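The answer above stops at building the list of tuples. One way the remaining step might look in plain Python is sketched here with itertools.groupby. It assumes the rows are already ordered so that each (Country, Product) group is contiguous, as in the sample data; the names rows, kept and prev are only illustrative.

```python
from itertools import groupby

data = {
    'Country': ['DE'] * 5 + ['US'] * 5,
    'Product': ['Blue'] * 5 + ['Green'] * 5,
    'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6],
}
rows = list(zip(data['Country'], data['Product'], data['Period']))

kept = []
# groupby only merges adjacent rows, hence the contiguity assumption.
for key, group in groupby(rows, key=lambda r: (r[0], r[1])):
    prev = None
    for row in group:
        # Stop at the first jump larger than 1 within this group.
        if prev is not None and row[2] - prev > 1:
            break
        kept.append(row)
        prev = row[2]

print(kept)
```

This does loop in Python, which the question wanted to avoid, so it is mainly useful when pandas is not available.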