
How to filter a DataFrame based on a criterion using .shift()

I am trying to remove all rows in a dataframe from the first non-sequential 'Period' onwards, within each group of a groupby. I would rather avoid looping if possible.

import pandas as pd


data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
    'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
    'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}

df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)

OUTPUT:

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
3      DE    Blue       5
4      DE    Blue       6
5      US   Green       1
6      US   Green       2
7      US   Green       4
8      US   Green       5
9      US   Green       6

So for example, the final output I would like is below:

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
5      US   Green       1
6      US   Green       2

Below is how I was attempting to do it, to give you an idea, though it has many mistakes in it. You can probably see the logic of what I am trying to do, though.

df = df.groupby(['Country','Product']).apply(lambda x: x[x.Period.shift(x.Period - 1) == 1]).reset_index(drop=True)

The tricky part is that rather than just using .shift(1), I am trying to pass a per-row value into .shift(): if a row's Period is 5, I want .shift(5 - 1) so it shifts up 4 places and checks that Period's value. If it equals 1, the run is still sequential; if not, I guess it would go into NaN territory.
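As a side note, one vectorized way to express that position-aware check without a per-row shift (a sketch of an alternative technique, not the `.shift()` approach itself) is to compare each Period against its group's first Period plus the row's position within the group, then keep only the leading run of matches:

```python
import pandas as pd

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US', 'US'],
        'Product': ['Blue', 'Blue', 'Blue', 'Blue', 'Blue',
                    'Green', 'Green', 'Green', 'Green', 'Green'],
        'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6]}
df = pd.DataFrame(data)

keys = ['Country', 'Product']
# position of each row within its group: 0, 1, 2, ...
pos = df.groupby(keys).cumcount()
# the Period each row *should* have if the run were fully sequential
expected = df.groupby(keys)['Period'].transform('first') + pos
in_seq = df['Period'] == expected
# cumprod turns the first False (and everything after it) into 0,
# so only each group's leading sequential run survives
keep = in_seq.astype(int).groupby([df[k] for k in keys]).cumprod().astype(bool)
result = df[keep]
```

This keeps rows 0-2 for DE/Blue and rows 5-6 for US/Green, matching the desired output above.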

Instead of using shift() you could use diff() and cumsum():

result = grouped['Period'].apply(
    lambda x: x.loc[(x.diff() > 1).cumsum() == 0])

import pandas as pd

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
    'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
    'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}

df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)
grouped = df.groupby(['Country','Product'])
result = grouped['Period'].apply(
    lambda x: x.loc[(x.diff() > 1).cumsum() == 0])
result.name = 'Period'
result = result.reset_index(['Country', 'Product'])
print(result)

yields

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
5      US   Green       1
6      US   Green       2

Explanation:

A sequential run of numbers has adjacent diffs of 1. For example, if for the moment we treat df['Period'] as one single group,

In [41]: df['Period'].diff()
Out[41]: 
0   NaN
1     1
2     1
3     2
4     1
5    -5
6     1
7     2
8     1
9     1
Name: Period, dtype: float64

In [42]: df['Period'].diff() > 1
Out[42]: 
0    False
1    False
2    False
3     True       <--- We want to cut off before here
4    False
5    False
6    False
7     True
8    False
9    False
Name: Period, dtype: bool

To find the cutoff location -- the first True in df['Period'].diff() > 1 -- we can use cumsum() and select those rows that equal 0:

In [43]: (df['Period'].diff() > 1).cumsum()
Out[43]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    2
8    2
9    2
Name: Period, dtype: int64

In [44]: (df['Period'].diff() > 1).cumsum() == 0
Out[44]: 
0     True
1     True
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Period, dtype: bool

Taking diff() and cumsum() might seem wasteful because these operations may be computing a lot of values that are not needed -- especially if x is very large and the first sequential run is very short.

Despite the wastefulness, the speed gained by calling NumPy or Pandas methods (implemented in C/Cython/C++ or Fortran) usually overpowers a less wasteful algorithm coded in pure Python.

You could, however, replace the call to cumsum with a call to argmax:

result = grouped['Period'].apply(
    lambda x: x.loc[:(x.diff() > 1).argmax()].iloc[:-1])

For very large x this might be somewhat quicker:

x = df['Period']
x = pd.concat([x]*1000)
x = x.reset_index(drop=True)

In [68]: %timeit x.loc[:(x.diff() > 1).argmax()].iloc[:-1]
1000 loops, best of 3: 884 µs per loop

In [69]: %timeit x.loc[(x.diff() > 1).cumsum() == 0]
1000 loops, best of 3: 1.12 ms per loop

Note, however, that argmax here returns an index label, not an ordinal (positional) index. Therefore, using argmax will not work if x.index contains duplicate values. (That's why I had to set x = x.reset_index(drop=True).)

So while using argmax is a bit faster in some situations, this alternative is not quite as robust.
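If you do need robustness to duplicate index labels, one option (a sketch, assuming a pandas version that has .to_numpy()) is to compute the cutoff as a *position* with NumPy's argmax on the underlying array and then slice with iloc:

```python
import pandas as pd

# duplicate index labels on purpose, to show they are harmless here
x = pd.Series([1, 2, 3, 5, 6], index=[0, 0, 1, 1, 2])

gap = x.diff() > 1
# argmax on the raw array returns a position, not a label;
# if there is no gap at all, keep the whole series
cut = int(gap.to_numpy().argmax()) if gap.any() else len(x)
result = x.iloc[:cut]
```

Here the first gap is at position 3 (5 follows 3), so the leading run [1, 2, 3] is kept even though the index contains duplicates.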

Sorry, I am not aware of pandas, but generally this can be achieved straightforwardly in plain Python:

zip(data['Country'], data['Product'], data['Period'])

and the result will be a list (in Python 3 you would wrap it in list()):
[('DE', 'Blue', 1), ('DE', 'Blue', 2), ('DE', 'Blue', 3), ('DE', 'Blue', 5), 
('DE', 'Blue', 6), ('US', 'Green', 1), ('US', 'Green', 2), ('US', 'Green', 4),
('US', 'Green', 5), ('US', 'Green', 6)]

After this the result can easily be fed to your function.
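To finish that thought, here is a pure-Python sketch using itertools.groupby over those zipped tuples, keeping each group's leading sequential run (the variable names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US', 'US'],
        'Product': ['Blue', 'Blue', 'Blue', 'Blue', 'Blue',
                    'Green', 'Green', 'Green', 'Green', 'Green'],
        'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6]}

rows = list(zip(data['Country'], data['Product'], data['Period']))

result = []
for key, group in groupby(rows, key=itemgetter(0, 1)):
    prev = None
    for row in group:
        # stop at the first non-sequential Period within this group
        if prev is not None and row[2] != prev + 1:
            break
        result.append(row)
        prev = row[2]
```

This relies on the rows already being sorted by (Country, Product), since itertools.groupby only groups consecutive items.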
