
How to filter a dataframe and identify records based on a condition on multiple other columns

            id          zone  price
0        0000001           1   33.0
1        0000001           2   24.0
2        0000001           3   34.0
3        0000001           4   45.0
4        0000001           5   51.0

I have the pandas dataframe above. It contains multiple ids (only one id is shown here), and each id has 5 zones and 5 prices. These prices should follow the pattern below:

p1 (price of zone 1) < p2 < p3 < p4 < p5

If anything is out of order, we should identify the anomalous records and print them to a file.

In this example p3 < p4 < p5, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).

Therefore the first 2 records should be printed to a file.

Likewise, this has to be done across the entire dataframe, for all unique ids in it.

My dataframe is huge. What is the most efficient way to do this filtering and identify the erroneous records?

You can compute the diff per group after sorting the values so that the zones are in increasing order. If the diff is ≤ 0, the price is not strictly increasing and the rows should be flagged:

s = (df.sort_values(by=['id', 'zone']) # sort rows
       .groupby('id')                  # group by id
       ['price'].diff()                # compute the diff
       .le(0)                          # flag those ≤ 0 (not increasing)
     )
df[s|s.shift(-1)]                      # slice flagged rows + previous row

Example output:

   id  zone  price
0   1     1   33.0
1   1     2   24.0

Example input:

   id  zone  price
0   1     1   33.0
1   1     2   24.0
2   1     3   34.0
3   1     4   45.0
4   1     5   51.0
5   2     1   20.0
6   2     2   24.0
7   2     3   34.0
8   2     4   45.0
9   2     5   51.0
Saving to file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
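
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same diff-per-group idea, run on the example input above. The helper name `flag_price_anomalies` is made up for illustration, and passing `fill_value=False` to `shift` is a small variation on the snippet above (it keeps the mask boolean instead of leaving a NaN in the last position), not something the original code requires.

```python
import pandas as pd

# Example input from above: id 1 breaks the pattern between zones 1 and 2, id 2 is clean.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "zone":  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "price": [33.0, 24.0, 34.0, 45.0, 51.0, 20.0, 24.0, 34.0, 45.0, 51.0],
})


def flag_price_anomalies(frame):
    """Hypothetical helper: return both rows of every non-increasing price step within an id."""
    ordered = frame.sort_values(by=["id", "zone"])
    # True where the price is not strictly greater than the previous zone's price.
    drop = ordered.groupby("id")["price"].diff().le(0)
    # Also flag the row just before each drop. The first row of every id has a
    # NaN diff (so it is False in `drop`), which means this shift cannot leak across ids.
    before_drop = drop.shift(-1, fill_value=False)
    return ordered[drop | before_drop]


anomalies = flag_price_anomalies(df)
print(anomalies)
```

The result can then be written out with `anomalies.to_csv(...)` exactly as in the line above. Everything here is vectorised (one sort, one grouped diff, one shift), which is usually what matters on a large dataframe.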

Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one using groupby.shift(), storing the result in a new column. Then you can just print out the prices that have fallen in value:

import numpy as np 
import pandas as pd

# sort_values returns a new dataframe, so assign the result back
df = df.sort_values(by=['id','zone'], ascending=True)
df['increase'] = np.where(df.zone.eq(1),'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'],'inc','dec'))

>>> df

    id  zone  price   increase
0    1     1     33  no change
1    1     2     24        dec
2    1     3     34        inc
3    1     4     45        inc
4    1     5     51        inc
5    2     1     34  no change
6    2     2     56        inc
7    2     3     22        dec
8    2     4     55        inc
9    2     5     77        inc
10   3     1     44  no change
11   3     2     55        inc
12   3     3     44        dec
13   3     4     66        inc
14   3     5     33        dec

>>> df.loc[df.increase.eq('dec')]

    id  zone  price increase
1    1     2     24      dec
7    2     3     22      dec
12   3     3     44      dec
14   3     5     33      dec

I have added some extra IDs to try to mimic your real data.
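
Note that the filter above returns only the row where the price dropped. The question also wants the record just before it (both p1 and p2 in the example), so here is a rough sketch of one way to get both rows with `groupby.shift()` in each direction. The names `prev_price`, `next_price` and `pairs` are mine, and a price equal to the previous one is treated as an anomaly, matching the `'dec'` label in the code above.

```python
import pandas as pd

# Same sample data as in the output above (three ids, five zones each).
df = pd.DataFrame({
    "id":    [1] * 5 + [2] * 5 + [3] * 5,
    "zone":  [1, 2, 3, 4, 5] * 3,
    "price": [33, 24, 34, 45, 51,
              34, 56, 22, 55, 77,
              44, 55, 44, 66, 33],
})
df = df.sort_values(by=["id", "zone"])

# Previous and next price within the same id (NaN at the group edges).
prev_price = df.groupby("id")["price"].shift(1)
next_price = df.groupby("id")["price"].shift(-1)

# Comparisons against NaN evaluate to False, so group edges are never flagged by accident.
is_drop = df["price"] <= prev_price        # this row is not above the previous one
precedes_drop = next_price <= df["price"]  # the next row is not above this one

pairs = df[is_drop | precedes_drop]
print(pairs)
```

For id 3 this returns zones 2 to 5, since zone 3 drops below zone 2 and zone 5 drops below zone 4.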
