Pythonic way to apply a filter to a table

Question

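I have two tables: a data table and a filter table. I want to apply the filter table to the data table so that only certain records are selected. When a column of the filter table contains #, that column's filter is ignored. In addition, several alternative values can be given in one field by using the | separator.

I implemented this with a for loop and a bunch of & and | conditions. However, since my filter table is quite large, I wonder whether there is a more efficient way to achieve this. My filter table looks like: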

import pandas as pd
import numpy as np

f = {'business':['FX','FX','IR','IR','CR'],
     'A/L':['A','L','A','L','#'],
     'Company':['207|401','#','#','207','#']}
filter = pd.DataFrame(data=f)
filter

The data table looks like:

d = {'business': ['FX','a','CR'],
     'A/L': ['A','A','L'],
     'Company': ['207','1','2']}
data = pd.DataFrame(data=d)
data

Finally, the filtering code looks like:

for counter in range (0, len(filter)):
    businessV = str(filter.iat[counter,0])
    ALV = str(filter.iat[counter,1])
    CompanyV = str(filter.iat[counter,2])


    businessV1 = businessV.split("|", 100)
    ALV1 = ALV.split("|", 100)
    CompanyV1 = CompanyV.split("|", 100)

    businessV2 = ('#' in businessV1)| (data['business'].isin(businessV1))
    ALV2 = ('#' in ALV1)|(data['A/L'].isin(ALV1))
    CompanyV2 = ('#' in CompanyV1)| (data['Company'].isin(CompanyV1))

    final_filter = businessV2 & ALV2 & CompanyV2
    print(final_filter)

I am trying to find a more efficient way to use the filters in the filter table to select the first and last rows of the data table.

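Specifically, I would like to know how to:

Handle the case where the filter table has many columns
Avoid the current behaviour, where the code goes through every row of the data table once for each row of the filter table. For large data sets this takes too much time and does not seem efficient to me.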
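
To make the expected result concrete, here is a minimal sketch that ORs the per-filter-row masks together and works for any number of filter columns. It keeps the same loop-based approach as the code above, so it is not faster; the name filter_df is a hypothetical stand-in for the filter table (chosen only to avoid shadowing the built-in filter), and the sketch assumes both tables share the same column names.

```python
import pandas as pd

filter_df = pd.DataFrame({'business': ['FX', 'FX', 'IR', 'IR', 'CR'],
                          'A/L': ['A', 'L', 'A', 'L', '#'],
                          'Company': ['207|401', '#', '#', '207', '#']})
data = pd.DataFrame({'business': ['FX', 'a', 'CR'],
                     'A/L': ['A', 'A', 'L'],
                     'Company': ['207', '1', '2']})

# Start with nothing selected, then OR in the rows accepted by each filter row.
selected = pd.Series(False, index=data.index)
for _, frow in filter_df.iterrows():
    row_mask = pd.Series(True, index=data.index)
    for col in filter_df.columns:
        options = str(frow[col]).split('|')
        if '#' not in options:  # '#' means "ignore this column"
            row_mask &= data[col].isin(options)
    selected |= row_mask

result = data[selected]
print(result)
```

With the sample tables this keeps the first and last rows of data (index 0 and 2).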

Answer 1

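This is a rather complex question. I would first pre-process the filter table by duplicating the rows that contain '|', so that every field holds a single value. To limit the number of useless rows, I would begin by replacing any field that contains both '#' and other values with a single '#'.

Once this is done, merge can be used to select rows from the business table (the data table), provided the merge is done only on the columns that do not contain a sharp ('#').

The code could be: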

# store the original column names
cols = filter.columns
# remove any alternate value if a # is already present:
tosimp = pd.DataFrame({col: filter[col].str.contains('#')&
                       filter[col].str.contains(r'\|')
                       for col in cols})
# assumed missing step: collapse those cells to a single '#'
filter[tosimp] = '#'

# add a column to store in a (hashable) tuple the columns with no '#'
filter['wild'] = filter.apply(lambda x: tuple(col for col in cols
                                             if x[col] != '#'), axis=1)

# now explode the fields containing a '|'
tosimp = pd.DataFrame({col: filter[col].str.contains(r'\|')
                       for col in filter.columns})

# again, store in a new column the columns containing a '|'
tosimp['wild'] = filter.apply(lambda x: tuple(col for col in cols
                                             if '|' in filter.loc[x.name, col]),
                              axis=1)

# compute a new filter table with one single value per field (or #)
# by grouping on tosimp['wild']
dfl = [filter[tosimp['wild'].astype(str)=='()']]
for k, df in filter[tosimp['wild'].astype(str)!='()'].groupby(tosimp['wild']):
    for ix, row in df.iterrows():
        tmp = pd.MultiIndex.from_product([df.loc[ix, col].split('|')
                                          for col in k], names=k).to_frame(None)
        l = len(tmp)
        dfl.append(pd.DataFrame({col: tmp[col]
                                 if col in k else [row[col]] * l
                                 for col in filter.columns}))

filter2 = pd.concat(dfl)

# Ok, we can now use that new filter table to filter the business table
result = pd.concat([data.merge(df, on=k, suffixes=('', '_y'),
                               right_index=True)[cols]
                    for k, df in filter2.groupby('wild')]).sort_index()

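Limitations:

The pre-processing iterates over the groups of a dataframe and calls iterrows: it may take some time on a large filter table
The current algorithm does not handle at all a row that contains '#' in every field. If this is a possible use case, it has to be searched for before any other processing. In that case, every row of the business table would be kept anyway.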
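
The all-'#' case mentioned in the limitations can be detected up front. The following is only a sketch under assumptions: filter_df is a hypothetical two-row filter table, and a cell is treated as a wildcard whenever '#' is one of its '|'-separated options.

```python
import pandas as pd

filter_df = pd.DataFrame({'business': ['FX', '#'],
                          'A/L': ['A', '#'],
                          'Company': ['207|401', '#']})
cols = list(filter_df.columns)

# A cell counts as a wildcard when '#' is one of its '|'-separated options.
is_wild = pd.DataFrame({
    col: filter_df[col].str.split('|').apply(lambda opts: '#' in opts)
    for col in cols
})

# True if at least one filter row is a wildcard in every column,
# in which case every row of the data table passes the filter.
keep_everything = bool(is_wild.all(axis=1).any())
print(keep_everything)
```

When keep_everything is true, the whole data table can be returned directly and the rest of the processing skipped.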

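Explanation of the pd.concat(... line:

[... for k, df in filter2.groupby('wild')]: splits the filter dataframe into sub-dataframes, each with a different wild value, i.e. a different set of non-# fields
data.merge(df, on=k, suffixes=('', '_y'), right_index=True): merges each sub-filter dataframe with the data dataframe on the non-# fields, i.e. selects the rows of the data dataframe that match one of these filter rows, while keeping the original index of the data dataframe
...[cols] keeps only the relevant fields
pd.concat(...) concatenates all of these partial dataframes
... .sort_index() sorts the concatenated dataframe by its index, which is by construction the index of the original dataframe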
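
Recent pandas releases may raise a MergeError when on= is combined with right_index=True, so the merge step described above might need adapting. The sketch below is one possible variant, not the answer's original code: it groups the filter rows by which columns they constrain, carries the data index through the merge with reset_index, and collects the matching labels. The filter2 table here is a hypothetical, already expanded filter, and the sketch assumes data has a default index and no column named 'index'.

```python
import pandas as pd

data = pd.DataFrame({'business': ['FX', 'a', 'CR'],
                     'A/L': ['A', 'A', 'L'],
                     'Company': ['207', '1', '2']})

# Hypothetical expanded filter: one value per field, '#' = wildcard.
filter2 = pd.DataFrame({'business': ['FX', 'FX', 'FX', 'CR'],
                        'A/L': ['A', 'A', 'L', '#'],
                        'Company': ['207', '401', '#', '#']})
cols = list(data.columns)

# Boolean frame: True where a filter cell actually constrains the column.
constrained = filter2[cols].ne('#')

matched_labels = set()
for pattern, df in filter2.groupby([constrained[c] for c in cols]):
    keys = [c for c, used in zip(cols, pattern) if used]
    if not keys:
        continue  # all-'#' rows need separate handling, as noted above
    # reset_index() exposes the original labels as an 'index' column,
    # so they survive the column-on-column merge.
    hits = data.reset_index().merge(df[keys].drop_duplicates(), on=keys)
    matched_labels.update(hits['index'])

result = data.loc[sorted(matched_labels)]
print(result)
```

Collecting labels in a set also avoids returning a data row twice when it matches several filter rows.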

Answer 2

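My understanding of your question is that, for each business, A/L combination, you want the first match whose Company is the one specified in the corresponding filter (or any Company if # is used).

I assume your expected result is a dataframe containing only the first row of data. When your filter grows large, you can speed things up by using a join operation against the filter and keeping only the first result.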

# Split on | so that every option is represented in a single row
filter0 = (filter.set_index(['business','A/L'])
                 .Company.str.split('|', expand=True)
                 .stack()
                 .reset_index()
                 .drop('level_2', axis=1)
                 .rename(columns={0:'Company'}))

# The set of *all* rows in data which are caught by filters with a Company specification
r1 = data.merge(filter0[filter0.Company != '#'])

# The set of *all* rows in data which are caught by filters allowing for *any* Company
r2 = data.merge(filter0[filter0.Company == '#'].drop('Company', axis=1))

# r1 and r2 are not necessarily disjoint, and each one may have multiple rows that pass one filter
# Take the union (merge does not keep data's original index, so no index sort is applied here),
# then finally drop duplicates of business+A/L, keeping only the first entry
pd.concat([r1,r2]).drop_duplicates(subset=['business','A/L'], keep='first')
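
As a possible alternative to the split/stack chain that builds filter0, pandas 0.25 and later provide DataFrame.explode, which gives each list element its own row. This is a sketch rather than part of the original answer; filter_df is a hypothetical name for the filter table.

```python
import pandas as pd

filter_df = pd.DataFrame({'business': ['FX', 'FX', 'IR', 'IR', 'CR'],
                          'A/L': ['A', 'L', 'A', 'L', '#'],
                          'Company': ['207|401', '#', '#', '207', '#']})

# Turn 'a|b' into ['a', 'b'], then give each list element its own row.
filter0 = (filter_df
           .assign(Company=filter_df['Company'].str.split('|'))
           .explode('Company')
           .reset_index(drop=True))
print(filter0)
```

The same assign/explode pair can be repeated for every column that may hold '|'-separated options.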

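Regarding the case of handling multiple columns in the filter: a row of the filter essentially says the following,

"I want field1=foo AND field2=bar AND (field3=baz1 OR field3=baz2) AND (field4=qux1 OR field4=qux2)."

The main idea is to expand this into several rows made up of AND conditions only, so in this case it becomes four rows: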

field1=foo AND field2=bar AND field3=baz1 AND field4=qux1

field1=foo AND field2=bar AND field3=baz1 AND field4=qux2

field1=foo AND field2=bar AND field3=baz2 AND field4=qux1

field1=foo AND field2=bar AND field3=baz2 AND field4=qux2

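In other words, apply .split and .stack several times, once for each column that has OR conditions. This may be slightly inefficient (you would probably get better speed and code readability by using itertools.product somewhere), but the bottleneck is usually the join operation, so in terms of speed it is not much to worry about.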
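
The expansion into AND-only rows can be sketched with itertools.product as follows. This is an illustration under assumptions, not the answer's code: filter_df is a hypothetical one-row filter table using the field names from the example above.

```python
import itertools
import pandas as pd

filter_df = pd.DataFrame({'field1': ['foo'], 'field2': ['bar'],
                          'field3': ['baz1|baz2'], 'field4': ['qux1|qux2']})
cols = list(filter_df.columns)

expanded_rows = []
for row in filter_df.itertuples(index=False):
    # One list of alternatives per column, e.g. ['baz1', 'baz2'].
    options = [str(value).split('|') for value in row]
    # The Cartesian product yields every pure-AND combination.
    expanded_rows.extend(itertools.product(*options))

expanded = pd.DataFrame(expanded_rows, columns=cols)
print(expanded)
```

For the single filter row above this produces the four AND-only rows listed earlier, in the same order.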

2 solutions

Solution 1
1 Accepted 2019-06-17 13:37:41

Solution 2
0 2019-06-17 09:44:03
