Fastest way to filter a Pandas DataFrame using lists

Question

Suppose I have a DataFrame such as:

   col1  col2
0     1     A
1     2     B
2     6     A
3     5     C
4     9     C
5     3     A
6     5     B

And multiple lists such as:

list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

I can update the value of col2 depending on whether the value of col1 is included in a list, for example:

for i in list_1:
    df.loc[df.col1 == i, 'col2'] = 'A'

for i in list_2:
    df.loc[df.col1 == i, 'col2'] = 'B'

for i in list_3:
    df.loc[df.col1 == i, 'col2'] = 'C'

However, this is very slow. With a DataFrame of 30,000 rows, and each list containing approximately 5,000-10,000 items, it can take a long time to calculate, especially compared to other pandas operations. Is there a better (faster) way of doing this?
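For reference, here is a minimal reproduction of the setup, assuming the sample frame and lists shown above. `label_with_loops` is a hypothetical helper name used only to wrap the loops from the question. Each `out["col1"] == i` comparison scans the whole column, so the total work grows roughly with the number of rows times the total number of list items, which is likely why it feels slow at the sizes described.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 2, 6, 5, 9, 3, 5],
    "col2": ["A", "B", "A", "C", "C", "A", "B"],
})
list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]


def label_with_loops(frame):
    """Apply the question's per-item loops to a copy of the frame."""
    out = frame.copy()
    for values, label in [(list_1, "A"), (list_2, "B"), (list_3, "C")]:
        for i in values:
            # One full-column comparison per list item
            out.loc[out["col1"] == i, "col2"] = label
    return out


slow_result = label_with_loops(df)
print(slow_result["col2"].tolist())
```

This gives a baseline result that any faster approach should reproduce.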

Answer 1

You can use isin with np.select here:

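import numpy as np
# default keeps the current col2 for unmatched rows (np.select otherwise falls back to 0)
df['col2'] = np.select([df['col1'].isin(list_1),
                        df['col1'].isin(list_2),
                        df['col1'].isin(list_3)],
                       ['A', 'B', 'C'], default=df['col2'])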

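With map:

# Flatten the lists into one {value: label} lookup, then map col1 through it;
# values of col1 that appear in no list become NaN
d = dict(zip(map(tuple, [list_1, list_2, list_3]), ['A', 'B', 'C']))
df['col2'] = df['col1'].map({val: v for k, v in d.items() for val in k})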

   col1 col2
0     1    A
1     2    A
2     6    C
3     5    C
4     9    C
5     3    B
6     5    C
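The two approaches agree on the sample data but treat unmatched rows differently. The sketch below illustrates this, assuming one extra col1 value (10) that appears in no list. The `groups` dictionary and the `"unmatched"` default are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The question's col1 values plus one added value (10) that is in no list
col1 = pd.Series([1, 2, 6, 5, 9, 3, 5, 10])
groups = {"A": [1, 2, 4], "B": [3, 8], "C": [5, 6, 7, 9]}

# np.select: one boolean mask per list; unmatched rows receive the default
conditions = [col1.isin(values) for values in groups.values()]
via_select = np.select(conditions, list(groups.keys()), default="unmatched")

# map: flatten to {value: label}; unmatched rows become NaN
lookup = {value: label for label, values in groups.items() for value in values}
via_map = col1.map(lookup)

print(via_select.tolist())
print(via_map.tolist())
```

Which behavior is preferable depends on whether unmatched rows should keep a sentinel, keep their old value, or be flagged as missing.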

Answer 2

You can first convert the lists to dicts and then map them onto col1.

d1 = {k:'A' for k in list_1}
d2 = {k:'B' for k in list_2}
d3 = {k:'C' for k in list_3}

# map() yields NaN for values missing from a dict, so combine_first can fill them in
df['col2'] = (
    df.col1.map(d1)
    .combine_first(df.col1.map(d2))
    .combine_first(df.col1.map(d3))
)

If there are no duplicates in the lists, you can make it even faster by merging them into a single dict:

d = {**{k:'A' for k in list_1}, 
     **{k:'B' for k in list_2}, 
     **{k:'C' for k in list_3}}
df['col2'] = df.col1.apply(lambda x: d.get(x,x))
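As a possible variant, the merged dict can be passed to `Series.map` directly instead of calling a Python lambda per row. The sketch below assumes that unmatched rows should keep their existing col2 value, which is a choice made here and not stated in the answer. The extra row with col1 = 10 is added only to show that case.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 6, 5, 9, 3, 5, 10],   # 10 is an added value found in no list
    "col2": ["A", "B", "A", "C", "C", "A", "B", "B"],
})

# Single {value: label} lookup; with duplicates, the later entry would win
lookup = {
    **dict.fromkeys([1, 2, 4], "A"),
    **dict.fromkeys([3, 8], "B"),
    **dict.fromkeys([5, 6, 7, 9], "C"),
}

# map() gives NaN for values missing from the lookup; fillna restores the old label
df["col2"] = df["col1"].map(lookup).fillna(df["col2"])
print(df["col2"].tolist())
```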

Answer 3

I would suggest putting your lists in a dictionary and iterating through it with conditional updates:

import numpy as np

# Create your update dictionary
col_dict = {
    "A": [1, 2, 4],
    "B": [3, 8],
    "C": [5, 6, 7, 9]
}

# Iterate and update
for key, value in col_dict.items():
    # key is the label to write to col2; value is the lookup list
    df["col2"] = np.where(df["col1"].isin(value), key, df["col2"])

One concern is overwriting values, since a row can technically match multiple lists. How those updates are reconciled is not obvious: with the loop above, keys processed later overwrite earlier ones.
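To make the precedence concrete, here is a small sketch with made-up data in which the value 5 deliberately appears under both "A" and "C". It compares the `np.where` loop (the last matching key wins) with `np.select` (the first matching condition wins). The frame and dictionary here are illustrative assumptions, not the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 5, 3], "col2": ["x", "x", "x"]})
col_dict = {"A": [1, 5], "B": [3], "C": [5, 9]}  # 5 is listed under both A and C

# Sequential np.where updates: a later key overwrites an earlier match
looped = df["col2"].copy()
for key, values in col_dict.items():
    looped = pd.Series(np.where(df["col1"].isin(values), key, looped), index=df.index)

# np.select evaluates conditions in order: the first match is kept
conditions = [df["col1"].isin(values) for values in col_dict.values()]
selected = np.select(conditions, list(col_dict.keys()), default=df["col2"].to_numpy())

print(looped.tolist())    # the row with col1 == 5 ends up as "C"
print(selected.tolist())  # the row with col1 == 5 ends up as "A"
```

If the lists can overlap, ordering the dictionary deliberately is one way to make the outcome explicit.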

If rows don't match multiple keys, consider an incremental approach that keeps a running index of the still-unmatched rows and updates it as you proceed, so that each iteration checks fewer rows than the one before.
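One way the running-index idea might look is sketched below, assuming the question's sample data. The variable names `unmatched` and `rows_checked` are hypothetical and only serve the illustration. Note that with this scheme the first key to match a row keeps it, because matched rows are removed from later checks.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 6, 5, 9, 3, 5],
    "col2": ["A", "B", "A", "C", "C", "A", "B"],
})
col_dict = {"A": [1, 2, 4], "B": [3, 8], "C": [5, 6, 7, 9]}

unmatched = df.index      # index labels of rows not yet assigned a label
rows_checked = []         # how many rows each iteration had to inspect

for key, values in col_dict.items():
    rows_checked.append(len(unmatched))
    # Test only the rows that are still unmatched
    hit = df.loc[unmatched, "col1"].isin(values).to_numpy()
    df.loc[unmatched[hit], "col2"] = key
    # Shrink the candidate set for the next iteration
    unmatched = unmatched[~hit]
    if unmatched.empty:
        break

print(df["col2"].tolist())
print(rows_checked)
```

Whether this beats a single `map` or `np.select` call would need measuring on real data, since the bookkeeping has its own cost.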

Answer 1: score 6, accepted, posted 2020-04-30 03:40:30
Answer 2: score 4, posted 2020-04-30 03:44:38
Answer 3: score 1, posted 2020-04-30 04:14:36
