Compute if value exists in a column of lists in pandas dataframe

I have 2 columns in my dataframe:

  1. the product ID purchased by the customer, "p"
  2. the list of product IDs purchased by similar customers, "p_list"

     import pandas as pd

     df = pd.DataFrame({'p': [12, 4, 5, 6, 7, 7, 6, 5], 'p_list': [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9], [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]]})

I want to check whether "p" exists in "p_list", so I applied this code:

df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)

The problem is that I have around 50 million rows in this dataframe, so it takes a very long time to execute.

Is there a more efficient way to compute this column?

Thanks.

You can use a list comprehension, then cast the True/False values to int:

df["exist"] = [r[0] in r[1]  for r in zip(df["p"], df["p_list"])]
df["exist"] = df["exist"].astype(int)
print (df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0

Alternatively, convert to int inside the comprehension:

df["exist"] = [int(r[0] in r[1]) for r in zip(df["p"], df["p_list"])]
print (df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0
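
The list comprehension can be checked against the original apply version on the sample data. The following is only a minimal sketch: the tuple unpacking (`p, lst`) and the names `via_apply` and `via_zip` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})

# Original row-wise approach from the question.
via_apply = df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)

# List comprehension over the two columns, unpacking each pair by name
# instead of indexing r[0] / r[1] (hypothetical names, same logic).
via_zip = pd.Series(
    [int(p in lst) for p, lst in zip(df["p"], df["p_list"])],
    index=df.index,
)

# Both approaches should agree on every row.
assert via_apply.tolist() == via_zip.tolist()
print(via_zip.tolist())  # [1, 0, 0, 1, 1, 0, 1, 0]
```

Unpacking into `p, lst` behaves the same as `r[0] in r[1]`; it only makes the membership test easier to read.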

Timings:

#[8000 rows x 2 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
print (df)

In [89]: %%timeit
    ...: df["exist2"] = [r[0] in r[1]  for r in zip(df["p"], df["p_list"])]
    ...: df["exist2"] = df["exist2"].astype(int)
    ...: 
100 loops, best of 3: 6.07 ms per loop

In [90]: %%timeit
    ...: df["exist"] = [1 if r[0] in r[1] else 0  for r in zip(df["p"], df["p_list"])]
    ...: 
100 loops, best of 3: 7.16 ms per loop

In [91]: %%timeit
    ...: df["exist"] = [int(r[0] in r[1])  for r in zip(df["p"], df["p_list"])]
    ...: 
100 loops, best of 3: 9.23 ms per loop

In [92]: %%timeit
    ...: df['exist1'] = df.apply(lambda x: x.p in x.p_list, axis=1).astype(int)
    ...: 
1 loop, best of 3: 370 ms per loop

In [93]: %%timeit
    ...: df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)
1 loop, best of 3: 310 ms per loop
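
The timings above come from IPython's `%%timeit`. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The function names `with_apply` and `with_zip` and the repeat counts are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})
# Repeat the 8 sample rows 1000 times to get 8000 rows, as in the timings above.
df = pd.concat([df] * 1000).reset_index(drop=True)


def with_apply(frame):
    """Row-wise apply, as in the question (hypothetical wrapper)."""
    return frame.apply(
        lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1
    ).tolist()


def with_zip(frame):
    """List comprehension over zipped columns (hypothetical wrapper)."""
    return [int(p in lst) for p, lst in zip(frame["p"], frame["p_list"])]


# Sanity check before timing: both must give identical results.
assert with_apply(df) == with_zip(df)

# Take the best of 3 repeats, mirroring "best of 3" in the output above.
t_apply = min(timeit.repeat(lambda: with_apply(df), number=1, repeat=3))
t_zip = min(timeit.repeat(lambda: with_zip(df), number=10, repeat=3)) / 10

print(f"apply: {t_apply * 1e3:.2f} ms per call")
print(f"zip:   {t_zip * 1e3:.2f} ms per call")
```

Taking the minimum over repeats reduces noise from other processes, which is why it is a common way to report such measurements.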

