Compute if a value exists in a column of lists in a pandas DataFrame
I have 2 columns in my dataframe: "p", and "p_list", a list of product IDs purchased by similar customers.
import pandas as pd

df = pd.DataFrame({'p': [12, 4, 5, 6, 7, 7, 6, 5], 'p_list': [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9], [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]]})
I want to check whether "p" exists in "p_list" or not, so I applied this code:
df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)
The problem is that I have around 50 million rows in this dataframe, so it takes a very long time to execute.
Is there a more efficient way to compute this column?
Thanks.
You can use a list comprehension, then cast the True/False values to int:
df["exist"] = [r[0] in r[1] for r in zip(df["p"], df["p_list"])]
df["exist"] = df["exist"].astype(int)
print(df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0
Alternatively, cast to int inside the comprehension:
df["exist"] = [int(r[0] in r[1]) for r in zip(df["p"], df["p_list"])]
print(df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0
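For reuse, the comprehension above could be wrapped in a small helper. This is only a sketch: the function name add_membership_flag and its parameters are hypothetical, not part of pandas, and it unpacks the zipped pairs into names instead of indexing r[0] and r[1].

```python
import pandas as pd


def add_membership_flag(df, value_col="p", list_col="p_list", out_col="exist"):
    """Return a copy of df with a 0/1 column: is the value in that row's list?

    Hypothetical helper for illustration; the column names are parameters.
    """
    # Iterate over the two columns in lockstep; no per-row Series is built,
    # which is what makes this cheaper than df.apply(..., axis=1).
    flags = [int(value in items) for value, items in zip(df[value_col], df[list_col])]
    return df.assign(**{out_col: flags})


df = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})

result = add_membership_flag(df)
print(result["exist"].tolist())  # [1, 0, 0, 1, 1, 0, 1, 0]
```

Because the helper uses DataFrame.assign, the original frame is left unchanged and the flag column only appears on the returned copy.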
Timings:
#[8000 rows x 2 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
print (df)
In [89]: %%timeit
    ...: df["exist2"] = [r[0] in r[1] for r in zip(df["p"], df["p_list"])]
    ...: df["exist2"] = df["exist2"].astype(int)
    ...:
100 loops, best of 3: 6.07 ms per loop

In [90]: %%timeit
    ...: df["exist"] = [1 if r[0] in r[1] else 0 for r in zip(df["p"], df["p_list"])]
    ...:
100 loops, best of 3: 7.16 ms per loop

In [91]: %%timeit
    ...: df["exist"] = [int(r[0] in r[1]) for r in zip(df["p"], df["p_list"])]
    ...:
100 loops, best of 3: 9.23 ms per loop

In [92]: %%timeit
    ...: df['exist1'] = df.apply(lambda x: x.p in x.p_list, axis=1).astype(int)
    ...:
1 loop, best of 3: 370 ms per loop

In [93]: %%timeit
    ...: df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)
    ...:
1 loop, best of 3: 310 ms per loop
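To repeat the comparison outside IPython, something like the following sketch with the standard-library timeit module should work. The function names with_comprehension and with_apply are made up for this example, and the absolute numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})
# Same 8000-row test frame as in the timings above.
big = pd.concat([base] * 1000).reset_index(drop=True)


def with_comprehension(frame):
    # List comprehension over the two columns zipped together.
    return [int(v in lst) for v, lst in zip(frame["p"], frame["p_list"])]


def with_apply(frame):
    # Row-wise apply, as in the question.
    return frame.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1).tolist()


# Both approaches must agree before their speed is compared.
assert with_comprehension(big) == with_apply(big)

for fn in (with_comprehension, with_apply):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```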