Pandas: How do I assign a column value using multiple conditions, including comparing two columns of list objects?
I'm trying to set a boolean column in a dataframe based on a sequence of conditions. Most of my problems come from comparing two columns of list objects. I know how to test whether all items in a smaller list are in a larger list, and I know how to set a boolean column based on multiple conditions/columns across a dataframe. What I can't do is get the two to work together: when I try to iterate through the lists, I end up iterating through the rows of the column rather than through the items of the list in each row. I've already written a for loop that assigns values per row, but it is very slow, and I'll have at least 1 million rows to iterate over, so I need speed.
Here's a solution I've tested. In this scenario, 'success' is attributed to event1 when...
I'll be expanding the conditions to other values not listed in this problem, but bonus points if your answer includes conditional logic that checks for a single value in a list of values (or in a sequence of columns), and/or checks that all values in a list/sequence are not None/NaN.
df['success'] = np.where((
    (df.user_id==df.user_id.shift(-1)) &
    (df.event_id==1) &
    (df.event_id.shift(-1)==3) &
    (len(df.event1_tags)>0) &  # breaks because it's counting the rows in pd.Series
    (all(e in df.event3_tags.shift(-1) for e in df.event1_tags))  # breaks because it iterates through both columns as Series
    ), 1, 0)
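The two commented lines fail because `len()` and `in` operate on the whole Series instead of on the list stored in each row. A minimal sketch of per-row equivalents follows, assuming the columns hold either Python lists of hashable tags or NaN. The Series `s1` and `s3` and the helper `is_subset` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([['tag1'], [], np.nan, ['tag2', 'tag3']])
s3 = pd.Series([np.nan, ['tag1'], ['tag1'], ['tag1', 'tag2', 'tag3']])

# len() on a Series counts rows, not the items in each row's list.
n_rows = len(s1)

# .str.len() works element-wise on lists; NaN rows stay NaN,
# and NaN > 0 evaluates to False.
non_empty = s1.str.len() > 0


def is_subset(small, big):
    """True only when both values are lists and every item of small is in big."""
    return (
        isinstance(small, list)
        and isinstance(big, list)
        and set(small) <= set(big)
    )


# zip pairs the values row by row, so each call sees two lists (or NaN).
subset = pd.Series([is_subset(a, b) for a, b in zip(s1, s3)], index=s1.index)
```

Note that an empty list counts as a subset of any list here, which is why the question checks for a non-empty list as a separate condition.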
Here are two Stack Overflow questions that have helped me, followed by a toy dataframe and the desired output for that dataframe.
Pandas: How do I assign values based on multiple conditions for existing columns?
Checking if List contains all items from another list
import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.nan, ['tag2', 'tag3'], np.nan, np.nan, ['tag2', 'tag4'], np.nan, np.nan],
        'event3_tags' : [np.nan, np.nan, ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']]}
df = pd.DataFrame(data)
df
   user_id  event_id   event1_tags         event3_tags
0        1         1        [tag1]                 NaN
1        1         1            []                 NaN
2        1         3           NaN  [tag1, tag2, tag3]
3        2         1  [tag2, tag3]                 NaN
4        2         3           NaN  [tag1, tag2, tag3]
5        2         3           NaN  [tag1, tag2, tag3]
6        3         1  [tag2, tag4]                 NaN
7        3         3           NaN  [tag1, tag2, tag3]
8        3         3           NaN  [tag1, tag2, tag3]
# Desired output: the same toy data with the expected 'success' column
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.nan, ['tag2', 'tag3'], np.nan, np.nan, ['tag2', 'tag4'], np.nan, np.nan],
        'event3_tags' : [np.nan, np.nan, ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']],
        'success' : [0, 0, 0, 1, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df
   user_id  event_id   event1_tags         event3_tags  success
0        1         1        [tag1]                 NaN        0
1        1         1            []                 NaN        0
2        1         3           NaN  [tag1, tag2, tag3]        0
3        2         1  [tag2, tag3]                 NaN        1
4        2         3           NaN  [tag1, tag2, tag3]        0
5        2         3           NaN  [tag1, tag2, tag3]        0
6        3         1  [tag2, tag4]                 NaN        0
7        3         3           NaN  [tag1, tag2, tag3]        0
8        3         3           NaN  [tag1, tag2, tag3]        0
This is my current solution. As it turns out, it's not as slow as I expected (though I know it's still slow). I'm still interested in faster solutions if anybody has tips.
def get_conversion(df):
    # Start with no successes; rows that meet every condition are set to 1 below
    df['success'] = 0
    for i in df.itertuples():
        current_idx = i[0]  # i[0] is the row's index label
        next_idx = i[0]+1   # assumes a default integer index
        if ((next_idx in df.index)
            and (df['user_id'][current_idx]==df['user_id'][next_idx])
            and (df['event_id'][current_idx]==1)
            and (df['event_id'][next_idx]==3)
            and (len(df['event1_tags'][current_idx])!=0)
            and (all(t in df['event3_tags'][next_idx] for t in df['event1_tags'][current_idx]))
            ):
            df.loc[current_idx, 'success'] = 1
    return df

df = get_conversion(df)
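One possible faster variant, offered as a sketch rather than a benchmarked answer: the scalar conditions are vectorized with `shift(-1)`, and the list comparison runs once per row in a plain list comprehension, avoiding per-row DataFrame indexing. It assumes rows are already ordered so that the "next" event is the following row, that tags are hashable, and that list cells are Python lists or NaN. The names `add_success` and `tags_match` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_success(df):
    """Return a copy of df with a 0/1 'success' column (hypothetical helper)."""
    out = df.copy()

    # Row i of each shifted Series holds the value from row i + 1.
    next_user = out['user_id'].shift(-1)
    next_event = out['event_id'].shift(-1)
    next_tags = out['event3_tags'].shift(-1)

    # Cheap scalar conditions are vectorized across the whole column.
    candidate = (
        (out['user_id'] == next_user)
        & (out['event_id'] == 1)
        & (next_event == 3)
    )

    def tags_match(tags, following):
        # NaN cells are floats, so the isinstance checks reject them.
        return (
            isinstance(tags, list)
            and len(tags) > 0
            and isinstance(following, list)
            and set(tags) <= set(following)
        )

    # The list comparison still runs in Python, but only once per row.
    match = pd.Series(
        [tags_match(a, b) for a, b in zip(out['event1_tags'], next_tags)],
        index=out.index,
    )

    out['success'] = (candidate & match).astype(int)
    return out
```

Calling `add_success(df)` on the toy dataframe should mark only row 3 as a success. Unlike the loop above, it pairs rows by position rather than by index label, so it does not depend on a default integer index.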