Pandas: How do I assign a column value using multiple conditions, including comparing two columns of list objects?

I'm trying to set a boolean column based on a sequence of conditions in a dataframe. I'm testing multiple conditions, and the one giving me the most trouble is comparing two list objects. I know how to test whether every item in a smaller list is in a larger list, and I know how to set a boolean column based on multiple conditions/columns across a dataframe. What I can't do is get the two to work together: when I try to iterate through the lists, I end up iterating over the rows of the column rather than over the items of the list in each row. I've already written a for loop to assign values per row, but it is very slow, and I'll have at least 1 million rows to iterate over, so I need speed.

Here's an approach I've tested. In this scenario, 'success' is attributed to event 1 when:

  1. Both events are performed by the same user.
  2. Event 1 is immediately followed by event 3.
  3. The tag list of event 1 is not empty.
  4. Every tag in event 1's tag list is also in event 3's tag list.

I'll be expanding the conditions to other values not listed in this problem. Bonus points if your answer includes conditional logic that checks for a single value in a list of values (or in a sequence of columns), and/or checks that all values in a list/sequence are not None/np.NaN.
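For the two bonus checks, here is a minimal sketch. It assumes each cell holds either a list/tuple of scalar items or NaN; the helper names `contains_value` and `all_present` and the sample data are hypothetical, not from the post. The last part shows the same two checks across a sequence of ordinary columns, where pandas can do the work without a Python-level loop.

```python
import numpy as np
import pandas as pd

def contains_value(values, target):
    """True when values is a list or tuple that contains target."""
    return isinstance(values, (list, tuple)) and target in values

def all_present(values):
    """True when values is a list or tuple with no None/NaN items."""
    if not isinstance(values, (list, tuple)):
        return False  # the cell itself is NaN/None
    return all(not pd.isna(v) for v in values)  # assumes scalar items

# Hypothetical column of list objects, with a missing cell and an empty list.
tags = pd.Series([['tag1', 'tag2'], ['tag3', None], np.nan, []])

has_tag1 = tags.map(lambda v: contains_value(v, 'tag1'))
no_missing = tags.map(all_present)

# Same checks across a sequence of ordinary (scalar) columns.
cols = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, 1, 4]})
any_is_one = cols.eq(1).any(axis=1)       # any column in the row equals 1
none_missing = cols.notna().all(axis=1)   # no column in the row is NaN
```

Note that `all_present` returns True for an empty list, because `all()` of an empty sequence is True; combine it with a length check if empty lists should fail.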

df['success'] = np.where((
    (df.user_id==df.user_id.shift(-1)) & 
    (df.event_id==1) & 
    (df.event_id.shift(-1)==3) &
    (len(df.event1_tags)>0) & # breaks because it's counting the rows in pd.Series
    (all(e in df.event3_tags.shift(-1) for e in df.event1_tags)) # breaks because it iterates through both columns as Series
                         ), 1, 0)
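One possible way around the two failing lines, sketched under the assumption that the tag columns hold either Python lists or NaN: evaluate the list conditions row by row into a boolean Series first, then combine that Series with the vectorized comparisons. The helper name `is_subset` and the intermediate names `next_tags` and `tags_ok` are illustrative, not part of the original attempt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'event_id': [1, 1, 3, 1, 3, 3, 1, 3, 3],
    'event1_tags': [['tag1'], [], np.nan, ['tag2', 'tag3'], np.nan, np.nan,
                    ['tag2', 'tag4'], np.nan, np.nan],
    'event3_tags': [np.nan, np.nan, ['tag1', 'tag2', 'tag3'],
                    np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
                    np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']],
})

def is_subset(small, large):
    """True only when both are lists, small is non-empty, and all of small is in large."""
    if not isinstance(small, list) or not isinstance(large, list):
        return False  # NaN or None on either side
    return len(small) > 0 and set(small).issubset(large)

# Align each row with the tag list of the row that follows it.
next_tags = df['event3_tags'].shift(-1)

# Per-row list comparison; this replaces the len(...) and all(...) conditions.
tags_ok = pd.Series(
    [is_subset(a, b) for a, b in zip(df['event1_tags'], next_tags)],
    index=df.index,
)

df['success'] = np.where(
    (df['user_id'] == df['user_id'].shift(-1))
    & (df['event_id'] == 1)
    & (df['event_id'].shift(-1) == 3)
    & tags_ok,
    1, 0,
)
```

The list comparison still runs in Python, once per row, but it avoids repeated label-based lookups into the dataframe, while the scalar conditions stay vectorized.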

Here are two Stack Overflow questions that have helped me, followed by a toy dataframe and the desired output for it.

Pandas: How do I assign values based on multiple conditions for existing columns?

Checking if List contains all items from another list

import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.nan, ['tag2', 'tag3'], np.nan, np.nan, ['tag2', 'tag4'], np.nan, np.nan],
        'event3_tags' : [np.nan, np.nan, ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']]}
df = pd.DataFrame(data)
df

    user_id event_id    event1_tags     event3_tags
0   1       1           [tag1]          NaN
1   1       1           []              NaN
2   1       3           NaN             [tag1, tag2, tag3]
3   2       1           [tag2, tag3]    NaN
4   2       3           NaN             [tag1, tag2, tag3]
5   2       3           NaN             [tag1, tag2, tag3]
6   3       1           [tag2, tag4]    NaN
7   3       3           NaN             [tag1, tag2, tag3]
8   3       3           NaN             [tag1, tag2, tag3]

Desired output, with the added 'success' column:

data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.nan, ['tag2', 'tag3'], np.nan, np.nan, ['tag2', 'tag4'], np.nan, np.nan],
        'event3_tags' : [np.nan, np.nan, ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
                         np.nan, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']],
        'success' : [0, 0, 0, 1, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df

    user_id event_id    event1_tags     event3_tags         success
0   1       1           [tag1]          NaN                 0
1   1       1           []              NaN                 0
2   1       3           NaN             [tag1, tag2, tag3]  0
3   2       1           [tag2, tag3]    NaN                 1
4   2       3           NaN             [tag1, tag2, tag3]  0
5   2       3           NaN             [tag1, tag2, tag3]  0
6   3       1           [tag2, tag4]    NaN                 0
7   3       3           NaN             [tag1, tag2, tag3]  0
8   3       3           NaN             [tag1, tag2, tag3]  0

This is my current solution. As it turns out, it's not as slow as I expected (though it is still slow). I'm still interested in faster solutions if anybody has tips.

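def get_conversion(df):
    # Initialise on the dataframe passed in (the original referenced an
    # undefined name, event_dataframe).
    df['success'] = 0
    for row in df.itertuples():
        current_idx = row.Index
        next_idx = current_idx + 1  # relies on a default 0..n-1 integer index
        # The checks short-circuit, so len() and the membership test only run
        # on rows where event 1 is directly followed by event 3.
        if ((next_idx in df.index)
            and (df['user_id'][current_idx] == df['user_id'][next_idx])
            and (df['event_id'][current_idx] == 1)
            and (df['event_id'][next_idx] == 3)
            and (len(df['event1_tags'][current_idx]) != 0)
            and (all(t in df['event3_tags'][next_idx] for t in df['event1_tags'][current_idx]))
           ):
            df.loc[current_idx, 'success'] = 1
    return df

df = get_conversion(df)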
