pandas: replace one cell's value in multiple rows with the value from one particular row, based on other columns

My aim:

     uniqueIdentity    beginTime    progrNumber
0   2018-02-07-6253554  17:40:29    1
1   2018-02-07-6253554  17:40:29    2
2   2018-02-07-6253554  17:40:29    3
3   2018-02-07-6253554  17:40:29    4
4   2018-02-07-6253554  17:40:29    5
5   2018-02-07-5555333  17:48:29    2
6   2018-02-07-5555333  17:48:29    3
7   2018-02-07-5555333  17:48:29    4
8   2018-02-07-2345622  18:40:29    1
9   2018-02-07-2345622  18:40:29    2
10  2018-02-07-2345622  18:40:29    3
11  2018-02-07-2345622  18:40:29    4

My dataset now:

     uniqueIdentity    beginTime    progrNumber
0   2018-02-07-6253554  17:40:29    1
1   2018-02-07-6253554  17:41:15    2
2   2018-02-07-6253554  17:41:55    3
3   2018-02-07-6253554  17:42:54    4
4   2018-02-07-6253554  17:43:29    5
5   2018-02-07-5555333  17:49:15    2
6   2018-02-07-5555333  17:49:55    3
7   2018-02-07-5555333  17:50:54    4
8   2018-02-07-2345622  18:40:29    1
9   2018-02-07-2345622  18:41:15    2
10  2018-02-07-2345622  18:41:55    3
11  2018-02-07-2345622  18:42:54    4

That means: for rows sharing the same 'uniqueIdentity', 'beginTime' should be replaced by the 'beginTime' of the row in that group whose 'progrNumber' is the smallest.

As you mention in the comments, the row with the lowest progrNumber will also have the lowest beginTime. This means you can just take the lowest beginTime per uniqueIdentity using groupby and transform.

Note that if beginTime is of type string, this will only work if it has consistent formatting (e.g. '09:40:20' instead of '9:40:20').

df['beginTime'] = df.groupby('uniqueIdentity').beginTime.transform('min')

        uniqueIdentity beginTime progrNumber
0   2018-02-07-6253554  17:40:29           1
1   2018-02-07-6253554  17:40:29           2
2   2018-02-07-5555333  17:48:29           3
3   2018-02-07-5555333  17:48:29           4
4   2018-02-07-6253554  17:40:29           3
5   2018-02-07-6253554  17:40:29           4
6   2018-02-07-5555333  17:48:29           1
7   2018-02-07-5555333  17:48:29           2
8   2018-02-07-2345622  18:40:29           1
9   2018-02-07-2345622  18:40:29           3
10  2018-02-07-2345622  18:40:29           4
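To illustrate the note about string formatting, here is a minimal sketch with made-up identities and times (not taken from the question). It shows how an unpadded hour breaks the string comparison, and how converting the column with pd.to_timedelta avoids the problem. The conversion step is a suggestion added here, not part of the answer above.

```python
import pandas as pd

# Hypothetical data: group "a" mixes an unpadded hour with a padded one.
df = pd.DataFrame({
    "uniqueIdentity": ["a", "a", "b"],
    "beginTime": ["9:40:20", "10:05:00", "17:48:29"],
})

# String comparison is lexicographic, so '10:05:00' sorts before '9:40:20'.
string_min = df.groupby("uniqueIdentity")["beginTime"].transform("min")

# Converting to timedeltas compares the actual durations instead.
as_delta = pd.to_timedelta(df["beginTime"])
true_min = as_delta.groupby(df["uniqueIdentity"]).transform("min")

print(string_min.tolist())
print(true_min.tolist())
```

If the column is already zero-padded everywhere, the plain string version is enough.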

Here's another option using a left join and some renaming:

    # find rows where progrNumber is 1 
    df_prog1=df[df.progrNumber==1]
    # do a left join on the original 
    df=df.merge(df_prog1,on='uniqueIdentity',how='left',suffixes=('','_y'))
    # keep only the beginTime from the right frame 
    df=df[['uniqueIdentity','beginTime_y','progrNumber']]
    # rename columns
    df=df.rename(columns={'beginTime_y':'beginTime'})
    print(df)

Results in:

        uniqueIdentity beginTime  progrNumber
0   2018-02-07-6253554  17:40:29            1
1   2018-02-07-6253554  17:40:29            2
2   2018-02-07-6253554  17:40:29            3
3   2018-02-07-6253554  17:40:29            4
4   2018-02-07-5555333  17:48:29            1
5   2018-02-07-5555333  17:48:29            2
6   2018-02-07-5555333  17:48:29            3
7   2018-02-07-5555333  17:48:29            4
8   2018-02-07-2345622  18:40:29            1
9   2018-02-07-2345622  18:40:29            2
10  2018-02-07-2345622  18:40:29            3
11  2018-02-07-2345622  18:40:29            4

If you're not sure which record within a uniqueIdentity will have the minimum time, you can use a groupby instead of selecting where progrNumber==1:

    df_prog1=df.groupby('uniqueIdentity')['beginTime'].min().reset_index()

Then do the left join as above.
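As a rough side-by-side of the two variants, the sketch below uses a small made-up frame in which group "B" has no row with progrNumber == 1 (as is the case for 2018-02-07-5555333 in the question's current dataset). The identities "A" and "B" and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: group "B" starts at progrNumber 2.
df = pd.DataFrame({
    "uniqueIdentity": ["A", "A", "B", "B"],
    "beginTime": ["17:40:29", "17:41:15", "17:49:15", "17:49:55"],
    "progrNumber": [1, 2, 2, 3],
})

# Variant 1: take the beginTime of the progrNumber == 1 row.
first_rows = df.loc[df["progrNumber"] == 1, ["uniqueIdentity", "beginTime"]]
via_filter = df.merge(first_rows, on="uniqueIdentity", how="left", suffixes=("", "_y"))

# Variant 2: take the minimum beginTime per group.
group_min = df.groupby("uniqueIdentity")["beginTime"].min().reset_index()
via_groupby = df.merge(group_min, on="uniqueIdentity", how="left", suffixes=("", "_y"))

print(via_filter[["uniqueIdentity", "beginTime_y", "progrNumber"]])
print(via_groupby[["uniqueIdentity", "beginTime_y", "progrNumber"]])
```

With the filter variant, the left join leaves beginTime_y as NaN for every row of "B", while the groupby variant fills it with that group's earliest time.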

If the first beginTime for each user always corresponds to the minimum program number for that user, you can do:

d = df.groupby('uniqueIdentity')['beginTime'].first().to_dict()
df['beginTime'] = df['uniqueIdentity'].map(d)

To be more explicit about getting the time where the program number is minimal (regardless of its position), replace d in the above with:

d = df.groupby('uniqueIdentity').apply(lambda x: x['beginTime'][x['progrNumber'].idxmin()]).to_dict()

These two yield the same result for your example data, but they will differ if there are users for whom the first beginTime (or the minimum beginTime, per Hugolmn) does not correspond to the user's minimum progrNumber.
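To make that difference concrete, here is a small sketch on made-up data where the row with the smallest progrNumber is neither the first row nor the earliest time. It looks up the row with groupby(...).idxmin() and .loc rather than the apply call shown above; that is a swapped-in technique intended to give the same lookup, and the data and variable names are assumptions.

```python
import pandas as pd

# Hypothetical data: progrNumber 1 sits in the middle and is not the earliest time.
df = pd.DataFrame({
    "uniqueIdentity": ["A", "A", "A"],
    "beginTime": ["17:41:15", "17:40:29", "17:39:00"],
    "progrNumber": [2, 1, 3],
})

# First beginTime in each group, by row position.
first_time = df.groupby("uniqueIdentity")["beginTime"].first().to_dict()

# Smallest beginTime in each group.
min_time = df.groupby("uniqueIdentity")["beginTime"].min().to_dict()

# beginTime of the row holding the smallest progrNumber in each group.
idx = df.groupby("uniqueIdentity")["progrNumber"].idxmin()
at_min_progr = df.loc[idx].set_index("uniqueIdentity")["beginTime"].to_dict()

print(first_time, min_time, at_min_progr)
```

Here the three dictionaries hold three different times for "A", so the choice of approach matters once the ordering assumption no longer holds.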

Using groupby and map

The hypothesis is that beginTime will always be minimal for a minimal progrNumber. Based on the question's comments, this condition holds.

In this answer, I collect the minimum beginTime of each uniqueIdentity and then map it onto the original DataFrame based on uniqueIdentity.

times = df.groupby('uniqueIdentity').beginTime.min()
df['beginTime'] = df.uniqueIdentity.map(times)
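For a runnable end-to-end version, the sketch below rebuilds the "my dataset now" table from the question and applies the same two lines. Only the DataFrame construction is added. Note that the second group comes out as 17:49:15, not the 17:48:29 shown in "my aim", because that time does not appear anywhere in the current dataset.

```python
import pandas as pd

# The question's current dataset, rebuilt by hand.
df = pd.DataFrame({
    "uniqueIdentity": ["2018-02-07-6253554"] * 5
                      + ["2018-02-07-5555333"] * 3
                      + ["2018-02-07-2345622"] * 4,
    "beginTime": ["17:40:29", "17:41:15", "17:41:55", "17:42:54", "17:43:29",
                  "17:49:15", "17:49:55", "17:50:54",
                  "18:40:29", "18:41:15", "18:41:55", "18:42:54"],
    "progrNumber": [1, 2, 3, 4, 5, 2, 3, 4, 1, 2, 3, 4],
})

# Minimum beginTime per uniqueIdentity, as a Series indexed by uniqueIdentity.
times = df.groupby("uniqueIdentity")["beginTime"].min()

# Look up each row's uniqueIdentity in that Series.
df["beginTime"] = df["uniqueIdentity"].map(times)

print(df)
```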

If we cannot assume that the row with the min progrNumber also has the min beginTime, a more sophisticated approach is required:

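import pandas as pd

# for each group, repeat the beginTime of the row with the smallest progrNumber
df['beginTime'] = (
     df.groupby('uniqueIdentity', as_index=False, group_keys=False)
       .apply(lambda s: pd.Series(s[s.progrNumber==s.progrNumber.min()]
              .beginTime.item(), index=s.index)
       )
)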

df
#    uniqueIdentity beginTime   progrNumber
# 0  2018-02-07-6253554 17:40:29    1
# 1  2018-02-07-6253554 17:40:29    2
# 2  2018-02-07-6253554 17:40:29    3
# 3  2018-02-07-6253554 17:40:29    4
# 4  2018-02-07-6253554 17:40:29    5
# 5  2018-02-07-5555333 17:49:15    2
# 6  2018-02-07-5555333 17:49:15    3
# 7  2018-02-07-5555333 17:49:15    4
# 8  2018-02-07-2345622 18:40:29    1
# 9  2018-02-07-2345622 18:40:29    2
# 10 2018-02-07-2345622 18:40:29    3
# 11 2018-02-07-2345622 18:40:29    4

If you don't want a one-liner, an approach with map would be ideal:

# beginTime of the row with the smallest progrNumber, per uniqueIdentity
mapping = (
    df.groupby('uniqueIdentity')
      .apply(lambda s: s[s.progrNumber==s.progrNumber.min()].beginTime.iloc[0])
)

df['beginTime'] = df.uniqueIdentity.map(mapping)

Note: you can replace iloc[0] with item() if you can guarantee that only one row has the min progrNumber.
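The reason for that guarantee can be seen in a short sketch: Series.item() raises a ValueError when the selection holds more than one element, whereas iloc[0] simply takes the first one. The single-group frame below is made up, with a deliberate tie on the minimum progrNumber.

```python
import pandas as pd

# Hypothetical single group in which two rows tie for the minimum progrNumber.
group = pd.DataFrame({
    "beginTime": ["17:40:29", "17:41:15", "17:41:55"],
    "progrNumber": [1, 1, 2],
})

# beginTime values of all rows that hold the minimum progrNumber.
candidates = group.loc[group["progrNumber"] == group["progrNumber"].min(), "beginTime"]

# iloc[0] silently picks the first of the tied rows.
first_value = candidates.iloc[0]

# item() refuses to choose and raises instead.
try:
    candidates.item()
    item_failed = False
except ValueError:
    item_failed = True

print(first_value, item_failed)
```

So item() doubles as a check that the minimum is unique, while iloc[0] is the more forgiving choice.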
