
Append sequence number with padded zeroes to a series using pandas

I have a dataframe as shown below:

import pandas as pd

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                        'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
# Note: these strings mix month-first and day-first dates; on pandas 2.x
# this call may need format='mixed' to parse them
df.login_date = pd.to_datetime(df.login_date)
df['logout_date'] = df.login_date + pd.Timedelta(days=5)
df['login_id'] = [1,1,1,1,8,8,8]

As you can see in the sample dataframe, the login_id is the same even though the login and logout dates are different for the person.

For example, person = 101 has logged in and out at 4 different timestamps, but all four rows have the same login_id, which is incorrect.

Instead, I would like to generate a new login_id column where each person gets a new login_id for every login but retains the 1st login_id in their subsequent logins, so that we can tell it is a sequence.

I tried the below, but it doesn't work well:

df.groupby(['person_id','login_date','logout_date'])['login_id'].rank(method="first", ascending=True) + 100000

I expect my output to be as shown below. You can see how 1 and 8, the 1st login_id for each person, are retained in their subsequent login_ids. We just append a sequence, starting at 00001 and incrementing by one for each additional row.

Please note that I would like to apply this to big data, and the login_ids may not be single-digit in the real data. For example, the 1st login_id could be a random number such as 576869578. In that case, the subsequent login_id will be 57686957800001. Whatever the 1st login_id is for that subject, add 00001, 00002, etc. based on the number of rows that person has. Hope this helps.

[Image: expected output]

Update 2: Just realized my previous answers also applied the 100000 multiplier to the first row of each group. Here is a version that uses GroupBy.transform() to apply it only to the subsequent rows:

cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.groupby(['person_id','login_id']).login_id.transform(
    lambda x: x.shift().mul(100000).fillna(x.min())
).add(cumcount)

#   person_id           login_date          logout_date  login_id
# 0       101  2013-05-07 09:27:00  2013-05-12 09:27:00         1
# 1       101  2013-09-08 11:21:00  2013-09-13 11:21:00    100001
# 2       101  2014-06-06 08:00:00  2014-06-11 08:00:00    100002
# 3       101  2014-06-06 05:00:00  2014-06-11 05:00:00    100003
# 4       202  2011-12-11 10:00:00  2011-12-16 10:00:00         8
# 5       202  2012-10-13 00:00:00  2012-10-18 00:00:00    800001
# 6       202  2012-12-13 11:45:00  2012-12-18 11:45:00    800002

Update: A faster option is to build the sequence with GroupBy.cumcount():

cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.login_id.mul(100000).add(cumcount)

#   person_id           login_date          logout_date  login_id
# 0       101  2013-05-07 09:27:00  2013-05-12 09:27:00    100000
# 1       101  2013-09-08 11:21:00  2013-09-13 11:21:00    100001
# 2       101  2014-06-06 08:00:00  2014-06-11 08:00:00    100002
# 3       101  2014-06-06 05:00:00  2014-06-11 05:00:00    100003
# 4       202  2011-12-11 10:00:00  2011-12-16 10:00:00    800000
# 5       202  2012-10-13 00:00:00  2012-10-18 00:00:00    800001
# 6       202  2012-12-13 11:45:00  2012-12-18 11:45:00    800002
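The two updates above can probably be combined: use cumcount() for the sequence and Series.where() to leave the first row of each group untouched. This is only a sketch on a cut-down copy of the sample data (the date columns are dropped, and the names seq and new_login_id are mine, not from the answer):

```python
import pandas as pd

# Cut-down copy of the sample data (date columns omitted for brevity)
df = pd.DataFrame({
    'person_id': [101, 101, 101, 101, 202, 202, 202],
    'login_id': [1, 1, 1, 1, 8, 8, 8],
})

# 0-based position of each row within its (person_id, login_id) group
seq = df.groupby(['person_id', 'login_id']).cumcount()

# Keep the original id where seq == 0; elsewhere use id * 100000 + seq
new_login_id = df['login_id'].where(seq.eq(0), df['login_id'] * 100000 + seq)

print(new_login_id.tolist())
# [1, 100001, 100002, 100003, 8, 800001, 800002]
```

Since no NaN is introduced along the way, the column should stay int64 rather than turning into floats.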

You can build the sequence in a GroupBy.apply():

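# group_keys=False keeps the original row index so the result aligns with df
df.login_id = df.groupby(['person_id','login_id'], group_keys=False).login_id.apply(
    lambda x: pd.Series([x.min()*100000+seq for seq in range(len(x))], x.index)
)

# Alternative: collect each person's login_ids into a list and modify all but the first
login_id = df.groupby('person_id').login_id.apply(list)

def modify_id(x):
    result = []
    for index, value in enumerate(x):
        # leave the first id untouched; append the sequence number to later ones
        if index > 0:
            value = (int(value) * 100000) + index
        result.append(value)
    return result

# explode() flattens the lists back to one value per row (assumes df is sorted by person_id)
df['login_id'] = login_id.apply(modify_id).explode().to_list()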

This gives the output:

person_id  login_date           logout_date          login_id
101        2013-05-07 09:27:00  2013-05-12 09:27:00  1
101        2013-09-08 11:21:00  2013-09-13 11:21:00  100001
101        2014-06-06 08:00:00  2014-06-11 08:00:00  100002
101        2014-06-06 05:00:00  2014-06-11 05:00:00  100003
202        2011-12-11 10:00:00  2011-12-16 10:00:00  8
202        2012-10-13 00:00:00  2012-10-18 00:00:00  800001
202        2012-12-13 11:45:00  2012-12-18 11:45:00  800002

You can make use of your original rank():

df['login_id'] = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
# print(df)
   person_id          login_date         logout_date  login_id
0        101 2013-05-07 09:27:00 2013-05-12 09:27:00  100000.0
1        101 2013-09-08 11:21:00 2013-09-13 11:21:00  100001.0
2        101 2014-06-06 08:00:00 2014-06-11 08:00:00  100002.0
3        101 2014-06-06 05:00:00 2014-06-11 05:00:00  100003.0
4        202 2011-12-11 10:00:00 2011-12-16 10:00:00  800000.0
5        202 2012-10-13 00:00:00 2012-10-18 00:00:00  800001.0
6        202 2012-12-13 11:45:00 2012-12-18 11:45:00  800002.0

Then change the first row of each group:

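def change_first(group):
    # restore the original id on the first row of the group
    group.loc[group.index[0], 'login_id'] = group.iloc[0]['login_id'] / 100000
    return group

# group_keys=False keeps the original index so the column assignment aligns
df['login_id'] = df.groupby(['person_id'], group_keys=False).apply(change_first)['login_id']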
# print(df)

   person_id          login_date         logout_date  login_id
0        101 2013-05-07 09:27:00 2013-05-12 09:27:00       1.0
1        101 2013-09-08 11:21:00 2013-09-13 11:21:00  100001.0
2        101 2014-06-06 08:00:00 2014-06-11 08:00:00  100002.0
3        101 2014-06-06 05:00:00 2014-06-11 05:00:00  100003.0
4        202 2011-12-11 10:00:00 2011-12-16 10:00:00       8.0
5        202 2012-10-13 00:00:00 2012-10-18 00:00:00  800001.0
6        202 2012-12-13 11:45:00 2012-12-18 11:45:00  800002.0
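The .0 in these outputs appears because rank() returns floats. If integer ids are wanted, one option (a sketch on a cut-down copy of the sample data, not part of the answer above; seq and new_login_id are names I made up) is to cast the rank to an integer before adding it:

```python
import pandas as pd

# Cut-down copy of the sample data (date columns omitted for brevity)
df = pd.DataFrame({
    'person_id': [101, 101, 101, 101, 202, 202, 202],
    'login_id': [1, 1, 1, 1, 8, 8, 8],
})

# rank() returns float64, so convert the 1-based rank to a 0-based integer
seq = df.groupby('person_id')['login_id'].rank(method='first').astype('int64') - 1

new_login_id = df['login_id'] * 100000 + seq

print(new_login_id.dtype)     # int64
print(new_login_id.tolist())
# [100000, 100001, 100002, 100003, 800000, 800001, 800002]
```

The first row of each group still carries the multiplier here, so it would need the same first-row correction as in the answer.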

Or make use of where() to update only the rows where the condition is False (here, every row except the first of each person):

df_ = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1

firsts = df.groupby(['person_id']).head(1).index

df['login_id'] = df['login_id'].where(df.index.isin(firsts), df_)
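If the ids are wanted as zero-padded strings rather than integers, which may be closer to the "padded zeroes" wording of the question, something like the following might work. It is a sketch on made-up data that reuses the 576869578 example from the question; the 5-digit suffix width and the names seq, suffix, base and new_id are my assumptions:

```python
import pandas as pd

# Made-up data; 576869578 is the large id mentioned in the question
df = pd.DataFrame({
    'person_id': [101, 101, 202, 202, 202],
    'login_id': [576869578, 576869578, 8, 8, 8],
})

# 0-based position of each row within its person
seq = df.groupby('person_id').cumcount()

# Zero-pad the sequence number to 5 characters, e.g. 1 -> '00001'
suffix = seq.astype(str).str.zfill(5)
base = df['login_id'].astype(str)

# First row keeps the bare id; later rows get the padded suffix appended
new_id = base.where(seq.eq(0), base + suffix)

print(new_id.tolist())
# ['576869578', '57686957800001', '8', '800001', '800002']
```

Working with strings avoids any concern about integer width for very long ids, at the cost of losing numeric sorting.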
