How to set the values of a row based on similar rows in pandas dataframe?
I have a dataframe where I want to add a column based on duplicate values in the first column. Here is my dataframe:
df
col1 col2 col3
data1 s1 k1
data1 s2 k2
data2 s4 k4
data2 s5 k5
data3 s6 k6
data3 s7 k7
data1 s8 k8
data1 s9 k9
The output I want is:
col1 col2 col3 newcol
data1 s1 k1 10
data1 s2 k2 20
data2 s4 k4 10
data2 s5 k5 20
data3 s6 k6 10
data3 s7 k7 20
data1 s8 k8 30
data1 s9 k9 40
So in row 7, data1 appears again, and it is already present in row 2, so I want to set it to 30 (an increment of 10). I tried something like
outputdf["code"] = [i for i in range(10,10+len(outputdf),10)]
but it doesn't work. Please help me achieve this output.
db_df is a dataframe built from the database:
col1 col2 col3 newcol
data1 s1 k1 30
data1 s2 k2 40
data2 s4 k4 10
In db_df I already have data: the col1 values (data1, data1, data2) have newcol values (30, 40, 10). When I create newcol in df, I want data1 to start at 40+10 and data2 at 10+10 (40 and 10 are the maximum newcol values for the data1 and data2 rows of db_df). I want to compare df with db_df: if data1 is not present in db_df, number the data1 rows 10, 20, ...; otherwise start from the existing maximum newcol value + 10. For example, if db_df exists, the output should be:
col1 col2 col3 newcol
data1 s1 k1 50
data1 s2 k2 60
data2 s4 k4 20
data2 s5 k5 30
data3 s6 k6 10
data3 s7 k7 20
data1 s8 k8 70
data1 s9 k9 80
What is happening now is that it does not check whether data1 or data2 is present in db_df, so instead of the rows (data1, data1, data2, data2) getting 50, 60, 20, 30, I am getting 10, 20, 10, 20.
My output after editing the code is:
0 data1 s1 k1 40
1 data1 s2 k2 50
2 data2 s4 k4 20
3 data2 s5 k5 30
4 data3 s6 k6 10
5 data3 s7 k7 20
6 data1 s8 k8 60
7 data1 s9 k9 70
I am expecting this:
data1 s1 k1 50
data1 s2 k2 60
data2 s4 k4 20
data2 s5 k5 30
data3 s6 k6 10
data3 s7 k7 20
data1 s8 k8 70
data1 s9 k9 80
.transform('first') returns the first non-NaN value, but I want to start counting from the largest value of 'newcol' in db_df. Is there any way to do that? I tried df['newcol'] = (df.groupby('col1')['newcol'].transform(max) + (df.groupby('col1').cumcount()+ 1) * 10) but it is not working.
The largest newcol value for the data1 rows is 40 and for the data2 rows it is 10, so I want to start from 50 for data1 and from 20 for data2.
One last request for help: this works only when the col2 and col3 values of the first dataframe (df) are the same as the col2 and col3 values of the second dataframe (db_df). If I change the col2 and col3 values of db_df to something else, I think it will not work. Please have a look.
when db_df =
col1 col2 col3 newcol
0 data1 m1 n1 20
1 data1 m2 n2 90
2 data2 m4 m4 50
then it does not give the expected output using .transform(max). Will it only work when each row has the same values in the col2 and col3 columns of both DataFrames? Kindly verify.
Use groupby (on the first column) + cumcount, add 1 (since we start counting at zero), and multiply by 10:
df['newcol'] = (df.groupby('col1').cumcount() + 1) * 10
col1 col2 col3 newcol
0 data1 s1 k1 10
1 data1 s2 k2 20
2 data2 s4 k4 10
3 data2 s5 k5 20
4 data3 s6 k6 10
5 data3 s7 k7 20
6 data1 s8 k8 30
7 data1 s9 k9 40
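To reproduce the result above end to end, here is a minimal self-contained sketch. It assumes the sample data from the question is typed in by hand with pd.DataFrame; only the groupby + cumcount line comes from the answer itself.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["data1", "data1", "data2", "data2", "data3", "data3", "data1", "data1"],
    "col2": ["s1", "s2", "s4", "s5", "s6", "s7", "s8", "s9"],
    "col3": ["k1", "k2", "k4", "k5", "k6", "k7", "k8", "k9"],
})

# cumcount() numbers the rows inside each col1 group as 0, 1, 2, ...
# in order of appearance, so repeats that are not adjacent keep counting.
df["newcol"] = (df.groupby("col1").cumcount() + 1) * 10
print(df)
```

Because the counter is kept per group rather than per run of consecutive rows, the second block of data1 rows continues at 30 and 40 instead of restarting at 10.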
EDIT (after question update): You have to merge in the original database dataframe so that you know where to start counting, using df.groupby('col1')['newcol'].transform('max'), and then add that to my first solution:
df = df.merge(db_df, on=['col1', 'col2', 'col3'], how='left')
df['newcol'] = df['newcol'].fillna(0).astype(int)
df['newcol'] = (df.groupby('col1')['newcol'].transform('max')
+ (df.groupby('col1').cumcount()+ 1) * 10)
df
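The merge above matches rows on col1, col2 and col3 together, so rows of db_df whose col2 or col3 values differ from those in df find no match and contribute nothing to the starting value. One possible alternative, which is not part of the original answer, is to take the largest newcol per col1 value in db_df and look it up by col1 only. The sketch below uses the second db_df from the question; the names start and offset are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["data1", "data1", "data2", "data2", "data3", "data3", "data1", "data1"],
    "col2": ["s1", "s2", "s4", "s5", "s6", "s7", "s8", "s9"],
    "col3": ["k1", "k2", "k4", "k5", "k6", "k7", "k8", "k9"],
})
db_df = pd.DataFrame({
    "col1": ["data1", "data1", "data2"],
    "col2": ["m1", "m2", "m4"],
    "col3": ["n1", "n2", "m4"],
    "newcol": [20, 90, 50],
})

# Largest existing newcol for each col1 value in db_df (a Series indexed by col1).
start = db_df.groupby("col1")["newcol"].max()

# Starting offset for every row of df; col1 values missing from db_df start at 0.
offset = df["col1"].map(start).fillna(0).astype(int)

# Same per-group counter as in the first solution, shifted by the offset.
df["newcol"] = offset + (df.groupby("col1").cumcount() + 1) * 10
print(df)
```

Since the lookup uses col1 alone, the result does not depend on whether col2 and col3 agree between the two dataframes. With the first db_df from the question (newcol 30, 40, 10) the same code should give the expected 50, 60, 20, 30, 10, 20, 70, 80.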