How to set the value of a row based on similar rows in a pandas DataFrame?

Question

I have a dataframe where I want to add a column based on duplicate values in the first column. Here is my dataframe:

df

col1    col2   col3

data1    s1     k1
data1    s2     k2
data2    s4     k4
data2    s5     k5
data3    s6     k6
data3    s7     k7
data1    s8     k8
data1    s9     k9

The output I want is:

col1    col2   col3  newcol

data1    s1     k1    10
data1    s2     k2    20
data2    s4     k4    10
data2    s5     k5    20
data3    s6     k6    10
data3    s7     k7    20
data1    s8     k8    30
data1    s9     k9    40

In row 7, data1 appears again after already occurring in rows 1 and 2, so I want it set to 30 (an increment of 10). I tried something like:

outputdf["code"] = [i for i in range(10,10+len(outputdf),10)]

but it doesn't work. Please help me achieve this output.
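A short check of why this attempt fails, assuming outputdf has the eight rows shown above (the variable names here are illustrative only). The stop value `10 + len(outputdf)` combined with a step of 10 yields far fewer values than there are rows, so pandas rejects the assignment because the lengths do not match:

```python
n = 8  # number of rows in the example dataframe

# The stop value is 10 + n, not 10 * n, so a step of 10 overshoots almost immediately.
attempt = list(range(10, 10 + n, 10))
print(attempt)   # [10]

# One value per row would need a stop of 10 * n + 1; even then the
# counter would not restart for each distinct col1 value.
per_row = list(range(10, 10 * n + 1, 10))
print(per_row)   # [10, 20, 30, 40, 50, 60, 70, 80]
```

Even with the corrected stop value, a plain range counts across the whole frame rather than within each col1 group, which is what the desired output needs.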

db_df is a dataframe I built from the database:

col1    col2   col3  newcol

data1    s1     k1    30
data1    s2     k2    40
data2    s4     k4    10

In this db_df I already have data: col1 (data1, data1, data2) with newcol (30, 40, 10). When I create newcol in df, I want data1 to start at 40+10 and data2 at 10+10 (40 and 10 are the maximum newcol values for the data1 and data2 rows of db_df). I want to compare df with db_df: if data1 is not in db_df, number the data1 rows 10, 20, ...; otherwise start from the existing maximum newcol value + 10. For example, given the db_df above, the output should be:

col1    col2   col3  newcol

data1    s1     k1    50 
data1    s2     k2    60 
data2    s4     k4    20
data2    s5     k5    30
data3    s6     k6    10
data3    s7     k7    20
data1    s8     k8    70
data1    s9     k9    80

What happens now is that it does not check whether data1 or data2 is present in db_df, so instead of (data1, data1, data2, data2 -- 50, 60, 20, 30) I get (data1, data1, data2, data2 -- 10, 20, 10, 20).

My output after editing the code is:
0  data1   s1   k1      40
1  data1   s2   k2      50
2  data2   s4   k4      20
3  data2   s5   k5      30
4  data3   s6   k6      10
5  data3   s7   k7      20
6  data1   s8   k8      60
7  data1   s9   k9      70

I am expecting this:

data1    s1     k1    50 
data1    s2     k2    60 
data2    s4     k4    20
data2    s5     k5    30
data3    s6     k6    10
data3    s7     k7    20
data1    s8     k8    70
data1    s9     k9    80

.transform('first') returns the first non-NaN value, but I want to start counting from the largest value of 'newcol' in db_df. Is there any way to do that? I tried df['newcol'] = (df.groupby('col1')['newcol'].transform(max) + (df.groupby('col1').cumcount()+ 1) * 10) but it is not working.

The largest newcol value is 40 for the data1 rows and 10 for data2, so I want to start from 50 for data1 and 20 for data2.

One last request: this works only when the col2 and col3 values of the first dataframe (df) match those of the second (db_df). If I change the col2 and col3 values in db_df to something else, I think it will not work. Please have a look.

when db_df = 
col1 col2 col3 newcol
0  data1   m1   n1     20
1  data1   m2   n2     90
2  data2   m4   m4     50

then .transform(max) does not give the expected output. Does it only work when each row has the same col2 and col3 values in both DataFrames? Kindly verify.

Answer 1

Use groupby (on the first column) + cumcount, add 1 (since counting starts at zero), and multiply by 10:

df['newcol'] = (df.groupby('col1').cumcount() + 1) * 10

    col1 col2 col3  newcol
0  data1   s1   k1      10
1  data1   s2   k2      20
2  data2   s4   k4      10
3  data2   s5   k5      20
4  data3   s6   k6      10
5  data3   s7   k7      20
6  data1   s8   k8      30
7  data1   s9   k9      40
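For anyone who wants to run this end to end, here is a minimal self-contained sketch. The dataframe construction is reconstructed from the table in the question; only the last assignment comes from the answer itself.

```python
import pandas as pd

# Example data reconstructed from the question
df = pd.DataFrame({
    "col1": ["data1", "data1", "data2", "data2", "data3", "data3", "data1", "data1"],
    "col2": ["s1", "s2", "s4", "s5", "s6", "s7", "s8", "s9"],
    "col3": ["k1", "k2", "k4", "k5", "k6", "k7", "k8", "k9"],
})

# cumcount() numbers the rows within each col1 group 0, 1, 2, ... in original row order,
# so non-adjacent repeats of data1 continue the same counter.
df["newcol"] = (df.groupby("col1").cumcount() + 1) * 10

print(df["newcol"].tolist())  # [10, 20, 10, 20, 10, 20, 30, 40]
```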

EDIT (after question update). You have to merge in the original database dataframe so that you know where to start counting (the per-group maximum, df.groupby('col1')['newcol'].transform('max')), and then add that to my first solution:

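# Pull in existing newcol values from the database frame (NaN where no row matches)
df = df.merge(db_df, on=['col1', 'col2', 'col3'], how='left')
# Rows with no match in db_df start from 0
df['newcol'] = df['newcol'].fillna(0).astype(int)
# Per-group starting point (largest existing value) plus the running count in steps of 10
df['newcol'] = (df.groupby('col1')['newcol'].transform('max')
             + (df.groupby('col1').cumcount() + 1) * 10)
df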
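The merge above matches rows on col1, col2 and col3 together, so it finds nothing when db_df holds different col2/col3 values, as in the question's last example. One possible variant, which is not part of the original answer, is to look up the starting value by col1 alone. The sketch below assumes the second db_df from the question; the names start and offset are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["data1", "data1", "data2", "data2", "data3", "data3", "data1", "data1"],
    "col2": ["s1", "s2", "s4", "s5", "s6", "s7", "s8", "s9"],
    "col3": ["k1", "k2", "k4", "k5", "k6", "k7", "k8", "k9"],
})

# Database frame whose col2/col3 values do not match df (taken from the question)
db_df = pd.DataFrame({
    "col1": ["data1", "data1", "data2"],
    "col2": ["m1", "m2", "m4"],
    "col3": ["n1", "n2", "m4"],
    "newcol": [20, 90, 50],
})

# Largest existing newcol per col1 value in the database frame
start = db_df.groupby("col1")["newcol"].max()

# Look up each row's starting offset by col1 only; keys missing from db_df start at 0
offset = df["col1"].map(start).fillna(0).astype(int)

df["newcol"] = offset + (df.groupby("col1").cumcount() + 1) * 10

print(df["newcol"].tolist())  # [100, 110, 60, 70, 10, 20, 120, 130]
```

Because the lookup ignores col2 and col3 entirely, the result depends only on which col1 keys already exist in db_df and on their largest newcol value.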

Answer 1 (score 2, accepted, 2021-06-29 15:41:34)
