Comparing rows in a dataframe

I am having trouble with one of my tasks. In my first case, I need to compare some variables in my dataframe; if they are the same across rows, those rows should get the same value in the identifier column.

Here is what my dataframe, sorted on multiple columns, looks like:

| no | age| gender | income_group | cars
| 1  | 15 |  male  |       0      | ford
| 2  | 15 |  male  |       0      | renault
| 3  | 15 |  female|       1      | bmw
| 4  | 16 |  female|       1      | bmw
| 5  | 16 |  female|       1      | mercedes
| 6  | 16 |  female|       1      | honda

I want code that compares the rows of this sorted dataframe and, if [age, gender, income_group] are identical for some rows, copies the [no] value of the first such row over the others.

The code should make my dataframe look like this:

| no | age| gender | income_group | cars
| 1  | 15 |  male  |       0      | ford
| 1  | 15 |  male  |       0      | renault
| 3  | 15 |  female|       1      | bmw
| 4  | 16 |  female|       1      | bmw
| 4  | 16 |  female|       1      | mercedes
| 4  | 16 |  female|       1      | honda

Is there any way to do this in Python?

Edited: my second case is more complicated. When rows with identical [age, gender, income_group] values also share the same [cars] value, I want them to be treated as different individuals, that is, to have different [no] values.

If I expand the dataframe, I get rows that look like this:

| no | age| gender | income_group | cars
| 1  | 15 |  male  |       0      | ford
| 2  | 15 |  male  |       0      | renault
| 3  | 15 |  female|       1      | bmw
| 4  | 16 |  female|       1      | bmw
| 5  | 16 |  female|       1      | mercedes
| 6  | 16 |  female|       1      | honda

| 7  | 17 |  male  |       0      | bmw
| 8  | 17 |  male  |       0      | honda
| 9  | 17 |  male  |       0      | bmw
| 10 | 17 |  male  |       0      | honda
| 11 | 17 |  male  |       0      | renault

One person can't have the same [cars] value twice, so the code should produce:

| 7  | 17 |  male  |       0      | bmw
| 7  | 17 |  male  |       0      | honda
| 9  | 17 |  male  |       0      | bmw
| 9  | 17 |  male  |       0      | honda
| 9  | 17 |  male  |       0      | renault

With jezrael's solution:

df['a'] = df.duplicated(['age','gender','income_group', 'cars'], keep=False).cumsum()

df['no'] = df.groupby(['age','gender','income_group','a'], sort=False)['no'].transform('first')
df = df.drop('a', axis=1)

I get:

no  age  gender  income_group      cars  a
 0   15    male             0      ford  0
 0   15    male             0   renault  0
 2   15  female             1       bmw  0
 3   16  female             1       bmw  0
 3   16  female             1  mercedes  0
 3   16  female             1     honda  0
 6   17    male             0       bmw  1
 7   17    male             0     honda  2
 8   17    male             0       bmw  3
 9   17    male             0     honda  4
 9   17    male             0   reanult  4

Use GroupBy.transform with GroupBy.first:

df['no'] = df.groupby(['age','gender','income_group'], sort=False)['no'].transform('first')
print (df)
   no  age  gender  income_group      cars
0   1   15    male             0      ford
1   1   15    male             0   renault
2   3   15  female             1       bmw
3   4   16  female             1       bmw
4   4   16  female             1  mercedes
5   4   16  female             1     honda
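
To try this end to end, here is a minimal self-contained sketch that rebuilds the six-row sample from the question and applies the same transform. The `keys` variable is only an illustrative name, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (already sorted by the key columns)
df = pd.DataFrame({
    "no": [1, 2, 3, 4, 5, 6],
    "age": [15, 15, 15, 16, 16, 16],
    "gender": ["male", "male", "female", "female", "female", "female"],
    "income_group": [0, 0, 1, 1, 1, 1],
    "cars": ["ford", "renault", "bmw", "bmw", "mercedes", "honda"],
})

# Columns that must match for two rows to count as the same individual
keys = ["age", "gender", "income_group"]

# Give every row the 'no' of the first row in its group;
# sort=False keeps groups in their order of appearance
df["no"] = df.groupby(keys, sort=False)["no"].transform("first")
print(df)
```

Because transform returns a result aligned to the original index, the assignment does not reorder or drop any rows.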

Or get the first values with DataFrame.duplicated and then forward fill the missing values:

df['no'] = df.loc[(~df.duplicated(['age','gender','income_group'])), 'no']
df['no'] = df['no'].ffill().astype(int)
print (df)
   no  age  gender  income_group      cars
0   1   15    male             0      ford
1   1   15    male             0   renault
2   3   15  female             1       bmw
3   4   16  female             1       bmw
4   4   16  female             1  mercedes
5   4   16  female             1     honda
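
One caveat worth checking: the forward-fill variant copies from the row directly above, so it seems to depend on identical rows being adjacent, as they are in the sorted sample. The sketch below uses a small made-up, unsorted frame (not data from the question) to compare the two approaches; `by_transform` and `by_ffill` are illustrative names.

```python
import pandas as pd

# Hypothetical unsorted data: rows 0 and 2 share the same key values
df = pd.DataFrame({
    "no": [1, 2, 3],
    "age": [15, 16, 15],
    "gender": ["male", "female", "male"],
    "income_group": [0, 1, 0],
})
keys = ["age", "gender", "income_group"]

# Group-based version: looks up the first 'no' of each group, wherever it sits
by_transform = df.groupby(keys, sort=False)["no"].transform("first")

# Forward-fill version: blank out repeats, then copy from the row above
first_only = df["no"].where(~df.duplicated(keys))
by_ffill = first_only.ffill().astype(int)

print(by_transform.tolist())  # [1, 2, 1]
print(by_ffill.tolist())      # [1, 2, 2]
```

On sorted input the two give the same result; on unsorted input the groupby version is the safer choice, or the frame can be sorted by the key columns first.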

EDIT:

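# Flag rows that repeat an earlier [age, gender, income_group, cars] combination
df['a'] = df.duplicated(['age','gender','income_group', 'cars'])
# Select every row of a group that contains at least one such repeat
mask = df.groupby(['age','gender','income_group'])['a'].transform('any')

# In the selected rows, count each car's occurrences from the end; rows sharing
# a count are treated as one person and receive that person's first 'no'
df.loc[mask, 'no'] = df.groupby(df.loc[mask].groupby('cars').cumcount(ascending=False))['no'].transform('first')
df = df.drop('a', axis=1)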
print (df)
     no  age  gender  income_group      cars
0   1.0   15    male             0      ford
1   2.0   15    male             0   renault
2   3.0   15  female             1       bmw
3   4.0   16  female             1       bmw
4   5.0   16  female             1  mercedes
5   6.0   16  female             1     honda
6   7.0   17    male             0       bmw
7   7.0   17    male             0     honda
8   9.0   17    male             0       bmw
9   9.0   17    male             0     honda
10  9.0   17    male             0   reanult
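
As an alternative for the second case, here is a hedged sketch of a plain sequential scan. It is not the answer's method: it assumes the rows are already sorted so that each person's rows are contiguous, and it starts a new person whenever the key columns change or a car repeats within the current person. The helper name `assign_person_ids` and its parameters are hypothetical. Unlike the output above, it also merges rows 0 to 1 and 3 to 5, as in the first case, and keeps `no` as integers.

```python
import pandas as pd

# Expanded sample from the question (car names spelled as in the input table)
df = pd.DataFrame({
    "no": range(1, 12),
    "age": [15, 15, 15, 16, 16, 16, 17, 17, 17, 17, 17],
    "gender": ["male", "male", "female", "female", "female", "female",
               "male", "male", "male", "male", "male"],
    "income_group": [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "cars": ["ford", "renault", "bmw", "bmw", "mercedes", "honda",
             "bmw", "honda", "bmw", "honda", "renault"],
})


def assign_person_ids(frame, keys, item_col="cars", id_col="no"):
    """Hypothetical helper: scan rows in order and reuse the first id of each person."""
    new_ids = []
    current_key, current_id, seen = None, None, set()
    for row in frame.itertuples(index=False):
        key = tuple(getattr(row, k) for k in keys)
        item = getattr(row, item_col)
        # A new person starts when the key changes or the car was already seen
        if key != current_key or item in seen:
            current_key, current_id, seen = key, getattr(row, id_col), set()
        seen.add(item)
        new_ids.append(current_id)
    return frame.assign(**{id_col: new_ids})


result = assign_person_ids(df, ["age", "gender", "income_group"])
print(result)
```

The loop is slower than a vectorized groupby on large frames, but it keeps each key group separate by construction, which may matter if several groups contain repeated cars.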
