pandas: Create new column by comparing DataFrame rows with columns of another DataFrame
Assume I have df1:
import pandas as pd

df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})
   alligator_apple  barbadine  capulin_cherry
0                1         11              21
1                2         12              22
2                3         13              23
3                4         14              24
4                5         15              25
5                6         16              26
6                7         17              27
7                8         18              28
8                9         19              29
9               10         20              30
And a df2:
df2 = pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                    'barbadine': [3, 19, 25, 12],
                    'capulin_cherry': [1, 9, 15, 27]})
   alligator_apple  barbadine  capulin_cherry
0                6          3               1
1                7         19               9
2               15         25              15
3                5         12              27
I'm looking for a way to create a new column in df2 that, for each row of df2, holds the number of rows in df1 in which every column has a value greater than its counterpart in that df2 row. For example:
   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3
To elaborate: for row 0 of df2, df1.alligator_apple has 4 rows whose values are higher than the df2.alligator_apple value of 6. df1.barbadine has 10 rows whose values are higher than the df2.barbadine value of 3, and similarly df1.capulin_cherry has 10 rows whose values are higher than the df2.capulin_cherry value of 1.
Finally, combine all of these conditions with a logical 'and' to get the value 4 for df2.greater in the first row. Repeat for the remaining rows of df2.
Is there a simple way to do this?
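The row 0 walk-through above can be reproduced step by step. This is only a sketch of the counting logic described in the question, not one of the answers; the names row0, per_column, counts and all_greater are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})

# Row 0 of df2, written out by hand (hypothetical helper name).
row0 = {'alligator_apple': 6, 'barbadine': 3, 'capulin_cherry': 1}

# One boolean Series per column: True where the df1 value exceeds the df2 value.
per_column = {col: df1[col] > val for col, val in row0.items()}

# Per-column counts, matching the numbers quoted in the question.
counts = {col: int(mask.sum()) for col, mask in per_column.items()}
print(counts)  # {'alligator_apple': 4, 'barbadine': 10, 'capulin_cherry': 10}

# A df1 row is counted only if it passes all three comparisons at once.
all_greater = (per_column['alligator_apple']
               & per_column['barbadine']
               & per_column['capulin_cherry'])
print(int(all_greater.sum()))  # 4
```

The final count is the number of df1 rows that pass every comparison, so it can never exceed the smallest per-column count.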
I believe this does what you want:
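# For each row of df2, build a boolean Series over the rows of df1 that is
# True where all three df1 values exceed the df2 values, then count the Trues.
df2['greater'] = df2.apply(
    lambda row:
        (df1['alligator_apple'] > row['alligator_apple']) &
        (df1['barbadine'] > row['barbadine']) &
        (df1['capulin_cherry'] > row['capulin_cherry']),
    axis=1,
).sum(axis=1)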
print(df2)
Output:
   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3
Edit: if you want to generalize this logic to a given set of columns, you can use functools.reduce together with operator.and_:
import functools
import operator

columns = ['alligator_apple', 'barbadine', 'capulin_cherry']

df2['greater'] = df2.apply(
    # AND together one comparison per column in `columns`.
    lambda row: functools.reduce(
        operator.and_,
        (df1[column] > row[column] for column in columns),
    ),
    axis=1,
).sum(axis=1)
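To make the role of functools.reduce with operator.and_ concrete, here is a small standalone sketch on three hand-written boolean Series. The Series a, b and c are illustrative stand-ins for the per-column comparisons, not data from the question.

```python
import functools
import operator

import pandas as pd

# Illustrative stand-ins for three per-column comparison results.
a = pd.Series([True, True, False])
b = pd.Series([True, False, False])
c = pd.Series([True, True, True])

# reduce folds the sequence pairwise from the left: (a & b) & c.
combined = functools.reduce(operator.and_, [a, b, c])

print(combined.tolist())       # [True, False, False]
print(combined.equals(a & b & c))  # True
```

The fold gives the same result as chaining & by hand, so the column list can change length without rewriting the expression.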
There's a general solution to this that should work well.
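def gt_mask(row, df):
    # Start from True and AND in one comparison per column of the row.
    mask = True
    for key, val in row.items():
        mask &= df[key] > val
    # Count the rows of df that satisfy every comparison.
    return len(df[mask])

df2['greater'] = df2.apply(gt_mask, df=df1, axis=1)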
Output df2:
,alligator_apple,barbadine,capulin_cherry,greater
0,6,3,1,4
1,7,19,9,1
2,15,25,15,0
3,5,12,27,3
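This builds a boolean mask by iterating through the key/value pairs of a given df2 row and combining one comparison against df1 per column. Because it loops over every column in the row, df2 must contain only columns that also exist in df1; running it again after the greater column has been added raises a KeyError.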
Edit: this answer was a big help: Masking a DataFrame on multiple column conditions - inside a loop
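Both answers above call apply once per row of df2. As an alternative that neither answer proposes, the same counts could be computed with NumPy broadcasting. This is a sketch under the assumption that every compared column is numeric and that an intermediate boolean array of shape (len(df2), len(df1), number of columns) fits in memory; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})
df2 = pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                    'barbadine': [3, 19, 25, 12],
                    'capulin_cherry': [1, 9, 15, 27]})

cols = ['alligator_apple', 'barbadine', 'capulin_cherry']
a = df1[cols].to_numpy()  # shape (10, 3)
b = df2[cols].to_numpy()  # shape (4, 3)

# Compare every df1 row with every df2 row; result has shape (4, 10, 3).
greater = a[None, :, :] > b[:, None, :]

# all(axis=2): every column is greater; sum(axis=1): count matching df1 rows.
df2['greater'] = greater.all(axis=2).sum(axis=1)
print(df2['greater'].tolist())  # [4, 1, 0, 3]
```

Selecting the columns explicitly through cols also means the code can be rerun after the greater column exists.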