Python Pandas: Efficiently compute count of rows with value greater than or equal to a set of values, grouped by key columns
Suppose I have two Pandas DataFrames: `df1` with columns `k1` (`str`), `k2` (`str`), and `v` (`float`), and `df2` with a column `w` (`float`). I can assume that the rows of `df1` are sorted, first by `k1`, then by `k2`, and finally by `v`. I can assume that the values of `w` in `df2` are unique and sorted.
My goal is to create a new DataFrame `df3` with columns `k1`, `k2`, `w`, and `count_ge`. The DataFrame `df3` should have one row for each unique combination of `k1`, `k2`, and `w`; the column `count_ge` should be the number of rows in `df1` that have the same values of `k1` and `k2`, and a value of `v` that is greater than or equal to the value of `w`.
The following code is a naive implementation that seems to do what I want. Is there an efficient way to carry out the same operation? Ideally, the code should also generalize to more than two key columns in `df1`.
import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)
df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# Get all unique combinations of k1, k2, and w.
# In Pandas 1.2.0, we can use `merge(how='cross')` for this instead.
df3 = (
    df1[['k1', 'k2']]
    .drop_duplicates()
    .assign(_key=1)
    .merge(df2.assign(_key=1), on='_key')
    .drop(columns='_key')
)

# For each row in df3, count the number of rows in df1 that have the same
# values of k1 and k2, and a value of v that is greater than or equal to w.
df3['count_ge'] = 0
for i, (k1, k2, w, _) in df3.iterrows():
    df3.loc[i, 'count_ge'] = len(df1.query(f'k1 == {k1!r} and k2 == {k2!r} and v >= {w!r}'))
df3
Initialize `df3` using a cross merge:
df3 = df1[["k1", "k2"]].drop_duplicates().merge(df2, how='cross')
>>> df3
k1 k2 w
0 A A 0
1 A A 2
2 A A 5
3 B C 0
4 B C 2
5 B C 5
Then, for the `count_ge` column, you could use a `lambda` function like so:
df3['count_ge'] = df3.apply(
    lambda x: df1[
        (df1['k1'] == x['k1']) & (df1['k2'] == x['k2']) & (df1['v'] >= x['w'])
    ].shape[0],
    axis=1,
)
>>> df3
k1 k2 w count_ge
0 A A 0 4
1 A A 2 2
2 A A 5 0
3 B C 0 2
4 B C 2 2
5 B C 5 1
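As an aside, the row-wise `apply` above can also be replaced by a single vectorized pass: cross-merge `df1` with `df2`, evaluate `v >= w` once over the whole frame, and sum the boolean flags per `(k1, k2, w)` group. This is a sketch assuming pandas >= 1.2 (for `merge(how='cross')`); the intermediate variable names are illustrative, not from the answer above.

```python
import pandas as pd

df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)
df2 = pd.DataFrame((0, 2, 5), columns=('w',))

# Pair every row of df1 with every value of w, flag the rows where v >= w,
# and count the flagged rows per (k1, k2, w) group.
cross = df1.merge(df2, how='cross')
df3 = (
    cross.assign(count_ge=(cross['v'] >= cross['w']).astype(int))
    .groupby(['k1', 'k2', 'w'], as_index=False)['count_ge']
    .sum()
)
```

Note that the cross merge materializes `len(df1) * len(df2)` rows, so this trades memory for avoiding Python-level iteration.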
Another possible approach is to use `np.histogram`. This method seems fairly clean, but has the potential drawback of copying the DataFrames in `pd.concat`. Other suggestions are still welcome.
import numpy as np
import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)
df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# For each unique combination of (k1, k2, w), count the number of rows in df1
# that have the same values of k1 and k2, and a value of v that is greater
# than or equal to w. Appending np.inf gives the histogram an open-ended final
# bin, so the reversed cumulative sum counts all values of v >= each w.
v_bins = np.concatenate((df2['w'], [np.inf]))
df3s = []
for (k1, k2), v in df1.groupby(['k1', 'k2'])['v']:
    df = df2.copy()
    df['count_ge'] = np.histogram(a=v, bins=v_bins)[0][::-1].cumsum()[::-1]
    df['k1'] = k1
    df['k2'] = k2
    df3s.append(df[['k1', 'k2', 'w', 'count_ge']])
pd.concat(df3s)
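Since the question guarantees that `v` is sorted within each `(k1, k2)` group and that `w` is sorted, yet another option is `np.searchsorted`: in a sorted array, the number of elements `>= w` is the array length minus the insertion point of `w` with `side='left'`. The sketch below relies on those sortedness assumptions and is not part of the answers above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)
df2 = pd.DataFrame((0, 2, 5), columns=('w',))

key_cols = ['k1', 'k2']  # generalizes to any number of key columns
w = df2['w'].to_numpy()
parts = []
for keys, v in df1.groupby(key_cols)['v']:
    # len(v) minus the count of elements strictly less than each w gives the
    # count of elements greater than or equal to each w.
    count_ge = len(v) - np.searchsorted(v.to_numpy(), w, side='left')
    parts.append(pd.DataFrame({**dict(zip(key_cols, keys)), 'w': w, 'count_ge': count_ge}))
df3 = pd.concat(parts, ignore_index=True)
```

Because each group's values are already sorted, each group costs O(m log n) for m thresholds instead of a full scan per threshold.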