
Python Pandas: Efficiently compute count of rows with value greater than or equal to a set of values, grouped by key columns

Suppose I have two Pandas DataFrames:

  • df1 with columns k1 (str), k2 (str), and v (float), and
  • df2 with a column w (float).

I can assume that the rows of df1 are sorted, first by k1, then by k2, and finally by v. I can also assume that the values of w in df2 are unique and sorted.

My goal is to create a new DataFrame df3 with columns k1, k2, w, and count_ge. The DataFrame df3 should have one row for each unique combination of k1, k2, and w; the column count_ge should be the number of rows in df1 that have the same values of k1 and k2, and a value of v that is greater than or equal to the value of w.

The following code is a naive implementation that seems to do what I want. Is there an efficient way to carry out the same operation? Ideally, the code should also generalize to more than two keys in df1.

import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# Get all unique combinations of k1, k2, and w.
# In pandas 1.2.0 and later, `merge(how='cross')` can be used for this instead.
df3 = (
    df1[['k1', 'k2']]
    .drop_duplicates()
    .assign(_key=1)
    .merge(df2.assign(_key=1), on='_key')
    .drop(columns='_key')
)

# For each row in df3, count the number of rows in df1 that have the same values of k1 and k2,
# and a value of v that is greater than or equal to w.
df3['count_ge'] = 0
for i, (k1, k2, w, _) in df3.iterrows():
    df3.loc[i, 'count_ge'] = len(df1.query(f'k1 == {k1!r} and k2 == {k2!r} and v >= {w!r}'))
df3

Initialize df3 using a cross merge:

df3 = df1[["k1", "k2"]].drop_duplicates().merge(df2, how='cross')
>>> df3
  k1 k2  w
0  A  A  0
1  A  A  2
2  A  A  5
3  B  C  0
4  B  C  2
5  B  C  5

Then, for the count_ge column, you could use a lambda function like so:

df3['count_ge'] = df3.apply(lambda x: df1[(df1["k1"]==x["k1"])&(df1["k2"]==x["k2"])&(df1["v"]>=x["w"])].shape[0], axis=1)
>>> df3
  k1 k2  w  count_ge
0  A  A  0         4
1  A  A  2         2
2  A  A  5         0
3  B  C  0         2
4  B  C  2         2
5  B  C  5         1
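The row-wise apply above rescans df1 once per row of df3. As a rough sketch of a fully vectorized alternative (not part of the original answer), the comparison can be pushed into a cross merge followed by a groupby. The names keys, counts, grid, and df3_vectorized are introduced here for illustration only. Note the assumption that the intermediate frame, which has len(df1) * len(df2) rows, fits in memory.

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame(
    [("A", "A", 1), ("A", "A", 1), ("A", "A", 3),
     ("A", "A", 4), ("B", "C", 2), ("B", "C", 6)],
    columns=["k1", "k2", "v"],
)
df2 = pd.DataFrame({"w": [0, 2, 5]})

# Illustrative name: the list of key columns (can hold more than two keys).
keys = ["k1", "k2"]

# Pair every row of df1 with every w, keep pairs with v >= w,
# and count the surviving pairs per (keys, w) combination.
counts = (
    df1.merge(df2, how="cross")
    .query("v >= w")
    .groupby(keys + ["w"])
    .size()
    .rename("count_ge")
    .reset_index()
)

# Combinations with no matching row are missing from `counts`,
# so left-join onto the full grid and fill the gaps with 0.
grid = df1[keys].drop_duplicates().merge(df2, how="cross")
df3_vectorized = (
    grid.merge(counts, on=keys + ["w"], how="left")
    .fillna({"count_ge": 0})
    .astype({"count_ge": "int64"})
)
print(df3_vectorized)
```

This trades memory for speed: it avoids the Python-level loop, but the cross merge materializes every (row, w) pair, so it is probably only suitable when df2 is small.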

Another possible approach is to use np.histogram. This method seems fairly clean, but has the potential drawback of copying the DataFrames in pd.concat. Other suggestions are still welcome.

import numpy as np
import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

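# For each unique combination of (k1, k2, w), count the number of rows in df1 that have the same values of k1 and k2,
# and a value of v that is greater than or equal to w.
# Bin v by the sorted w values (the last bin is open-ended), then take a reverse cumulative sum of the bin counts.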
v_bins = np.concatenate((df2['w'], [np.inf]))
df3s = []
for (k1, k2), v in df1.groupby(['k1', 'k2'])['v']:
    df = df2.copy()
    df['count_ge'] = np.histogram(a=v, bins=v_bins)[0][::-1].cumsum()[::-1]
    df['k1'] = k1
    df['k2'] = k2
    df3s.append(df[['k1', 'k2', 'w', 'count_ge']])
pd.concat(df3s)
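Since the w values are sorted, the same counts can also be obtained with np.searchsorted instead of a histogram: for a sorted array v, the number of elements greater than or equal to w is len(v) minus the insertion index of w. The sketch below is one possible way to write this, not the original poster's method. The helper name count_ge_by_group and its parameters are hypothetical, it assumes v contains no NaN values, and it still uses pd.concat to assemble the result.

```python
import numpy as np
import pandas as pd


def count_ge_by_group(df1, df2, keys, value="v", threshold="w"):
    """Hypothetical helper: for each key group in df1 and each threshold in df2,
    count the rows whose value is >= the threshold."""
    w = df2[threshold].to_numpy()
    parts = []
    for group_keys, v in df1.groupby(keys, sort=False)[value]:
        # Older pandas versions yield a scalar for a single key; normalize to a tuple.
        if not isinstance(group_keys, tuple):
            group_keys = (group_keys,)
        # Sort defensively; the question says v is already sorted within each group.
        v_sorted = np.sort(v.to_numpy())
        # side="left" gives the number of elements strictly less than each w.
        n_less = np.searchsorted(v_sorted, w, side="left")
        part = df2[[threshold]].copy()
        part["count_ge"] = len(v_sorted) - n_less
        for name, key_value in zip(keys, group_keys):
            part[name] = key_value
        parts.append(part[list(keys) + [threshold, "count_ge"]])
    return pd.concat(parts, ignore_index=True)


# Same example data as in the question.
df1 = pd.DataFrame(
    [("A", "A", 1), ("A", "A", 1), ("A", "A", 3),
     ("A", "A", 4), ("B", "C", 2), ("B", "C", 6)],
    columns=["k1", "k2", "v"],
)
df2 = pd.DataFrame({"w": [0, 2, 5]})

result = count_ge_by_group(df1, df2, keys=["k1", "k2"])
print(result)
```

Because keys is an ordinary list, the same call should work with one key or with more than two, which addresses the generalization asked for in the question.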


