Python Pandas: Efficiently count rows with values greater than or equal to a set of values, grouped by key columns

Question

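Suppose I have two Pandas DataFrames:

df1 has columns k1 (str), k2 (str), and v (float), and
df2 has a single column w (float).

I can assume that the rows of df1 are sorted, first by k1, then by k2, and finally by v. I can also assume that the values of w in df2 are unique and sorted.

My goal is to create a new DataFrame df3 with columns k1, k2, w, and count_ge. df3 should have one row for each unique combination of k1, k2, and w; the column count_ge should be the number of rows in df1 that have the same values of k1 and k2 and a value of v that is greater than or equal to w.

The following code is a naive implementation that seems to do what I want. Is there an efficient way to perform the same operation? Ideally, the code should also generalize to more than two keys in df1.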

import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# Get all unique combinations of k1, k2, and w.
# In Pandas 1.2.0 and later, we can use `merge(how='cross')` for this instead.
df3 = (
    df1[['k1', 'k2']]
    .drop_duplicates()
    .assign(_key=1)
    .merge(df2.assign(_key=1), on='_key')
    .drop(columns='_key')
)

# For each row in df3, count the number of rows in df1 that have the same values of k1 and k2,
# and a value of v that is greater than or equal to w.
df3['count_ge'] = 0
for i, (k1, k2, w, _) in df3.iterrows():
    df3.loc[i, 'count_ge'] = len(df1.query(f'k1 == {k1!r} and k2 == {k2!r} and v >= {w!r}'))
df3

Answer 1

Initialize df3 with a cross merge (how='cross' requires Pandas 1.2.0 or later):

df3 = df1[["k1", "k2"]].drop_duplicates().merge(df2, how='cross')
>>> df3
  k1 k2  w
0  A  A  0
1  A  A  2
2  A  A  5
3  B  C  0
4  B  C  2
5  B  C  5

Then, for the count_ge column, you can use a lambda function like this:

df3['count_ge'] = df3.apply(lambda x: df1[(df1["k1"]==x["k1"])&(df1["k2"]==x["k2"])&(df1["v"]>=x["w"])].shape[0], axis=1)
>>> df3
  k1 k2  w  count_ge
0  A  A  0         4
1  A  A  2         2
2  A  A  5         0
3  B  C  0         2
4  B  C  2         2
5  B  C  5         1
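
The row-wise apply above rescans df1 once per row of df3. As a rough sketch of a vectorized alternative (not part of the original answer), one could cross-merge df1 with df2, flag the pairs where v >= w, and sum the flags per group. The `keys` list is a name introduced here for illustration; extending it is assumed to be all that is needed for more key columns. Note that the intermediate frame has len(df1) * len(df2) rows, so this trades memory for speed.

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame(
    [("A", "A", 1), ("A", "A", 1), ("A", "A", 3), ("A", "A", 4), ("B", "C", 2), ("B", "C", 6)],
    columns=["k1", "k2", "v"],
)
df2 = pd.DataFrame({"w": [0, 2, 5]})

# Hypothetical helper name: the key columns to group by (add more as needed).
keys = ["k1", "k2"]

# Pair every row of df1 with every threshold w (requires pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Flag pairs where v >= w, then sum the flags per (keys, w) combination.
# Summing flags instead of filtering keeps combinations whose count is 0.
df3 = (
    pairs.assign(count_ge=pairs["v"] >= pairs["w"])
    .groupby(keys + ["w"], as_index=False)["count_ge"]
    .sum()
)
print(df3)
```

For the example data this should give the same count_ge column as the apply version.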

Answer 2

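Another possible approach is to use np.histogram. This approach looks fairly clean, but it has the potential drawback of copying DataFrames in pd.concat. Other suggestions are still welcome.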

import numpy as np
import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# For each unique combination of (k1, k2, w), count the number of rows in df1 that have the same values of k1 and k2,
# and a value of v that is greater than or equal to w.
v_bins = np.concatenate((df2['w'], [np.inf]))
df3s = []
for (k1, k2), v in df1.groupby(['k1', 'k2'])['v']:
    df = df2.copy()
    df['count_ge'] = np.histogram(a=v, bins=v_bins)[0][::-1].cumsum()[::-1]
    df['k1'] = k1
    df['k2'] = k2
    df3s.append(df[['k1', 'k2', 'w', 'count_ge']])
pd.concat(df3s)
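
Since the question states that v is sorted within each group, the histogram step could arguably be swapped for a binary search. The following is a minimal sketch under that assumption, not the original author's method: np.searchsorted with side="left" returns how many values are strictly below each w, so subtracting from the group size gives the count of values >= w. The `keys` list and the tuple normalization are illustrative additions meant to cover an arbitrary number of key columns; np.sort is kept only as a safeguard in case the input is not actually sorted. It still uses pd.concat, so the copying drawback mentioned above remains.

```python
import numpy as np
import pandas as pd

# Same example data as above.
df1 = pd.DataFrame(
    [("A", "A", 1), ("A", "A", 1), ("A", "A", 3), ("A", "A", 4), ("B", "C", 2), ("B", "C", 6)],
    columns=["k1", "k2", "v"],
)
df2 = pd.DataFrame({"w": [0, 2, 5]})

# Hypothetical helper name: the key columns to group by (add more as needed).
keys = ["k1", "k2"]
w = df2["w"].to_numpy()

frames = []
for key, v in df1.groupby(keys)["v"]:
    # Depending on the pandas version, a single-key groupby may yield a scalar key.
    key = key if isinstance(key, tuple) else (key,)
    v_sorted = np.sort(v.to_numpy())  # safeguard; the question says v is already sorted
    # Number of values strictly below each threshold in w.
    below = np.searchsorted(v_sorted, w, side="left")
    # Scalars (the key values) are broadcast against the array columns.
    frames.append(
        pd.DataFrame({**dict(zip(keys, key)), "w": w, "count_ge": len(v_sorted) - below})
    )

df3 = pd.concat(frames, ignore_index=True)
print(df3)
```

Each group costs one binary search per value of w, so the work per group scales with len(df2) * log(group size) rather than with the product of the two sizes.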

Answer 1: 0 2021-06-04 23:06:04
Answer 2: 0 2021-06-05 02:55:18
