
Python Pandas: Efficiently compute count of rows with value greater than or equal to a set of values, grouped by key columns

Suppose I have two Pandas DataFrames:

  • df1 with columns k1 (str), k2 (str), and v (float), and
  • df2 with a single column w (float).

I can assume that the rows of df1 are sorted, first by k1, then by k2, and finally by v. I can also assume that the values of w in df2 are unique and sorted.

My goal is to create a new DataFrame df3 with columns k1, k2, w, and count_ge. The DataFrame df3 should have one row for each unique combination of k1, k2, and w; the column count_ge should be the number of rows in df1 that have the same values of k1 and k2, and a value of v that is greater than or equal to the value of w.
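For example, with the data below: in the group where k1 = 'A' and k2 = 'A', the values of v are 1, 1, 3, and 4, so for w = 2 the value of count_ge should be 2 (the rows with v = 3 and v = 4).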

The following code is a naive implementation that seems to do what I want. Is there a more efficient way to carry out the same operation? Ideally, the approach should also generalize to more than two key columns in df1.

import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# Get all unique combinations of k1, k2, and w.
# In Pandas 1.2.0, we can use `merge(how='cross')` for this instead.
df3 = (
    df1[['k1', 'k2']]
    .drop_duplicates()
    .assign(_key=1)
    .merge(df2.assign(_key=1), on='_key')
    .drop(columns='_key')
)

# For each row in df3, count the number of rows in df1 that have the same values of k1 and k2,
# and a value of v that is greater than or equal to w.
df3['count_ge'] = 0
for i, (k1, k2, w, _) in df3.iterrows():
    df3.loc[i, 'count_ge'] = len(df1.query(f'k1 == {k1!r} and k2 == {k2!r} and v >= {w!r}'))
df3

Initialize df3 using a cross merge (available since pandas 1.2.0):

df3 = df1[["k1", "k2"]].drop_duplicates().merge(df2, how='cross')
>>> df3
  k1 k2  w
0  A  A  0
1  A  A  2
2  A  A  5
3  B  C  0
4  B  C  2
5  B  C  5
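The same cross merge generalizes to any number of key columns. A minimal sketch (key_cols is an illustrative name, not from the original code):

key_cols = ['k1', 'k2']  # extend this list for more key columns

df3 = df1[key_cols].drop_duplicates().merge(df2, how='cross')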

Then for the count_ge column, you could use a lambda function like so:

df3['count_ge'] = df3.apply(
    lambda x: df1[
        (df1["k1"] == x["k1"])
        & (df1["k2"] == x["k2"])
        & (df1["v"] >= x["w"])
    ].shape[0],
    axis=1,
)
>>> df3
  k1 k2  w  count_ge
0  A  A  0         4
1  A  A  2         2
2  A  A  5         0
3  B  C  0         2
4  B  C  2         2
5  B  C  5         1
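Since the question states that v is already sorted within each (k1, k2) group and that w is sorted, a vectorized alternative is np.searchsorted. The following is a sketch under those assumptions, not part of the original answer; key_cols is illustrative as above, and the code assumes at least two key columns so that groupby yields tuple keys:

import numpy as np

key_cols = ['k1', 'k2']
w = df2['w'].to_numpy()

parts = []
for keys, v in df1.groupby(key_cols)['v']:
    # side='left' gives, for each w, the number of values in v that are
    # strictly less than w; subtracting from the group size counts v >= w.
    counts = len(v) - np.searchsorted(v.to_numpy(), w, side='left')
    part = pd.DataFrame({'w': w, 'count_ge': counts})
    for col, val in zip(key_cols, keys):
        part[col] = val
    parts.append(part[key_cols + ['w', 'count_ge']])

df3 = pd.concat(parts, ignore_index=True)

This does one binary search per group per value of w instead of a full scan of df1 for every output row.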

Another possible approach is to use np.histogram. This method seems fairly clean, but has the potential drawback of copying the DataFrames in pd.concat. Other suggestions are still welcome.

import numpy as np
import pandas as pd

# Generate some example data.
df1 = pd.DataFrame(
    (
        ('A', 'A', 1),
        ('A', 'A', 1),
        ('A', 'A', 3),
        ('A', 'A', 4),
        ('B', 'C', 2),
        ('B', 'C', 6),
    ),
    columns=('k1', 'k2', 'v'),
)

df2 = pd.DataFrame(
    (0, 2, 5),
    columns=('w',),
)

# For each unique combination of (k1, k2, w), count the number of rows in df1
# that have the same values of k1 and k2, and a value of v that is greater
# than or equal to w.
v_bins = np.concatenate((df2['w'], [np.inf]))
df3s = []
for (k1, k2), v in df1.groupby(['k1', 'k2'])['v']:
    df = df2.copy()
    df['count_ge'] = np.histogram(a=v, bins=v_bins)[0][::-1].cumsum()[::-1]
    df['k1'] = k1
    df['k2'] = k2
    df3s.append(df[['k1', 'k2', 'w', 'count_ge']])
pd.concat(df3s)
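A note on why this works: np.histogram uses half-open bins [w[i], w[i+1]), so appending np.inf as a final bin edge ensures that values of v at or above the largest w still land in a bin; the reversed cumulative sum then converts the per-bin counts into counts of v >= w for each edge. If a clean 0..n-1 index is desired on the result, pd.concat(df3s, ignore_index=True) can be used instead.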
