Pandas groupby - Apply different functions to half of the records in each group

Question

I have something like the following dataframe, with non-unique combinations of street address ranges and street names.

import pandas as pd
df=pd.DataFrame()
df['BlockRange']=['100-150','100-150','100-150','100-150','200-300','200-300','300-400','300-400','300-400']
df['Street']=['Main','Main','Main','Main','Spruce','Spruce','2nd','2nd','2nd']
df
  BlockRange  Street
0    100-150    Main
1    100-150    Main
2    100-150    Main
3    100-150    Main
4    200-300  Spruce
5    200-300  Spruce
6    300-400     2nd
7    300-400     2nd
8    300-400     2nd

Within each of the 3 groups, (100-150, Main), (200-300, Spruce), and (300-400, 2nd), I want half of the records to get a block number equal to the midpoint of the block range, and the other half to get the midpoint plus 1 (so as to put them on the other side of the street).

I know this should be doable with groupby transform, but I can't figure out how (I'm having trouble applying a function to the groupby key, 'BlockRange').

I can only get the result I'm looking for by looping through each unique group, which will take a while on my full dataset. See below for my current solution and the end result I'm looking for:

groups=df.groupby(['BlockRange','Street'])

# Calculate the midpoint of a block range such as '100-150'
def get_mid(x):
    block_nums=[int(y) for y in x.split('-')]
    return sum(block_nums)//len(block_nums)  # integer division keeps block numbers as ints

final=pd.DataFrame()
for groupkey,group in groups:
    block_mid=get_mid(groupkey[0])
    halfway_point=len(group)//2  # must be an int to be usable as a slice bound
    group['Block']=0
    group.iloc[:halfway_point]['Block']=block_mid
    group.iloc[halfway_point:]['Block']=block_mid+1
    final=final.append(group)

final
  BlockRange  Street  Block
0    100-150    Main    125
1    100-150    Main    125
2    100-150    Main    126
3    100-150    Main    126
4    200-300  Spruce    250
5    200-300  Spruce    251
6    300-400     2nd    350
7    300-400     2nd    351
8    300-400     2nd    351

Any suggestions as to how I can do this more efficiently? Perhaps using groupby transform?

Answer 1

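You can use apply with a custom function f:

def f(x):
    # Split each 'low-high' range into two integer columns
    df = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist()])
    df = df.astype(int)
    # Midpoint of the range; integer division keeps block numbers as ints
    block_nums = df.sum(axis=1) // 2
    x['Block'] = block_nums[0]
    halfway_point = len(x) // 2
    # Second half of the group gets midpoint + 1 (column 2 is 'Block')
    x.iloc[halfway_point:, 2] = block_nums[0] + 1
    return x

print(df.groupby(['BlockRange','Street']).apply(f))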

  BlockRange  Street  Block
0    100-150    Main    125
1    100-150    Main    125
2    100-150    Main    126
3    100-150    Main    126
4    200-300  Spruce    250
5    200-300  Spruce    251
6    300-400     2nd    350
7    300-400     2nd    351
8    300-400     2nd    351
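
Since the question asks about transform specifically, here is a minimal sketch of the same per-group logic written for transform on a single column. The helper name block_numbers is hypothetical and not part of the original answer, and the sketch assumes every row of a group shares the same 'low-high' string:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "BlockRange": ["100-150"] * 4 + ["200-300"] * 2 + ["300-400"] * 3,
    "Street": ["Main"] * 4 + ["Spruce"] * 2 + ["2nd"] * 3,
})

def block_numbers(block_range):
    """Return one block number per row for a single group's BlockRange values."""
    # All rows in a group share the same range, so parse the first one only.
    low, high = (int(part) for part in block_range.iloc[0].split("-"))
    mid = (low + high) // 2
    n = len(block_range)
    # The first n // 2 rows keep the midpoint, the remaining rows get midpoint + 1.
    offsets = (np.arange(n) >= n // 2).astype(int)
    return pd.Series(mid + offsets, index=block_range.index)

# transform calls block_numbers once per group and aligns the result to df's index.
df["Block"] = (
    df.groupby(["BlockRange", "Street"])["BlockRange"].transform(block_numbers)
)
print(df)
```

Because the function returns a new Series instead of modifying the group in place, it does not depend on the position of the Block column.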

Timings:

In [32]: %timeit orig(df)
__main__:26: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:27: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:28: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
1 loops, best of 3: 290 ms per loop

In [33]: %timeit new(df)
100 loops, best of 3: 10.2 ms per loop

Testing:

print(df)
df1 = df.copy()

def orig(df):
    groups=df.groupby(['BlockRange','Street'])

    # Calculate the midpoint of a block range such as '100-150'
    def get_mid(x):
        block_nums=[int(y) for y in x.split('-')]
        return sum(block_nums)//len(block_nums)
    final=pd.DataFrame()

    for groupkey,group in groups:
        block_mid=get_mid(groupkey[0])
        halfway_point=len(group)//2
        group['Block']=0
        group.iloc[:halfway_point]['Block']=block_mid
        group.iloc[halfway_point:]['Block']=block_mid+1
        final=final.append(group)
    return final

def new(df):
    def f(x):
        df = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist()])
        df = df.astype(int)
        block_nums = df.sum(axis=1) // 2
        x['Block'] = block_nums[0]
        halfway_point = len(x) // 2
        x.iloc[halfway_point:, 2] = block_nums[0] + 1
        return x

    return df.groupby(['BlockRange','Street']).apply(f)

print(orig(df))
print(new(df1))

Answer 2

For comparison, note that you can do this without apply:

ss = df["BlockRange"].str.split("-")
midnum = (ss.str[1].astype(int) + ss.str[0].astype(int)) // 2
grouped = df.groupby(["BlockRange", "Street"])
df["Block"] = midnum + (grouped.cumcount()>= grouped["Street"].transform(len) // 2)

which gives me

>>> df
  BlockRange  Street  Block
0    100-150    Main    125
1    100-150    Main    125
2    100-150    Main    126
3    100-150    Main    126
4    200-300  Spruce    250
5    200-300  Spruce    251
6    300-400     2nd    350
7    300-400     2nd    351
8    300-400     2nd    351

This works because cumcount and transform(len) give us the pieces we need:

>>> grouped.cumcount()
0    0
1    1
2    2
3    3
4    0
5    1
6    0
7    1
8    2
dtype: int64
>>> grouped.transform(len)
   Block
0      4
1      4
2      4
3      4
4      2
5      2
6      3
7      3
8      3
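
To make the combination explicit, the sketch below names each intermediate piece and builds the boolean "second half" mask step by step. The variable names (position, group_size, second_half) are illustrative only, and transform("size") is used here in place of transform(len) as an equivalent way to get the group size:

```python
import pandas as pd

df = pd.DataFrame({
    "BlockRange": ["100-150"] * 4 + ["200-300"] * 2 + ["300-400"] * 3,
    "Street": ["Main"] * 4 + ["Spruce"] * 2 + ["2nd"] * 3,
})

grouped = df.groupby(["BlockRange", "Street"])

# 0-based position of each row within its group.
position = grouped.cumcount()
# Size of the row's group, repeated on every row of that group.
group_size = grouped["Street"].transform("size")
# True for rows at or past the halfway point of their group.
second_half = position >= group_size // 2

# Midpoint of each 'low-high' range, computed with vectorized string ops.
bounds = df["BlockRange"].str.split("-", expand=True).astype(int)
midpoint = bounds.sum(axis=1) // 2

# Adding the mask as 0/1 moves the second half to midpoint + 1.
df["Block"] = midpoint + second_half.astype(int)
print(df)
```

For a group with an odd number of rows, such as (300-400, 2nd), the integer division puts the extra row in the second half, which matches the expected output in the question.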

Answer 1: score 4, accepted, 2016-02-09 18:07:52

Answer 2: score 1, 2016-02-09 19:00:14
