Pandas groupby - apply different functions to half the records in each group
I have something like the following dataframe, with non-unique combinations of street address ranges and street names.
import pandas as pd
df=pd.DataFrame()
df['BlockRange']=['100-150','100-150','100-150','100-150','200-300','200-300','300-400','300-400','300-400']
df['Street']=['Main','Main','Main','Main','Spruce','Spruce','2nd','2nd','2nd']
df
BlockRange Street
0 100-150 Main
1 100-150 Main
2 100-150 Main
3 100-150 Main
4 200-300 Spruce
5 200-300 Spruce
6 300-400 2nd
7 300-400 2nd
8 300-400 2nd
Within each of the 3 'groups' - (100-150, Main), (200-300, Spruce), and (300-400, 2nd) - I want half of the records in each group to get a block number equal to the midpoint of the block range, and the other half to get a block number equal to the midpoint plus 1 (so as to put it on the other side of the street).
I know this should be doable with groupby transform, but I can't figure out how (I'm having trouble applying a function to the groupby key, 'BlockRange').
I can only get the result I'm looking for by looping through each unique group, which will take a while on my full dataset. See below for my current solution and the end result I'm looking for:
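To make the rule concrete, here is a minimal plain-Python sketch for a single group. The helper name `block_numbers` is hypothetical and not part of the original post; it assumes the range is always written as 'low-high' and that the midpoint is rounded down.

```python
def block_numbers(block_range, n_records):
    """Return one block number per record for a single (range, street) group."""
    low, high = (int(part) for part in block_range.split("-"))
    mid = (low + high) // 2          # midpoint of the range, rounded down
    half = n_records // 2            # size of the first half
    # First half gets the midpoint, the remaining records get midpoint + 1
    return [mid] * half + [mid + 1] * (n_records - half)


print(block_numbers("100-150", 4))  # [125, 125, 126, 126]
print(block_numbers("300-400", 3))  # [350, 351, 351]
```

With an odd group size the extra record lands in the "plus 1" half, which matches the desired result for (300-400, 2nd).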
groups = df.groupby(['BlockRange', 'Street'])

# Calculate the midpoint of a block range such as '100-150'
def get_mid(x):
    block_nums = [int(y) for y in x.split('-')]
    return sum(block_nums) // len(block_nums)  # integer division

pieces = []
for groupkey, group in groups:
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning
    block_mid = get_mid(groupkey[0])
    halfway_point = len(group) // 2  # slice bounds must be integers
    group['Block'] = block_mid
    group.iloc[halfway_point:, group.columns.get_loc('Block')] = block_mid + 1
    pieces.append(group)
final = pd.concat(pieces)  # DataFrame.append was removed in pandas 2.0
final
BlockRange Street Block
0 100-150 Main 125
1 100-150 Main 125
2 100-150 Main 126
3 100-150 Main 126
4 200-300 Spruce 250
5 200-300 Spruce 251
6 300-400 2nd 350
7 300-400 2nd 351
8 300-400 2nd 351
Any suggestions as to how I can do this more efficiently? Perhaps using groupby transform?
You can use apply with a custom function f:
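def f(x):
    # Split each 'low-high' string into two integer columns
    bounds = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist()])
    bounds = bounds.astype(int)
    block_nums = bounds.sum(axis=1) // 2  # midpoint of each range
    x['Block'] = block_nums[0]
    halfway_point = len(x) // 2
    # Column 2 is the new 'Block' column; bump the second half by 1
    x.iloc[halfway_point:, 2] = block_nums[0] + 1
    return x

# group_keys=False keeps the original row index on recent pandas versions
print(df.groupby(['BlockRange', 'Street'], group_keys=False).apply(f))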
BlockRange Street Block
0 100-150 Main 125
1 100-150 Main 125
2 100-150 Main 126
3 100-150 Main 126
4 200-300 Spruce 250
5 200-300 Spruce 251
6 300-400 2nd 350
7 300-400 2nd 351
8 300-400 2nd 351
Timings:
In [32]: %timeit orig(df)
__main__:26: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:27: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:28: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
1 loops, best of 3: 290 ms per loop
In [33]: %timeit new(df)
100 loops, best of 3: 10.2 ms per loop
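`%timeit` is an IPython magic. Outside IPython, a comparable measurement could be sketched with the standard `timeit` module. The helper `best_time` and the `toy_workload` stand-in are hypothetical names used only for illustration; in practice the workload would be `orig(df)` or `new(df)`.

```python
import timeit


def best_time(func, *args, repeat=3, number=10):
    """Return the best average seconds per call over `repeat` runs."""
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(totals) / number


def toy_workload(n):
    # Stand-in for orig(df) or new(df); replace with the real call.
    return sum(i * i for i in range(n))


seconds = best_time(toy_workload, 10_000)
print(f"{seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%timeit` reports above.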
Testing:
print(df)
df1 = df.copy()

def orig(df):
    # Loop from the question, kept as it was timed: the chained assignments
    # below are what trigger the SettingWithCopyWarning output, and
    # DataFrame.append requires pandas < 2.0.
    groups = df.groupby(['BlockRange', 'Street'])

    # Calculate the midpoint of a block range such as '100-150'
    def get_mid(x):
        block_nums = [int(y) for y in x.split('-')]
        return sum(block_nums) // len(block_nums)

    final = pd.DataFrame()
    for groupkey, group in groups:
        block_mid = get_mid(groupkey[0])
        halfway_point = len(group) // 2
        group['Block'] = 0
        group.iloc[:halfway_point]['Block'] = block_mid
        group.iloc[halfway_point:]['Block'] = block_mid + 1
        final = final.append(group)
    return final

def new(df):
    def f(x):
        bounds = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist()])
        bounds = bounds.astype(int)
        block_nums = bounds.sum(axis=1) // 2
        x['Block'] = block_nums[0]
        halfway_point = len(x) // 2
        x.iloc[halfway_point:, 2] = block_nums[0] + 1
        return x
    # group_keys=False keeps the original row index on recent pandas versions
    return df.groupby(['BlockRange', 'Street'], group_keys=False).apply(f)

print(orig(df))
print(new(df1))
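Printing both results only allows a visual comparison. A programmatic check could be sketched with `pd.testing.assert_frame_equal`. The two frames here are hand-built stand-ins (an assumption for the sake of a self-contained example); in practice they would be the return values of `orig(df)` and `new(df1)`.

```python
import pandas as pd

# Hand-built stand-ins for the two results (assumption: in practice these
# would be orig(df) and new(df1)).
result_a = pd.DataFrame({
    "BlockRange": ["200-300", "200-300"],
    "Street": ["Spruce", "Spruce"],
    "Block": [250, 251],
})
result_b = result_a.copy()
result_b["Block"] = result_b["Block"].astype(float)  # same values, different dtype

# Raises AssertionError on any mismatch in values, index or columns;
# check_dtype=False tolerates int vs float columns.
pd.testing.assert_frame_equal(result_a, result_b, check_dtype=False)
frames_match = True
print("frames match")
```

`check_dtype=False` is used because the two approaches may produce the 'Block' column with different numeric dtypes even when the values agree.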
For comparison, note that you can do this without apply:
ss = df["BlockRange"].str.split("-")
midnum = (ss.str[1].astype(float) + ss.str[0].astype(float))//2
grouped = df.groupby(["BlockRange", "Street"])
df["Block"] = midnum + (grouped.cumcount()>= grouped["Street"].transform(len) // 2)
which gives me
>>> df
BlockRange Street Block
0 100-150 Main 125
1 100-150 Main 125
2 100-150 Main 126
3 100-150 Main 126
4 200-300 Spruce 250
5 200-300 Spruce 251
6 300-400 2nd 350
7 300-400 2nd 351
8 300-400 2nd 351
This works because cumcount and transform(len) give us the pieces we need:
>>> grouped.cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 0
7 1
8 2
dtype: int64
>>> grouped.transform(len)
Block
0 4
1 4
2 4
3 4
4 2
5 2
6 3
7 3
8 3
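Putting those pieces together, here is a self-contained sketch of the same idea. It is a variation rather than the exact code above: it assumes integer block numbers are wanted, so it parses the range with `str.split(expand=True)` and uses `transform("size")` in place of `transform(len)`. The intermediate names (`bounds`, `position`, `group_size`, `second_half`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "BlockRange": ["100-150"] * 4 + ["200-300"] * 2 + ["300-400"] * 3,
    "Street": ["Main"] * 4 + ["Spruce"] * 2 + ["2nd"] * 3,
})

# Split 'low-high' into two integer columns and take the midpoint (rounded down)
bounds = df["BlockRange"].str.split("-", expand=True).astype(int)
midpoint = (bounds[0] + bounds[1]) // 2

grouped = df.groupby(["BlockRange", "Street"])
position = grouped.cumcount()                     # 0, 1, 2, ... within each group
group_size = grouped["Street"].transform("size")  # group length repeated per row

# Rows at or past the halfway position get midpoint + 1
second_half = position >= group_size // 2
df["Block"] = midpoint + second_half.astype(int)

print(df)
```

Because every step is a column-wise operation, no Python-level function is called per group.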