
Python custom aggregates - need a more efficient solution

I'm new to Python and I'm playing with a dataset of interest to help with my learning, in particular to get a better understanding of pandas and numpy.

My dataframe has over a million rows and I'm trying to create a custom bucket so I can find more interesting insights. My dataset looks like the following:

My DataTable:

Price    Postal_area    Purchase_Month
123000   SE22           2018_01
240000   GU22           2017_02
.
.
.

I want to group the data into price buckets of < 100000, 200k - 300k, 300k - 500k and 500k+. I then want to group by the price buckets, month and postal area. I'm getting stumped at creating the custom price bucket.

What I've tried to do is create a custom function:

def price_range(Price):
    if (Price <= 100000):
        return ("Low Value")
    elif (100000 < Price < 200000):
        return ("Medium Value")
    elif (200001 < Price < 500000):
        return ("Medium High")
    elif (Price > 500001):
        return ("High")
    else:
        return ("Undefined")


And then I am creating a new column in my dataset as follows:

for val in (my_table.Price):
    my_table["price_range"] = (price_range(val))

I should be able to create an agg from this, but it's an extremely slow process: it has already been running for over 30 minutes on a million or so rows and is still going!

I have tried creating custom buckets of data using numpy and pandas (pivot table, groupby, lambdas) but have not been able to figure out how to incorporate the custom bucket logic.

I looked at a few other answers, like the one below, but they didn't cover my particular custom needs: Efficient way to assign values from another column pandas df

Any help much appreciated!

Use the apply function to apply your custom function price_range to the Price column of my_table:

my_table['price_range']=my_table['Price'].apply(price_range)
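As a minimal sketch of the line above, here it is run end to end. The rows in my_table are made up for illustration (only the column names come from the question), and price_range is copied from the question unchanged.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
my_table = pd.DataFrame({
    "Price": [123000, 240000, 95000, 750000, 200000],
    "Postal_area": ["SE22", "GU22", "SE22", "GU22", "SE22"],
    "Purchase_Month": ["2018_01", "2017_02", "2018_01", "2017_02", "2018_01"],
})

def price_range(Price):
    # Copied from the question as-is, including its strict inequalities.
    if (Price <= 100000):
        return ("Low Value")
    elif (100000 < Price < 200000):
        return ("Medium Value")
    elif (200001 < Price < 500000):
        return ("Medium High")
    elif (Price > 500001):
        return ("High")
    else:
        return ("Undefined")

# apply calls price_range once per row and returns an aligned Series,
# so the new column is written in a single assignment.
my_table["price_range"] = my_table["Price"].apply(price_range)
print(my_table[["Price", "price_range"]])
```

The loop in the question assigns one scalar to the entire column on every iteration, so it rewrites a million-row column a million times and every row ends up with the label of the last price. Note also that, with the function as written, a price of exactly 200000 falls through to "Undefined" because none of the strict comparisons match it.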

If you want bins of equal width:

my_table['price_range']=pd.cut(my_table['Price'], bins = 4, labels = ['Low Value', 'Medium Value', 'Medium High', 'High'])
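To see which edges an integer bins value produces, a small sketch with made-up prices may help. It passes retbins=True so that pd.cut also returns the edges it computed; the sample values are assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical prices, chosen only to illustrate equal-width binning.
prices = pd.Series([50000, 120000, 260000, 450000], name="Price")

labels = ["Low Value", "Medium Value", "Medium High", "High"]

# With an integer, pd.cut splits the range min..max into equal-width bins.
binned, edges = pd.cut(prices, bins=4, labels=labels, retbins=True)

print(binned)
print(edges)  # 5 edges for 4 bins, derived from the data's min and max
```

Because the edges depend on the smallest and largest price in the data, they will generally not coincide with fixed cut-offs such as 100k, 200k and 500k. For fixed cut-offs, pass an explicit list of edges instead of an integer.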

You can try using pd.cut to cut the values into ranges and specify which labels to assign. Given df:

    Price
0   123000
1   240000
2   232455
3   343434343


pd.cut(df.Price,[0,100000,200000,500000,np.inf],labels=['Low_value','Medium Value','High','Undefined'])

Out:

0    Medium Value
1            High
2            High
3       Undefined
Name: Price, dtype: category
Categories (4, object): [Low_value < Medium Value < High < Undefined]
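Putting the pieces together for the grouping described in the question, here is a hedged sketch that buckets with pd.cut and then groups by bucket, month and postal area. The sample rows, the label names (borrowed from the question's function) and the aggregate column names n_sales and mean_price are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
my_table = pd.DataFrame({
    "Price": [123000, 240000, 95000, 750000, 130000, 260000],
    "Postal_area": ["SE22", "GU22", "SE22", "GU22", "SE22", "GU22"],
    "Purchase_Month": ["2018_01", "2017_02", "2018_01",
                       "2017_02", "2018_01", "2017_02"],
})

# Explicit edges; intervals are right-closed by default, e.g. (0, 100000].
bins = [0, 100000, 200000, 500000, np.inf]
labels = ["Low Value", "Medium Value", "Medium High", "High"]

# Vectorised bucketing: no Python-level loop over the rows.
my_table["price_range"] = pd.cut(my_table["Price"], bins=bins, labels=labels)

# observed=True keeps only bucket/month/area combinations present in the data.
summary = (
    my_table
    .groupby(["price_range", "Purchase_Month", "Postal_area"], observed=True)
    .agg(n_sales=("Price", "size"), mean_price=("Price", "mean"))
    .reset_index()
)
print(summary)
```

Since the intervals are right-closed, a price of exactly 100000 lands in "Low Value" and 200000 in "Medium Value". A price of 0 would fall outside the first interval and become NaN unless include_lowest=True is passed.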
