
Best way to calculate over a large data frame?

I am trying to find the best way to handle a dataset of around 80 million rows, over which I need to run some calculations. I have tried for loops, but they take practically forever.

My data looks like the sample below (individual taxi trips from one area to another, at a resolution of 15 minutes):

timestamp,           origin_area, destination_area
2014-01-27 11:00:00, 28.0,        32.0
2014-01-27 11:00:00, 28.0,        32.0
2013-01-01 01:00:00, 28.0,        1.0
2013-01-01 01:15:00, 28.0,        2.0

I need to derive some extra columns from this data, like this:

timestamp, origin_area, destination_area, (count of trips for that distinct origin-destination pair in that timestamp), (count of all trips from that origin area in that timestamp)
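
For the four sample rows above, the desired output would look like this (the last two column names are just illustrative):

timestamp,           origin_area, destination_area, od_trips, origin_trips
2014-01-27 11:00:00, 28.0,        32.0,             2,        2
2014-01-27 11:00:00, 28.0,        32.0,             2,        2
2013-01-01 01:00:00, 28.0,        1.0,              1,        1
2013-01-01 01:15:00, 28.0,        2.0,              1,        1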

What are my options for performing these calculations quickly and creating the additional columns above?

Thank you

I got groupby() and size() to do this:

# Count rows per distinct (timestamp, origin, destination) combination
df.groupby(['timestamp', 'origin_area', 'destination_area']).size() \
    .reset_index(name='Count') \
    .sort_values(by='timestamp', ascending=False).reset_index(drop=True)
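
The size() call above produces the per-origin-destination counts as a separate summary table. To attach both of the requested counts as new columns on every row of the original frame, a vectorized groupby().transform('size') avoids Python-level loops entirely. A minimal sketch, assuming df holds the question's three columns and using the hypothetical column names od_trips and origin_trips:

import pandas as pd

# Minimal sketch: df stands in for the real 80-million-row frame.
df = pd.DataFrame({
    'timestamp': ['2014-01-27 11:00:00', '2014-01-27 11:00:00',
                  '2013-01-01 01:00:00', '2013-01-01 01:15:00'],
    'origin_area': [28.0, 28.0, 28.0, 28.0],
    'destination_area': [32.0, 32.0, 1.0, 2.0],
})

# Trips for this distinct (timestamp, origin, destination) combination,
# broadcast back onto every row of the group.
df['od_trips'] = df.groupby(
    ['timestamp', 'origin_area', 'destination_area'])['origin_area'].transform('size')

# All trips leaving this origin area in this timestamp.
df['origin_trips'] = df.groupby(
    ['timestamp', 'origin_area'])['origin_area'].transform('size')

print(df)

Because transform() broadcasts each group's result back onto the original index, both columns line up row-for-row with the input, and the whole computation stays inside pandas' compiled internals instead of a Python for loop.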


;)
