
Best way to calculate over a large data frame?

I am trying to find the best way to handle a dataset of around 80 million rows, over which I need to run some calculations. I have tried for loops, but they take practically forever.

My data looks like the sample below (individual taxi trips from one area to another, at a resolution of 15 minutes):

timestamp,           origin_area, destination_area
2014-01-27 11:00:00, 28.0,        32.0
2014-01-27 11:00:00, 28.0,        32.0
2013-01-01 01:00:00, 28.0,        1.0
2013-01-01 01:15:00, 28.0,        2.0

I need to derive some extra columns from this data, like this:

timestamp, origin_area, destination_area, (count of trips for that distinct origin-destination pair in that timestamp), (count of all trips from that origin area in that timestamp)
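
For the four sample rows above, the desired output would look like this (the last two column names are just illustrative):

timestamp,           origin_area, destination_area, od_trips, origin_trips
2014-01-27 11:00:00, 28.0,        32.0,             2,        2
2014-01-27 11:00:00, 28.0,        32.0,             2,        2
2013-01-01 01:00:00, 28.0,        1.0,              1,        1
2013-01-01 01:15:00, 28.0,        2.0,              1,        1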

What are my options for performing these calculations quickly and creating the additional columns above?

Thank you

I got groupby() and size() to do this:

# Count rows per distinct (timestamp, origin, destination) combination
df.groupby(['timestamp', 'origin_area', 'destination_area']).size() \
    .reset_index(name='Count') \
    .sort_values(by='timestamp', ascending=False).reset_index(drop=True)
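
The size() call above produces the per-origin-destination counts as a separate summary table. To attach both of the requested counts as new columns on every row of the original frame, a vectorized groupby().transform('size') avoids Python-level loops entirely. A minimal sketch, assuming df holds the question's three columns and using the hypothetical column names od_trips and origin_trips:

import pandas as pd

# Minimal sketch: df stands in for the real 80-million-row frame.
df = pd.DataFrame({
    'timestamp': ['2014-01-27 11:00:00', '2014-01-27 11:00:00',
                  '2013-01-01 01:00:00', '2013-01-01 01:15:00'],
    'origin_area': [28.0, 28.0, 28.0, 28.0],
    'destination_area': [32.0, 32.0, 1.0, 2.0],
})

# Trips for this distinct (timestamp, origin, destination) combination,
# broadcast back onto every row of the group.
df['od_trips'] = df.groupby(
    ['timestamp', 'origin_area', 'destination_area'])['origin_area'].transform('size')

# All trips leaving this origin area in this timestamp.
df['origin_trips'] = df.groupby(
    ['timestamp', 'origin_area'])['origin_area'].transform('size')

print(df)

Because transform() broadcasts each group's result back onto the original index, both columns line up row-for-row with the input, and the whole computation stays inside pandas' compiled internals instead of a Python for loop.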


;)
