I am trying to find the best way to handle a dataset of around 80 million rows. I need to run some calculations over this data. I have tried
for loops, but they take forever.
My data looks like this (individual taxi trips from one area to another, at a resolution of 15 minutes):
timestamp, origin_area, destination_area
2014-01-27 11:00:00, 28.0, 32.0
2014-01-27 11:00:00, 28.0, 32.0
2013-01-01 01:00:00, 28.0, 1.0
2013-01-01 01:15:00, 28.0, 2.0
I need to add columns to this data so that it looks like this:
timestamp, origin_area, destination_area, (number of trips for each distinct origin-destination pair at that timestamp), (number of all trips from the origin area at that timestamp)
What are my options for running these calculations quickly and creating the additional columns described above?
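To make the target concrete, here is a minimal sketch of the two columns on the four sample rows. It assumes the data is loaded into a pandas DataFrame (pandas is an assumption, as are the column names od_trips and origin_trips, which are made up for illustration). It uses a grouped transform instead of a for loop, and keeps one row per trip:

```python
import io

import pandas as pd

# The four sample rows from above, read as CSV for illustration.
sample = io.StringIO(
    "timestamp,origin_area,destination_area\n"
    "2014-01-27 11:00:00,28.0,32.0\n"
    "2014-01-27 11:00:00,28.0,32.0\n"
    "2013-01-01 01:00:00,28.0,1.0\n"
    "2013-01-01 01:15:00,28.0,2.0\n"
)
df = pd.read_csv(sample, parse_dates=["timestamp"])

# Trips per (timestamp, origin, destination); hypothetical column name.
od_keys = ["timestamp", "origin_area", "destination_area"]
df["od_trips"] = df.groupby(od_keys)["origin_area"].transform("size")

# All trips leaving the origin area at that timestamp; hypothetical column name.
origin_keys = ["timestamp", "origin_area"]
df["origin_trips"] = df.groupby(origin_keys)["destination_area"].transform("size")

print(df)
```

If one row per origin-destination pair is wanted instead of one row per trip, calling df.drop_duplicates() afterwards would collapse the repeated rows.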
Thank you