Best way to do calculations on a large DataFrame?

Question

I am trying to find the best way to handle a dataset of around 80 million rows. I need to run some calculations over this data. I have tried for loops, but they take forever.

I have data as below (individual taxi trips from one area to another, at a resolution of 15 minutes):

timestamp,           origin_area, destination_area
2014-01-27 11:00:00, 28.0,        32.0
2014-01-27 11:00:00, 28.0,        32.0
2013-01-01 01:00:00, 28.0,        1.0
2013-01-01 01:15:00, 28.0,        2.0

I need to convert this data into columns like this:

timestamp, origin_area, destination_area, (sum of trips for each distinct origin-destination pair in that timestamp), (sum of all trips from the origin area in that timestamp)

What are my options for handling these calculations quickly and creating the additional columns described above?

Thank you

Answer 1

I used groupby() and size() to do this.

(df.groupby(['timestamp', 'origin_area', 'destination_area'])
   .size().reset_index(name='Count')
   .sort_values(by='timestamp', ascending=False)
   .reset_index(drop=True))

;)
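The question also asks for a second column: all trips leaving the origin area in each timestamp. Here is a minimal sketch of one way to get both columns without a Python loop, assuming the column names from the question. The output names pair_trips and origin_trips are made up for illustration, and the last sample row is hypothetical, added only so the two counts differ.

```python
import pandas as pd

# Small sample mirroring the rows in the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-01-27 11:00:00",
        "2014-01-27 11:00:00",
        "2013-01-01 01:00:00",
        "2013-01-01 01:15:00",
        "2014-01-27 11:00:00",  # hypothetical extra row, not in the question
    ]),
    "origin_area": [28.0, 28.0, 28.0, 28.0, 28.0],
    "destination_area": [32.0, 32.0, 1.0, 2.0, 1.0],
})

# Trips per distinct (timestamp, origin, destination) combination.
pair_counts = (
    df.groupby(["timestamp", "origin_area", "destination_area"])
      .size()
      .reset_index(name="pair_trips")
)

# Trips per (timestamp, origin), broadcast back onto every pair row.
pair_counts["origin_trips"] = (
    pair_counts.groupby(["timestamp", "origin_area"])["pair_trips"]
               .transform("sum")
)

print(pair_counts)
```

Summing the already aggregated pair counts keeps the second groupby small, which should matter with 80 million input rows. I have not timed this on data of that size.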

Solution 1
0 2018-12-25 13:38:34
