A fast, efficient way to calculate time differences between groups of rows in pandas?

Let's say I have this table in a DataFrame, with the dates on which several cars were refilled:

+-------+-------------+
| carId | refill_date |
+-------+-------------+
|     1 |  2020-03-01 |
+-------+-------------+
|     1 |  2020-03-12 |
+-------+-------------+
|     1 |  2020-04-04 |
+-------+-------------+
|     2 |  2020-03-07 |
+-------+-------------+
|     2 |  2020-03-26 |
+-------+-------------+
|     2 |  2020-04-01 |
+-------+-------------+

I'd like to add a third column, time_elapsed, containing the duration between each refill.

+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
|     1 |  2020-03-01 |              |
+-------+-------------+--------------+
|     1 |  2020-03-12 |           11 |
+-------+-------------+--------------+
|     1 |  2020-04-04 |           23 |
+-------+-------------+--------------+
|     2 |  2020-03-07 |              |
+-------+-------------+--------------+
|     2 |  2020-03-26 |           19 |
+-------+-------------+--------------+
|     2 |  2020-04-01 |            6 |
+-------+-------------+--------------+

So here's what I do:

import pandas as pd

data = [
    {
        "carId": 1,
        "refill_date": "2020-3-1"
    },
    {
        "carId": 1,
        "refill_date": "2020-3-12"
    },
    {
        "carId": 1,
        "refill_date": "2020-4-4"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-7"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-26"
    },
    {
        "carId": 2,
        "refill_date": "2020-4-1"
    }
]

df = pd.DataFrame(data)

df['refill_date'] = pd.to_datetime(df['refill_date'])

for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()

This returns the expected result:

+---+-------+-------------+--------------+
|   | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 |     1 |  2020-03-01 |          NaT |
+---+-------+-------------+--------------+
| 1 |     1 |  2020-03-12 |      11 days |
+---+-------+-------------+--------------+
| 2 |     1 |  2020-04-04 |      23 days |
+---+-------+-------------+--------------+
| 3 |     2 |  2020-03-07 |          NaT |
+---+-------+-------------+--------------+
| 4 |     2 |  2020-03-26 |      19 days |
+---+-------+-------------+--------------+
| 5 |     2 |  2020-04-01 |       6 days |
+---+-------+-------------+--------------+

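So everything looks fine, but there's a catch: in my real-life case the dataframe contains 3.5 million rows, and processing takes ages, even though it is a fully numeric, in-memory calculation with "only" 1,711 groups to loop through.

Is there an alternative, faster way?

Thanks!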

Using the native df.groupby method should give a significant performance boost over the "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
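
If whole-day integers are wanted, as in the question's target table, the Timedelta result can be converted with the .dt.days accessor. A minimal sketch, assuming the column names from the question; the cast to the nullable Int64 dtype is an optional choice made here so that the first refill of each car stays missing instead of turning the column into floats:

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

# Per-car difference between consecutive refill dates
# (Timedelta values, NaT for the first row of each car)
elapsed = df.groupby("carId")["refill_date"].diff()

# Keep only the whole-day component, stored as a nullable integer
df["time_elapsed"] = elapsed.dt.days.astype("Int64")
print(df)
```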

Here is a small benchmark (on my laptop, YMMV...) using 99 cars (range(1, 100)) with 31 days each, showing an almost 10x performance boost:

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output:

16.6183732
1.7910263000000022
9.278687420726307
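
One caveat: diff compares each row with the row physically before it in the same group, so the one-liner relies on the refills being in chronological order within each car, as they are in the question. If that is not guaranteed, sorting first is a cheap safeguard. A sketch using the question's data in a deliberately shuffled (made-up) row order:

```python
import pandas as pd

# Same refills as in the question, but in an arbitrary row order
df = pd.DataFrame({
    "carId": [2, 1, 1, 2, 1, 2],
    "refill_date": pd.to_datetime([
        "2020-04-01", "2020-03-12", "2020-03-01",
        "2020-03-07", "2020-04-04", "2020-03-26",
    ]),
})

# Order rows chronologically within each car before taking differences
df = df.sort_values(["carId", "refill_date"], ignore_index=True)
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()
print(df)
```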

You just need to use .groupby:

df['time_elapsed'] = df.groupby('carId').diff()

output:

  refill_date
0         NaT
1     11 days
2     23 days
3         NaT
4     19 days
5      6 days
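
The call above diffs every non-key column, which works here because refill_date is the only one. With more columns in the frame, selecting the column explicitly keeps the result a single Series. A sketch, assuming a hypothetical extra column named liters that is not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
    # Hypothetical extra column, for illustration only
    "liters": [40.0, 35.5, 42.0, 30.0, 28.5, 33.0],
})

# Diffing the whole group returns one column per non-key column
all_diffs = df.groupby("carId").diff()
print(list(all_diffs.columns))

# Selecting the column first yields a single Series to assign
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()
print(df)
```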

Get time_elapsed by using shift and subtracting from refill_date:

(
    df.assign(
        refill_date=pd.to_datetime(df.refill_date),
        time_shift=lambda x: x.groupby("carId").refill_date.shift(),
        time_elapsed=lambda x: x.time_shift.sub(x.refill_date).abs(),
    )
)

The other answers using diff are better, since they are more concise and, I'd like to believe, faster.
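
To make the relationship between the two approaches concrete: a grouped diff is the current value minus the grouped shift, so subtracting in that order removes the need for .abs(). A small sketch on the question's data that compares the two results (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

grouped = df.groupby("carId")["refill_date"]

# Current refill date minus the previous refill date of the same car
via_shift = df["refill_date"] - grouped.shift()

# The same computation expressed as a grouped diff
via_diff = grouped.diff()

# Check that both Series hold the same values, NaT positions included
print(via_shift.equals(via_diff))
```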
