In Pandas, how can I use regex to break a human-readable time format into separate units such as days, hours, minutes, and seconds?

Question

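I have a DataFrame with a column of durations in a human-readable format, e.g. "29 days 4 hours 32 minutes 1 second". I want to break it down into days, hours, minutes, and seconds columns whose values are derived from the duration column, e.g. 29 for days, 4 for hours, 32 for minutes, 1 for seconds. I have tried the following, but it does not work correctly: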

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

For example,

import pandas as pd

# Sample data; named `data` so the built-in `list` is not shadowed
data = {'id': ['123','124','125','126','127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018','1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(data)

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

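This results in the following DataFrame:

The new DataFrame has only the first value for each row and none of the rest. It captured the 29 days, and the 1, 3, 4, 2 hours, but the subsequent column values are NaN.

Ideally, the DataFrame should look like this:

I have a feeling something is wrong with my regex. Should I not be using "|"? Should I separate the groups? Any help or push in the right direction is appreciated.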

Answer 1

Your string format matches the string format accepted by pd.Timedelta. Convert directly to Timedelta and use its components attribute:

df_final = (df.duration.map(pd.Timedelta)
              .dt.components[['days','hours','minutes','seconds']])

Or

df_final = (pd.to_timedelta(df.duration)
              .dt.components[['days','hours','minutes','seconds']])

Out[258]:
   days  hours  minutes  seconds
0    29      4       32        0
1     0      1       23        0
2     0      3        2        1
3     0      4       46       22
4     0      2        1        0
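
One thing worth knowing about this approach: components reports normalized values, so an input such as "90 minutes" should come back as 1 hour and 30 minutes rather than 90 in the minutes column. A minimal sketch of that behaviour, using made-up sample strings that are not from the question:

```python
import pandas as pd

# Hypothetical sample durations (not from the question) to show normalization.
durations = pd.Series(['90 minutes', '36 hours', '3 hours 2 minutes 1 second'])

# Parse to timedelta64, then split into per-unit integer columns.
parts = (pd.to_timedelta(durations)
           .dt.components[['days', 'hours', 'minutes', 'seconds']])
print(parts)
```

If the original, un-normalized numbers are needed (90 staying 90), a regex-based approach is probably the better fit.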

Answer 2

Here is my approach, using extractall instead of extract:

# same pattern as yours
# can replace this with a for loop
# raw strings so that \d and \s are not treated as string escapes
pattern = ( r'(?P<days>\d+)(?= days?\s*)|'        # days
          + r'(?P<hours>\d+)(?= hours?\s*)|'      # hours
          + r'(?P<minutes>\d+)(?= minutes?\s*)|'  # minutes
          + r'(?P<seconds>\d+)(?= seconds?\s*)'   # seconds
          )

(df.duration.str.extractall(pattern)   # extract all with regex
  .reset_index('match',drop=True)      # merge the matches of the same row
  .stack()
  .unstack(level=-1, fill_value=0)     # remove fill_value if you want NaN instead of 0
)

Output:

  days hours minutes seconds
0   29     4      32       0
1    0     1      23       0
2    0     3       2       1
3    0     4      46      22
4    0     2       1       0
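
For context on why the pattern in the question fills only one column per row: str.extract returns just the first match, and with "|" alternation only one named group takes part in any single match. If you would rather stay with str.extract, one possible alternative is to make every unit an optional group so that a single match spans the whole string. This is only a sketch, and it assumes the units always appear in the order days, hours, minutes, seconds and are spelled out as in the sample data:

```python
import pandas as pd

df = pd.DataFrame({'duration': ['29 days 4 hours 32 minutes',
                                '1 hour 23 minutes',
                                '3 hours 2 minutes 1 second',
                                '4 hours 46 minutes 22 seconds',
                                '2 hours 1 minute']})

# Each unit sits in an optional non-capturing group, so one match covers
# the whole string and every named group can be filled at once.
pattern = (r'^\s*(?:(?P<days>\d+) days?\s*)?'
           r'(?:(?P<hours>\d+) hours?\s*)?'
           r'(?:(?P<minutes>\d+) minutes?\s*)?'
           r'(?:(?P<seconds>\d+) seconds?\s*)?$')

units = (df['duration'].str.extract(pattern)
           .fillna('0')     # missing units become '0'
           .astype(int))    # extracted values are strings; convert to integers
print(units)
```

Unlike the extractall version, this will not pick up units that appear out of order or use abbreviations such as "min" or "sec".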

2 solutions

Solution 1
3 2020-03-08 20:42:01

Solution 2
1 Accepted 2020-03-08 20:34:48
