Problem with a pandas column that mixes time and datetime values

Question

I have a column that came from Excel and is supposed to contain durations (in hours), for example 02:00:00.
It works well if all these durations are less than 24:00, but if one is more than that, it appears in pandas as 1900-01-03 08:00:00 (so a datetime), and as a result the column's datatype is dtype('O').

import datetime
import pandas as pd

df = pd.DataFrame({'duration':[datetime.time(2, 0), datetime.time(2, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0), datetime.time(1, 0),
       datetime.time(1, 0)]})

# Output
    duration
0   02:00:00
1   02:00:00
2   1900-01-03 08:00:00
3   1900-01-03 08:00:00
4   1900-01-03 08:00:00
5   1900-01-03 08:00:00
6   1900-01-03 08:00:00
7   1900-01-03 08:00:00
8   01:00:00
9   01:00:00

But if I try to convert to either time or datetime I always get an error:

TypeError: <class 'datetime.time'> is not convertible to datetime

As it stands, if I don't fix this, all the durations greater than 24:00 are lost.

Answer 1

IIUC, use pd.to_timedelta:

Set up an MRE:

import pandas as pd

df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)

# Output
   duration
0  43:24:57
1  22:12:52
2         -
3  78:41:33

df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)

# Output
         duration
0 1 days 19:24:57
1 0 days 22:12:52
2             NaT
3 3 days 06:41:33
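
Since the question asks for durations in hours, the timedelta column can be turned into a plain number afterwards. A minimal sketch, reusing the MRE above; the column name hours is only an illustration and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})

# parse the strings; anything that is not a duration (here '-') becomes NaT
td = pd.to_timedelta(df['duration'], errors='coerce')

# total duration as a float number of hours; NaT rows become NaN
df['hours'] = td.dt.total_seconds() / 3600
print(df)
```

Going through total_seconds() keeps the whole days in the result, so durations above 24 hours are not truncated.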

Answer 2

Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
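
To see how an 80-hour duration can surface as 1900-01-03 08:00:00, here is a minimal sketch of the arithmetic. It assumes the value is counted from the reference date 1899-12-31, which is the date used in the workaround later in this answer; the variable names are illustrative, and this is not the reader engine's actual code:

```python
import datetime

# assumed reference date (taken from the workaround later in this answer)
reference = datetime.datetime(1899, 12, 31)

# a duration of 80 hours, i.e. 3 days and 8 hours
duration = datetime.timedelta(hours=80)

# what the cell looks like once it is interpreted as a point in time
as_read = reference + duration
print(as_read)  # 1900-01-03 08:00:00
```

This has only been checked against short durations like the ones in the question.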

Before you start hacking the Excel reader engine, it might be easier to tackle the issue in Excel. Here's a small sample file with three columns: duration is auto-formatted by Excel; duration_text is what you get if you set the column format to 'text' before you enter the values; duration_to_text is what you get if you change the format to text after Excel auto-formatted the values (first column).

Now you have everything you need after import with pandas:

df = pd.read_excel('path_to_file')

df
              duration duration_text  duration_to_text
0             12:30:00      12:30:00          0.520833
1  1900-01-01 00:30:00      24:30:00          1.020833

# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'])
0   0 days 12:30:00
1   1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]

# or
pd.to_timedelta(df['duration_to_text'], unit='d') 
0   0 days 12:29:59.999971200                     # note the precision issue ;-)
1   1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
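
The precision issue noted above comes from the day fractions being stored with only six decimals. If whole seconds are enough, one way to deal with it is to round after the conversion. A minimal sketch, with the two values from the sample hard-coded as an assumption instead of being read from the file:

```python
import pandas as pd

# day fractions as they appear in the duration_to_text column
fractions = pd.Series([0.520833, 1.020833], name='duration_to_text')

# convert days to timedelta, then round to the nearest second
td = pd.to_timedelta(fractions, unit='D').dt.round('s')
print(td)
```

Rounding to seconds only makes sense if the durations were entered with second resolution at most; finer values would be lost.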

Another viable option could be to save the Excel file as a csv and import that to a pandas DataFrame. The sample xlsx used above would then look like this for example.

If you have no other option than to re-process in pandas, an option could be to treat datetime.time objects and datetime.datetime objects specifically, e.g.

import datetime

# boolean mask: True where you have datetime (incorrect from excel)
# a list is used so the mask can be reused; a generator would be exhausted after one pass
m = [isinstance(i, datetime.datetime) for i in df['duration']]

# convert to timedelta where it's possible; the datetime strings become NaT for now
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')

# where you have datetime, take the offset from the reference date 1899-12-31
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))

df['timedelta'] 
0   0 days 12:30:00
1   1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
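
The same idea can be written as one per-cell function without the round trip through strings. This is only a sketch under the same assumption as above (datetimes are offsets from 1899-12-31); the names EXCEL_REFERENCE and cell_to_timedelta are made up for this example, and the data is the shortened column from the question:

```python
import datetime
import pandas as pd

# assumed reference date, same as in the snippet above
EXCEL_REFERENCE = datetime.datetime(1899, 12, 31)

def cell_to_timedelta(value):
    """Turn one cell (datetime.time or datetime.datetime) into a Timedelta."""
    if isinstance(value, datetime.datetime):
        # duration of 24 h or more: offset from the reference date
        return pd.Timedelta(value - EXCEL_REFERENCE)
    if isinstance(value, datetime.time):
        # duration below 24 h: build it from the clock fields
        return pd.Timedelta(hours=value.hour, minutes=value.minute,
                            seconds=value.second, microseconds=value.microsecond)
    # anything else is treated as missing
    return pd.NaT

df = pd.DataFrame({'duration': [datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.time(1, 0)]})

df['timedelta'] = pd.to_timedelta(df['duration'].map(cell_to_timedelta))
print(df['timedelta'])
```

With the data from the question, the 1900-01-03 08:00:00 rows come out as 3 days 08:00:00, i.e. 80 hours.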
