Pandas problem with a column with mixed time and datetime
I have a column that came from Excel, that is supposed to contain durations (in hours) - example: 02:00:00
It works well if all these durations are less than 24:00, but if one is more than that, it appears in pandas as 1900-01-03 08:00:00 (so a datetime). As a result, the column's data type is dtype('O').
import datetime
import pandas as pd

# Mixed column as it arrives from Excel: time objects below 24h, datetime objects above
df = pd.DataFrame({'duration': [datetime.time(2, 0), datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0), datetime.time(1, 0),
                                datetime.time(1, 0)]})
print(df)
# Output
duration
0 02:00:00
1 02:00:00
2 1900-01-03 08:00:00
3 1900-01-03 08:00:00
4 1900-01-03 08:00:00
5 1900-01-03 08:00:00
6 1900-01-03 08:00:00
7 1900-01-03 08:00:00
8 01:00:00
9 01:00:00
But if I try to convert the column to either time or datetime, I always get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
As it stands, if I don't fix this, all the durations greater than 24:00 are lost.
IIUC, use pd.to_timedelta:
df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)
# Output
duration
0 43:24:57
1 22:12:52
2 -
3 78:41:33
df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)
# Output
duration
0 1 days 19:24:57
1 0 days 22:12:52
2 NaT
3 3 days 06:41:33
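Since the question describes these durations in hours, the timedelta column can be turned into a plain number of hours afterwards. A minimal sketch, reusing the sample strings above (the variable names `durations` and `hours` are only illustrative):

```python
import pandas as pd

# Same sample strings as above; '-' cannot be parsed and becomes NaT
durations = pd.to_timedelta(
    pd.Series(['43:24:57', '22:12:52', '-', '78:41:33']), errors='coerce'
)

# Total seconds divided by 3600 gives fractional hours; NaT becomes NaN
hours = durations.dt.total_seconds() / 3600
print(hours)
```

Rows that could not be parsed stay missing (NaN) rather than silently turning into zero.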
Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
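To confirm that this is what happened to a given column, it can help to count the Python types stored in it. A small sketch, assuming a mixed column like the one in the question (the name `type_counts` is just for illustration):

```python
import datetime
import pandas as pd

# A shortened version of the mixed column from the question
df = pd.DataFrame({'duration': [datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.time(1, 0)]})

# Count how many cells hold each Python type
type_counts = df['duration'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If both `time` and `datetime` show up, the column contains durations above and below 24 hours that the reader engine treated differently.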
Before you start hacking the Excel reader engine, it might be easier to tackle the issue in Excel. Here's a small sample file; you can download it here.
duration is auto-formatted by Excel, duration_text is what you get if you set the column format to 'text' before you enter the values, and duration_to_text is what you get if you change the format to text after Excel auto-formatted the values (first column).
Now you have everything you need after import with pandas:
df = pd.read_excel('path_to_file')
df
duration duration_text duration_to_text
0 12:30:00 12:30:00 0.520833
1 1900-01-01 00:30:00 24:30:00 1.020833
# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'])
0 0 days 12:30:00
1 1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]
# or
pd.to_timedelta(df['duration_to_text'], unit='d')
0 0 days 12:29:59.999971200 # note the precision issue ;-)
1 1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
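The precision issue comes from the fractional-day floats, and one way to deal with it is to round the result to whole seconds. A minimal sketch, assuming the values are known to be whole seconds in the spreadsheet; the floats are typed in from the printed output above rather than read from the file:

```python
import pandas as pd

# Fractional days as shown in the duration_to_text column
fractional_days = pd.Series([0.520833, 1.020833])

# Convert days to timedelta, then round away the float noise
rounded = pd.to_timedelta(fractional_days, unit='d').dt.round('s')
print(rounded)
```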
Another viable option could be to save the Excel file as a csv and import that to a pandas DataFrame. The sample xlsx used above would then look like this for example.
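As a rough sketch of the CSV route: the CSV text below is an assumption about what such an export might contain (durations written out as h:mm:ss text), not the actual exported file, and it is read from an in-memory string instead of a path.

```python
import io
import pandas as pd

# Assumed CSV content: durations as plain text, including one above 24 hours
csv_text = "duration\n12:30:00\n24:30:00\n"

df_csv = pd.read_csv(io.StringIO(csv_text))

# The strings parse directly, including hour values of 24 or more
df_csv['duration'] = pd.to_timedelta(df_csv['duration'])
print(df_csv)
```

Because the values arrive as strings, no reader engine gets the chance to reinterpret them as times or dates.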
If you have no other option than to re-process in pandas, an option could be to treat datetime.time objects and datetime.datetime objects specifically, e.g.
import datetime
# where you have datetime (incorrect from excel)
# (a list, not a generator, so the mask can be reused for both .loc calls below)
m = [isinstance(i, datetime.datetime) for i in df['duration']]
# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')
# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
df['timedelta']
0 0 days 12:30:00
1 1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
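The same idea can be wrapped in a per-value function and applied to the mixed column from the question. This is only a sketch: `excel_value_to_timedelta` and `EXCEL_EPOCH` are hypothetical names, the 1899-12-31 reference date is taken from the snippet above, and it has not been checked for very long durations (around 60 days or more), where Excel's 1900 leap-year quirk may shift the offset.

```python
import datetime
import pandas as pd

# Reference date used in the answer above (assumed valid for short durations)
EXCEL_EPOCH = pd.Timestamp('1899-12-31')

def excel_value_to_timedelta(value):
    """Hypothetical helper: map one cell value to a pandas Timedelta."""
    if isinstance(value, datetime.datetime):
        # Durations of 24h or more arrive as datetimes counted from the epoch
        return pd.Timestamp(value) - EXCEL_EPOCH
    if isinstance(value, datetime.time):
        # Durations below 24h arrive as plain time objects
        return pd.Timedelta(hours=value.hour, minutes=value.minute,
                            seconds=value.second, microseconds=value.microsecond)
    # Anything else (e.g. strings) is parsed if possible, otherwise NaT
    return pd.to_timedelta(value, errors='coerce')

df_mixed = pd.DataFrame({'duration': [datetime.time(2, 0),
                                      datetime.datetime(1900, 1, 3, 8, 0),
                                      datetime.time(1, 0)]})
df_mixed['timedelta'] = pd.to_timedelta(
    df_mixed['duration'].apply(excel_value_to_timedelta)
)
print(df_mixed['timedelta'])
```

With this reference date, the 1900-01-03 08:00:00 cells from the question come out as 3 days 08:00:00, i.e. 80 hours.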