How can I replace a timestamp in a dataframe with a string if the column is not all timestamps?
I am trying to build a machine learning model using an Excel spreadsheet that cannot be edited. A few of the columns in the .xls have formatting issues, so some of the data is read as a datetime stamp instead of a str or int. Here is an example from the pandas dataframe:
0 40-49 premeno 15-19 0-2 yes 3
1 50-59 ge40 15-19 0-2 no 1
2 50-59 ge40 35-39 0-2 no 2
3 40-49 premeno 35-39 0-2 yes 3
4 40-49 premeno 30-34 **2019-05-03 00:00:00** yes 2
In row 4, the value 3-5 has been accidentally formatted as a date (shown as 03 May in the xls), so it ends up as a datetime stamp in the dataframe. I have tried many methods to replace 2019-05-03 00:00:00 with 3-5, including:
df['column'] = df['column'].replace([('2019-05-03 00:00:00')], '3-5')
and using Timestamp.replace, but neither seems to work. Any ideas on how to replace these misformatted data points with the correct data?
There might be a simpler way, but you may need to apply re.search with positive lookarounds.
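import re

# Lookarounds pick out the day and month digits of a stringified timestamp
# such as '2019-05-03 00:00:00'; both are assumed to be below 10.
# The ** in the question appears to be display emphasis, so it is not matched here.
pat1 = r'(?<=\d{4}-0\d-0)(\d)(?= 00:00:00)'   # day digit
pat2 = r'(?<=\d{4}-0)(\d)(?=-0\d 00:00:00)'   # month digit
df['column'] = df['column'].astype(str).apply(
    lambda x: (re.search(pat1, x).group()
               + '-' + re.search(pat2, x).group())
    if re.search(pat1, x) and re.search(pat2, x) else x
)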
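A possibly simpler variant of the same regex idea, sketched under two assumptions: the misformatted cells stringify to the form YYYY-MM-DD 00:00:00, and the intended value is always day-month. It uses one pattern with capture groups and lets int() drop the leading zeros, so two-digit days and months are handled too. The names DATE_RE and date_text_to_range and the sample Series are made up for illustration.

```python
import re
import pandas as pd

# Assumed shape of a stringified timestamp: 'YYYY-MM-DD 00:00:00'
DATE_RE = re.compile(r'^(\d{4})-(\d{2})-(\d{2}) 00:00:00$')

def date_text_to_range(text):
    """Turn '2019-05-03 00:00:00' into '3-5'; return anything else unchanged."""
    m = DATE_RE.match(text)
    if m is None:
        return text
    _, month, day = m.groups()
    # int() strips the zero padding: '03' -> 3, '05' -> 5
    return f"{int(day)}-{int(month)}"

# Hypothetical mixed column, similar to the one in the question
s = pd.Series(
    ['0-2', pd.Timestamp('2019-05-03'), '15-19', pd.Timestamp('2019-11-09')],
    dtype=object,
)
out = s.astype(str).map(date_text_to_range)
print(out.tolist())
```

Note that astype(str) also turns any integers in the column into strings, so this is probably best limited to columns that are supposed to hold text ranges.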
You can iterate the column with an apply and check whether each element is an instance of pd.Timestamp; if so, extract a "day-month" string, otherwise leave it as it is.
Ex:
import pandas as pd
# what you have is something like (mixed datatype column/Series)
df = pd.DataFrame({'label': ['0-2', '1-3', pd.Timestamp('2019-05-03')]})
# iterate the column with an apply, extract day-month string if pd.Timestamp
df['label1'] = df['label'].apply(lambda x: f"{x.day}-{x.month}" if isinstance(x, pd.Timestamp) else x)
# ... to get
df['label1']
0 0-2
1 1-3
2 3-5
Name: label1, dtype: object
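If several columns are affected, the same isinstance check can be run over the whole frame. This is only a sketch: the column names and the helper restore_range are invented for illustration, and it assumes every date in the sheet is really a misread "day-month" range. Checking against datetime.datetime also covers pd.Timestamp, since pd.Timestamp is a subclass of it.

```python
import datetime as dt
import pandas as pd

def restore_range(value):
    """Return 'day-month' for datetime-like cells, leave other cells untouched."""
    # pd.Timestamp subclasses datetime.datetime, so this check covers both
    if isinstance(value, dt.datetime):
        return f"{value.day}-{value.month}"
    return value

# Hypothetical frame with one mixed-type column
df = pd.DataFrame({
    'age': ['40-49', '50-59'],
    'inv_nodes': ['0-2', pd.Timestamp('2019-05-03')],
    'deg_malig': [3, 2],
})

# Map the helper over every column, element by element
cleaned = df.apply(lambda col: col.map(restore_range))
print(cleaned)
```

Columns that hold genuine dates would be converted as well, so in that case it is safer to pass only the affected columns.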