How can I replace timestamps in a dataframe with strings when the column is not all timestamps?

Question

I am trying to build a machine learning model using an Excel spreadsheet that cannot be edited. A few of the columns in the .xls have formatting issues, so some of the data is displayed as a datetime stamp instead of a str or int. Here is an example from the pd dataframe:

0     40-49   premeno      15-19                  0-2       yes          3   
1     50-59      ge40      15-19                  0-2        no          1   
2     50-59      ge40      35-39                  0-2        no          2   
3     40-49   premeno      35-39                  0-2       yes          3   
4     40-49   premeno      30-34  **2019-05-03 00:00:00**       yes          2

In line 4, the value 3-5 has been accidentally formatted as a date (shown as 03 May in the xls) and so is stored as a datetime stamp in the dataframe. I have tried many methods to replace 2019-05-03 00:00:00 with 3-5, including:

df['column'] = df['column'].replace([('2019-05-03 00:00:00')], '3-5')

and using Timestamp.replace, but neither seems to work. Any ideas on how to replace these misformatted data points with the correct data?
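One likely reason the string-based replace has no effect is that the affected cells hold pd.Timestamp objects rather than text, so a string never compares equal to them. The following is a minimal sketch for checking this, assuming a small stand-in column (the values are hypothetical and only mimic the sample above):

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet column: strings plus one Timestamp
df = pd.DataFrame({"column": ["0-2", "0-2", pd.Timestamp("2019-05-03")]})

# Name of the Python type held in each cell
cell_types = df["column"].map(lambda v: type(v).__name__)
print(cell_types.tolist())

# Boolean mask marking the cells that are Timestamp objects
is_ts = df["column"].map(lambda v: isinstance(v, pd.Timestamp))
print(is_ts.tolist())
```

If the mask flags any rows, the fix has to target Timestamp objects (or their string form after astype(str)) rather than a literal string.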

Answer 1

There might be a simpler way, but you may need to apply re.search with positive lookarounds.

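import re

# Raw strings keep the backslashes intact. The ** around the value in the
# question looks like bold markup rather than cell content, so it is not matched.
pat1 = r'(?<=\d{4}-0\d-0)(\d)(?= 00:00:00)'    # single-digit day
pat2 = r'(?<=\d{4}-0)(\d)(?=-0\d 00:00:00)'    # single-digit month

def fix(x):
    # Search the cell value itself, not a hard-coded example string
    day, month = re.search(pat1, x), re.search(pat2, x)
    # Day first, so 2019-05-03 becomes 3-5; anything else is left unchanged
    return f"{day.group()}-{month.group()}" if day and month else x

df['column'] = df['column'].astype(str).apply(fix)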
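A possibly simpler variant of the same regex idea uses capture groups instead of lookarounds. This is only a sketch, assuming the cells have already been converted to text (for example with astype(str)) and that a misformatted cell looks exactly like YYYY-MM-DD 00:00:00. The names TS_PATTERN and timestamp_text_to_range are illustrative and not part of the answer.

```python
import re

# Matches text such as "2019-05-03 00:00:00"; groups capture month and day
TS_PATTERN = re.compile(r"\d{4}-(\d{2})-(\d{2}) 00:00:00")

def timestamp_text_to_range(text):
    """Return 'day-month' for timestamp-like text, otherwise the text unchanged."""
    m = TS_PATTERN.fullmatch(text)
    if m is None:
        return text
    # int() drops leading zeros, so "05" becomes 5
    month, day = int(m.group(1)), int(m.group(2))
    return f"{day}-{month}"

print(timestamp_text_to_range("2019-05-03 00:00:00"))
print(timestamp_text_to_range("0-2"))
```

Unlike the lookaround patterns, this version also handles two-digit days and months, and it could be passed to apply on the string-converted column.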

Answer 2

You can iterate over the column with an apply and check whether each element is an instance of pd.Timestamp; if so, extract a "day-month" string, otherwise leave it as it is.

Ex:

import pandas as pd

# what you have is something like (mixed datatype column/Series)
df = pd.DataFrame({'label': ['0-2', '1-3', pd.Timestamp('2019-05-03')]})

# iterate the column with an apply, extract day-month string if pd.Timestamp
df['label1'] = df['label'].apply(lambda x: f"{x.day}-{x.month}" if isinstance(x, pd.Timestamp) else x)

# ... to get
df['label1'] 
0    0-2
1    1-3
2    3-5
Name: label1, dtype: object

See also: Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?
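Depending on how the spreadsheet is read, the misformatted cells may arrive as plain datetime.datetime objects rather than pd.Timestamp. Since pd.Timestamp is a subclass of datetime.datetime, checking against the base class covers both cases. The sketch below assumes that situation; the helper name to_range_label and the sample values are hypothetical.

```python
import datetime

import pandas as pd

def to_range_label(value):
    """Turn a date-like cell into 'day-month'; leave other cells unchanged."""
    # pd.Timestamp subclasses datetime.datetime, so one check covers both
    if isinstance(value, datetime.datetime):
        return f"{value.day}-{value.month}"
    return value

# Hypothetical mixed column: a string, a Timestamp and a plain datetime
df = pd.DataFrame(
    {"label": ["0-2", pd.Timestamp("2019-05-03"), datetime.datetime(2019, 5, 3)]}
)
df["label1"] = df["label"].apply(to_range_label)
print(df["label1"].tolist())
```

This keeps the approach of the answer above and only widens the type check.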


Answer 1
0 2021-04-21 00:08:44

Answer 2
0 2021-04-21 06:26:35
