
How can I replace a timestamp in a dataframe with a string if the column is not all timestamps?

I am trying to build a machine learning model using an Excel spreadsheet that cannot be edited. A few of the columns in the .xls file have formatting issues, so some of the data is read as a datetime stamp instead of a str or int. Here is an example from the pandas dataframe:

0     40-49   premeno      15-19                  0-2       yes          3   
1     50-59      ge40      15-19                  0-2        no          1   
2     50-59      ge40      35-39                  0-2        no          2   
3     40-49   premeno      35-39                  0-2       yes          3   
4     40-49   premeno      30-34  **2019-05-03 00:00:00**       yes          2

In row 4, the value 3-5 was accidentally formatted as a date (shown as 03 May in the .xls file), so it is read into the dataframe as a datetime stamp. I have tried several ways to replace 2019-05-03 00:00:00 with 3-5, including:

df['column'] = df['column'].replace([('2019-05-03 00:00:00')], '3-5') 

and using Timestamp.replace, but neither seems to work. Any ideas on how to replace these misformatted data points with the correct data?

There might be a simpler way, but you could apply re.search with lookarounds (a lookbehind and a lookahead) to pull the day and month digits out of the stringified timestamp.

import re

# Day and month digits of a stringified timestamp such as '2019-05-03 00:00:00'
# (these patterns only handle single-digit days and months).
pat1 = r'(?<=\d{4}-0\d-0)(\d)(?= 00:00:00)'   # day
pat2 = r'(?<=\d{4}-0)(\d)(?=-0\d 00:00:00)'   # month

df['column'] = df['column'].astype(str).apply(
        lambda x: (re.search(pat1, x).group()
                   + '-' + re.search(pat2, x).group())
                   if re.search(pat1, x) and re.search(pat2, x) else x
     )
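
As a variation on the same idea, here is a minimal sketch that uses a single pattern with capture groups instead of lookarounds, so two-digit days and months are handled as well. It assumes the timestamp cells stringify as 'YYYY-MM-DD 00:00:00' and that the intended value is "day-month", as in the question; the sample Series and the helper name day_month are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical mixed column mirroring the question's data.
s = pd.Series(['0-2', '0-2', pd.Timestamp('2019-05-03')], dtype=object)

# One pattern: group 1 is the month, group 2 is the day.
ts_pat = re.compile(r'\d{4}-(\d{2})-(\d{2}) 00:00:00')

def day_month(text):
    """Turn '2019-05-03 00:00:00' into '3-5'; return other strings unchanged."""
    m = ts_pat.fullmatch(text)
    # int() drops the leading zero: '03' -> 3
    return f"{int(m.group(2))}-{int(m.group(1))}" if m else text

fixed = s.astype(str).apply(day_month)
print(fixed.tolist())
```

Note that astype(str) converts every cell to a string, so this only suits columns where the remaining values are meant to be strings anyway.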

You can iterate over the column with an apply and check whether each element is an instance of pd.Timestamp; if so, extract a "day-month" string, otherwise leave it as it is.

Ex:

import pandas as pd

# what you have is something like (mixed datatype column/Series)
df = pd.DataFrame({'label': ['0-2', '1-3', pd.Timestamp('2019-05-03')]})

# iterate the column with an apply, extract day-month string if pd.Timestamp
df['label1'] = df['label'].apply(lambda x: f"{x.day}-{x.month}" if isinstance(x, pd.Timestamp) else x)

# ... to get
df['label1'] 
0    0-2
1    1-3
2    3-5
Name: label1, dtype: object
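
If several columns are affected, the same isinstance check can be wrapped in a small helper and applied column by column. The following is only a sketch: the column names inv_nodes and tumor_size, the sample values, and the helper name restore_range are assumptions for illustration, not taken from the original spreadsheet. It checks against datetime.datetime, which pd.Timestamp subclasses, so plain datetime objects coming from an Excel reader would be caught too.

```python
import datetime
import pandas as pd

def restore_range(value):
    """Return 'day-month' for datetime-like cells, leave other cells untouched."""
    # pd.Timestamp subclasses datetime.datetime, so one check covers both.
    if isinstance(value, datetime.datetime):
        return f"{value.day}-{value.month}"
    return value

# Hypothetical frame with two affected (object dtype) columns.
df = pd.DataFrame({
    'inv_nodes': ['0-2', pd.Timestamp('2019-05-03'), '6-8'],
    'tumor_size': ['15-19', '30-34', datetime.datetime(2019, 9, 5)],
})

# Apply the helper only to the columns known to contain stray dates.
for col in ['inv_nodes', 'tumor_size']:
    df[col] = df[col].apply(restore_range)

print(df)
```

Unlike the string-based approach, this leaves non-date cells with their original types, which matters if a column also holds integers.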
