I am trying to build a machine learning model using an Excel spreadsheet that cannot be edited. A few of the columns in the .xls file have formatting issues, so some of the data is read as a datetime stamp instead of a str or int. Here is an example from the pandas dataframe:
0 40-49 premeno 15-19 0-2 yes 3
1 50-59 ge40 15-19 0-2 no 1
2 50-59 ge40 35-39 0-2 no 2
3 40-49 premeno 35-39 0-2 yes 3
4 40-49 premeno 30-34 **2019-05-03 00:00:00** yes 2
In row 4, the value 3-5 has been accidentally formatted as a date (shown as 03 May in the xls) and so is stored as a datetime stamp in the dataframe. I have tried many methods to replace 2019-05-03 00:00:00 with 3-5, including:
df['column'] = df['column'].replace([('2019-05-03 00:00:00')], '3-5')
and using Timestamp.replace, but neither seems to work. Any ideas on how to replace these misformatted data points with the correct data?
There might be a simpler way, but you could apply re.search with positive lookarounds.
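import re
# After astype(str), a Timestamp cell reads '2019-05-03 00:00:00'
# (assuming the ** markers in the question are emphasis only, not part of the data).
pat_day = r'(?<=\d{4}-\d{2}-)\d{2}(?= 00:00:00)'
pat_month = r'(?<=\d{4}-)\d{2}(?=-\d{2} 00:00:00)'
def fix(x):
    day, month = re.search(pat_day, x), re.search(pat_month, x)
    # Rebuild "day-month" without leading zeros; leave every other value untouched
    return f"{int(day.group())}-{int(month.group())}" if day and month else x
df['column'] = df['column'].astype(str).apply(fix)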
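As a possibly simpler variant of the same idea, the two lookaround searches can be folded into one substitution with capture groups. This is only a sketch: the sample Series and the names `s`, `pattern`, and `fixed` are made up for illustration, and it assumes every mis-read cell stringifies as `YYYY-MM-DD 00:00:00` and should become "day-month".

```python
import pandas as pd

# Hypothetical sample mirroring the question: one cell was read as a date
s = pd.Series(['0-2', '15-19', pd.Timestamp('2019-05-03')], dtype=object)

# Group 1 is the month, group 2 the day; the optional 0 drops a leading zero
pattern = r'^\d{4}-0?(\d{1,2})-0?(\d{1,2}) 00:00:00$'

# Emit "day-month"; strings that do not look like a timestamp are left alone
fixed = s.astype(str).str.replace(pattern, r'\2-\1', regex=True)
print(fixed.tolist())
```

Note that `astype(str)` turns every cell into a string, so a column that also holds genuine integers would need those converted back afterwards.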
You can iterate over the column with an apply and check whether each element is an instance of pd.Timestamp; if so, extract a "day-month" string, otherwise leave it as it is.
Ex:
import pandas as pd
# what you have is something like (mixed datatype column/Series)
df = pd.DataFrame({'label': ['0-2', '1-3', pd.Timestamp('2019-05-03')]})
# iterate the column with an apply, extract day-month string if pd.Timestamp
df['label1'] = df['label'].apply(lambda x: f"{x.day}-{x.month}" if isinstance(x, pd.Timestamp) else x)
# ... to get
df['label1']
0 0-2
1 1-3
2 3-5
Name: label1, dtype: object
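Depending on how the file is read, the mis-formatted cells may arrive as plain datetime.datetime objects rather than pd.Timestamp. Since pd.Timestamp is a subclass of datetime.datetime, checking against the base class should cover both cases. A minimal sketch, where the helper name `to_day_month` and the sample values are hypothetical, and the "day-month" order is an assumption taken from the 3-5 / 03 May example in the question:

```python
import datetime as dt
import pandas as pd

def to_day_month(value):
    # pd.Timestamp subclasses datetime.datetime, so this check covers both
    if isinstance(value, dt.datetime):
        return f"{value.day}-{value.month}"
    return value  # ordinary strings and numbers pass through unchanged

# Hypothetical mixed column: a str, a datetime.datetime and a pd.Timestamp
df = pd.DataFrame(
    {'label': ['0-2', dt.datetime(2019, 5, 3), pd.Timestamp('2019-10-12')]}
)
df['label1'] = df['label'].apply(to_day_month)
print(df['label1'].tolist())
```

To see which cells are affected before fixing them, `df['label'].map(type).value_counts()` lists the Python types present in the column.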