I am just starting out with data science, so apologies if this is a basic question with a simple answer, but I have been searching Google for hours and have tried multiple solutions to no avail.
Basically, some values in my dataset have been automatically converted to dates, e.g. 3-5 became 03-May. I can't simply change the values in Excel; I need to clean the data in Python. My first thought was to use replace, i.e. df = df.replace('2019-05-03 00:00:00', '3-5')
but it doesn't work, presumably because the cell holds a timestamp rather than a str(?). It does work if I replace one string with another, i.e. df = df.replace('0-2', '3-5').
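A quick way to test the dtype suspicion is to count the Python types stored in one of the affected columns. The sketch below uses a small made-up Series standing in for 'inv-nodes' (the values are illustrative, not taken from the real file), on the assumption that the converted cells come through as datetime objects:

```python
import datetime as dt
import pandas as pd

# Hypothetical stand-in for the 'inv-nodes' column: two real strings
# and one cell that Excel turned into a date.
s = pd.Series(["0-2", dt.datetime(2019, 5, 3), "0-2"], name="inv-nodes")

# Count how many cells hold each Python type.
type_counts = s.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If anything other than str shows up in the counts, a string passed to replace has nothing to match in those cells.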
I can't simply treat those values as missing either, since this is a formatting error rather than a spurious entry.
Is there a simple way of doing this?
Listed below is an example snippet of the data I am working with:
Please see below for the code:
#Dependencies
import pytest
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
from google.colab import drive
import io
#Import data
from google.colab import files
upload = files.upload()
df = pd.read_excel(io.BytesIO(upload['breast-cancer.xls']))
df
#Clean Data
df.dtypes
#Correcting tumor-size and inv-nodes values
'''def clean_data(dataset):
    for i in dataset:
        dataset = dataset.replace('2019-05-03 00:00:00','3-5')
        dataset = dataset.replace('2019-08-06 00:00:00','6-8')
        dataset = dataset.replace('2019-09-11 00:00:00','9-11')
        dataset = dataset.replace('2014-12-01 00:00:00','12-14')
        dataset = dataset.replace('2014-10-01 00:00:00','10-14')
        dataset = dataset.replace('2019-09-05 00:00:00','5-9')
    return dataset
cleaned_dataset = dataset.apply(clean_data)
cleaned_dataset'''
df = df.replace('2019-05-03 00:00:00', '3-5')
df
#Check for duplicates
df.duplicated()
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)
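That line of code saved the day: once both columns hold plain strings, the date cells become text such as '2019-05-03 00:00:00', which the string-based replace calls can match.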
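Putting the cast and the replacements together might look like the sketch below. The small DataFrame is a made-up stand-in for the real spreadsheet, and the name date_to_range is hypothetical; the date-to-range pairs are the ones from the question. It assumes the converted cells arrive as datetime objects, so that str() renders them as 'YYYY-MM-DD 00:00:00'.

```python
import datetime as dt
import pandas as pd

# Hypothetical stand-in for the two affected columns of the real file.
df = pd.DataFrame({
    "tumor-size": ["30-34", dt.datetime(2014, 10, 1), "20-24", dt.datetime(2019, 9, 5)],
    "inv-nodes": ["0-2", dt.datetime(2019, 5, 3), dt.datetime(2019, 8, 6), "0-2"],
})

# Mapping from the stringified dates back to the intended ranges
# (pairs taken from the question).
date_to_range = {
    "2019-05-03 00:00:00": "3-5",
    "2019-08-06 00:00:00": "6-8",
    "2019-09-11 00:00:00": "9-11",
    "2014-12-01 00:00:00": "12-14",
    "2014-10-01 00:00:00": "10-14",
    "2019-09-05 00:00:00": "5-9",
}

cols = ["tumor-size", "inv-nodes"]
# Cast to str first so every cell is text, then replace whole-cell matches.
df[cols] = df[cols].astype(str).replace(date_to_range)
print(df)
```

Using one dict keeps all the corrections in a single place, which is easier to extend than a chain of separate replace calls.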