What is the best way to replace the format of data in a large dataset?

Question

I am just starting out with data science, so apologies if this is a bone question with a simple answer, but I have been searching Google for hours and have tried multiple solutions to no avail.

Basically, some values in my dataset have been automatically converted, such as 3-5 to 03-May. I am not able to simply change the values in Excel; I need to clean the data in Python. My first thought was to use replace, i.e. df = df.replace('2019-05-03 00:00:00', '3-5'), but it doesn't work, presumably because the dtype differs between the timestamp and the str(?). It does work if I adjust the code to replace a plain string, i.e. df = df.replace('0-2', '3-5').

I can't simply treat those values as missing either, as they are formatting errors rather than spurious entries.

Is there a simple way of doing this?

Listed below is an example snippet of the data I am working with:

GitHub public gist

Please see below for the code:

#Dependencies
import pytest
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
from google.colab import drive
import io

#Import data
from google.colab import files
upload = files.upload()
df = pd.read_excel(io.BytesIO(upload['breast-cancer.xls']))

df

#Clean Data
df.dtypes

#Correcting tumor-size and inv-nodes values
'''def clean_data(dataset):
      for i in dataset:
         dataset = dataset.replace('2019-05-03 00:00:00','3-5')
         dataset = dataset.replace('2019-08-06 00:00:00','6-8')
         dataset = dataset.replace('2019-09-11 00:00:00','9-11')
         dataset = dataset.replace('2014-12-01 00:00:00','12-14')
         dataset = dataset.replace('2014-10-01 00:00:00','10-14')
         dataset = dataset.replace('2019-09-05 00:00:00','5-9')
      return dataset

   cleaned_dataset = dataset.apply(clean_data)
   cleaned_dataset'''

df = df.replace('2019-05-03 00:00:00', '3-5')
df

#Check for duplicates
df.duplicated()

Answer 1

df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)

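That line of code saved the day. Casting both columns to str turns the datetime values into strings such as '2019-05-03 00:00:00', so the string-based replace calls from the question then find a match.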
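For completeness, here is a minimal sketch of how the cast might be combined with the replacements listed in the question. The small DataFrame is a made-up stand-in for the real spreadsheet (only the column names and the date-to-range pairs come from the question), and `date_to_range` is just an illustrative name.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data: some ranges were turned into dates, so each
# column holds a mix of datetime objects and plain strings (object dtype).
df = pd.DataFrame({
    "tumor-size": ["30-34", datetime(2014, 10, 1), "20-24", datetime(2019, 9, 5)],
    "inv-nodes": ["0-2", datetime(2019, 5, 3), datetime(2019, 8, 6), "0-2"],
})

cols = ["tumor-size", "inv-nodes"]

# Cast everything to str first, so the datetime values become strings
# like '2019-05-03 00:00:00' and can be matched by string comparison.
df[cols] = df[cols].astype(str)

# Date-string -> intended range pairs, taken from the question's own attempt.
date_to_range = {
    "2019-05-03 00:00:00": "3-5",
    "2019-08-06 00:00:00": "6-8",
    "2019-09-11 00:00:00": "9-11",
    "2014-12-01 00:00:00": "12-14",
    "2014-10-01 00:00:00": "10-14",
    "2019-09-05 00:00:00": "5-9",
}

# A flat dict passed to replace maps each old value to its new value.
df[cols] = df[cols].replace(date_to_range)

print(df)
```

Passing one dict to replace avoids chaining a separate call per value, and any string that is not a key in the dict is left untouched.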
