
Converting string to date format 0000-00-00 before extracting year and month information

I have a string column (object type):

Date
2020-06-15
2019-07-23
Data non available

How can I convert the string 'Data non available' (and any other missing values) to the format 0000-00-00 before extracting the month and the year?

I can convert the Date column to datetime and extract the year and month from the valid rows, but the invalid row raises this error:

ParserError: Unknown string format: Data not available

My code:

df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

Expected output:

Date             Date_Year        Date_month
2020-06-15          2020              06
2019-07-23          2019              07
Data non available  0000              00

I'd replace 'Data non available' with NaT (Not a Time, the NaN equivalent for dates/times) before you call to_datetime:

df['Date'] = df['Date'].replace({'Data non available': pd.NaT})

Then

df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

Output:

>>> df
        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN
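
If the literal 0000 / 00 placeholders from the question's expected output are still wanted, one possible sketch is to keep the original strings in Date, parse into a separate Series, and format the parts as text. The name `parsed` is introduced here for illustration only, and the resulting columns hold strings rather than integers:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-06-15', '2019-07-23', 'Data non available']})

# Parse into a separate Series so the original strings stay in df['Date'].
parsed = pd.to_datetime(df['Date'].replace({'Data non available': pd.NaT}))

# strftime leaves NaT rows as missing; fillna swaps in the placeholders.
df['Date_Year'] = parsed.dt.strftime('%Y').fillna('0000')
df['Date_month'] = parsed.dt.strftime('%m').fillna('00')

print(df)
```

Because the placeholders are text, any later arithmetic on these columns would need them converted back to numbers first.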

Do you really want to convert 'Data non available' to the format '0000-00-00'? pandas.to_datetime cannot parse that, because month 00 and day 00 are not valid calendar values. I'm assuming that you just want to mark those rows as missing.
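
As a quick check of that claim, the following minimal sketch tries to parse '0000-00-00' directly. The `raised` flag is only there for illustration, and the exact error message may differ between pandas versions:

```python
import pandas as pd

raised = False
try:
    pd.to_datetime('0000-00-00')
except ValueError as exc:
    # Parsing errors raised here derive from ValueError.
    raised = True
    print(f'rejected: {exc}')

# With errors='coerce' the same input becomes NaT instead of raising.
coerced = pd.to_datetime('0000-00-00', errors='coerce')
print(coerced)

# Earliest timestamp representable at nanosecond resolution.
print(pd.Timestamp.min)
```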

If marking them as missing is enough, you can use pandas.to_datetime with errors='coerce'. From the docs:

errors {'ignore', 'raise', 'coerce'}, default 'raise'

  • If 'raise', then invalid parsing will raise an exception.
  • If 'coerce', then invalid parsing will be set as NaT.
  • If 'ignore', then invalid parsing will return the input.

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

>>> df

        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN 
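
The year and month show up as floats (2020.0, 6.0) because a NaN forces the column to a float dtype. If whole numbers are preferred, one option is pandas' nullable integer dtype. This is a sketch assuming a pandas version that supports 'Int64' (1.0 or later):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-06-15', '2019-07-23', 'Data non available']})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype; missing rows become <NA>.
df['Date_Year'] = df['Date'].dt.year.astype('Int64')
df['Date_month'] = df['Date'].dt.month.astype('Int64')

print(df)
```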

However, some values other than 'Data non available' may fail to parse for unrelated reasons. With errors='coerce' you would not notice, because they are also silently converted to NaT (Not a Time), so you could not identify the problem. To make sure that only 'Data non available' becomes NaT, mask just those values before converting to datetime:

mask = df['Date'] == 'Data non available'
df['Date'] = pd.to_datetime(df['Date'].mask(mask))
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

>>> df

        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN 
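
If errors='coerce' is used anyway, the values it silently dropped can still be listed afterwards by comparing the raw strings with the parsed result. In this sketch the fourth value, '2020-13-45', is an invented bad date added purely for illustration, and `raw`, `parsed` and `unexpected` are hypothetical names:

```python
import pandas as pd

# '2020-13-45' is a made-up invalid date, not part of the original data.
raw = pd.Series(['2020-06-15', '2019-07-23', 'Data non available', '2020-13-45'])

parsed = pd.to_datetime(raw, errors='coerce')

# Rows that became NaT although they were neither missing nor the known sentinel.
unexpected = raw[parsed.isna() & raw.notna() & (raw != 'Data non available')]

print(unexpected)
```

An empty `unexpected` would mean that every NaT came from the known sentinel or from a value that was already missing.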


 