简体   繁体   中英

Python read csv to Dataframe, stuggeling with date columns

Python read csv to Dataframe, stuggeling with date columns

Hi all,

I have problems by reading in a csv which looks like:

col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-

|col_A|col_B|col_C|Col_Date_1 |Col_Date_2 |Col_Date_3|
----------------------------------------------------
|57   |-    |60   |03.02.2020 |-          |06.07.2020|
|126  |8    |-    |03.02.2020 |04.03.2020 |06.07.2020|
|-    |45   |-    |30.01.2020 |29.02.2020 |29.06.2020|
|106  |83   |189  |-          |29.02.2020 |29.06.2020|
|-    |12   |84   |30.01.2020 |29.02.2020 |-         |

Here is how I tried to read in the CSV.

import pandas as pd
df_puma = pd.read_csv(test.csv, sep=";",dayfirst=True, parse_dates=['Col_Date_1','Col_Date_2','Col_Date_3'], encoding='latin-1')

Unfortunately, both kinds of columns (the first 3 integers and the last 3 with dates) are not automatically in the right type.

df.info()
----------
col_A                404 non-null object
col_B                404 non-null object
col_C                404 non-null object
Col_Date_1           404 non-null object
Col_Date_2           404 non-null object
Col_Date_3           404 non-null object

Well, I hoped at least the date columns should be recognized as a kind of date, unfortunately not:(. Like:

df.info()
----------
col_A                404 non-null int64
col_B                404 non-null int64
col_C                404 non-null int64
Col_Date_1           404 non-null datetime64[ns]
Col_Date_2           404 non-null datetime64[ns]
Col_Date_3           404 non-null datetime64[ns]

Could someone give me a hint, how to to get the data in the right type? In my mind would it be like:

col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;NaN;60;03.02.2020;NaT;06.07.2020
126;8;NaN;03.02.2020;04.03.2020;06.07.2020
NaN;45;NaN;30.01.2020;29.02.2020;29.06.2020
106;83;189;NaT;29.02.2020;29.06.2020
NaN;12;84;30.01.2020;29.02.2020;NaT

|col_A|col_B|col_C|Col_Date_1 |Col_Date_2 |Col_Date_3|
----------------------------------------------------
|57   |NaN  |60   |03.02.2020 |NaT        |06.07.2020|
|126  |8    |NaN  |03.02.2020 |04.03.2020 |06.07.2020|
|NaN  |45   |NaN  |30.01.2020 |29.02.2020 |29.06.2020|
|106  |83   |189  |NaT        |29.02.2020 |29.06.2020|
|NaN  |12   |84   |30.01.2020 |29.02.2020 |NaT       |

Do I have to iterate through all the columns and rows and clean up the "-" entities? I still on a quiet newbie level in Python and don't know what is the best solution...

Hope you guys can help me.

Replace your - values with nan and then parse the dates

from io import StringIO
import pandas as pd

s = """col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-"""

df = pd.read_csv(StringIO(s), sep=';', na_values='-',
                 parse_dates=[3,4,5], dayfirst=True)

   col_A  col_B  col_C Col_Date_1 Col_Date_2 Col_Date_3
0   57.0    NaN   60.0 2020-02-03        NaT 2020-07-06
1  126.0    8.0    NaN 2020-02-03 2020-03-04 2020-07-06
2    NaN   45.0    NaN 2020-01-30 2020-02-29 2020-06-29
3  106.0   83.0  189.0        NaT 2020-02-29 2020-06-29
4    NaN   12.0   84.0 2020-01-30 2020-02-29        NaT

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM