Python read csv to DataFrame, struggling with date columns

Hi all,

I'm having trouble reading in a CSV file that looks like this:

col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-

|col_A|col_B|col_C|Col_Date_1 |Col_Date_2 |Col_Date_3|
----------------------------------------------------
|57   |-    |60   |03.02.2020 |-          |06.07.2020|
|126  |8    |-    |03.02.2020 |04.03.2020 |06.07.2020|
|-    |45   |-    |30.01.2020 |29.02.2020 |29.06.2020|
|106  |83   |189  |-          |29.02.2020 |29.06.2020|
|-    |12   |84   |30.01.2020 |29.02.2020 |-         |

Here is how I tried to read in the CSV:

import pandas as pd
# The file name must be a quoted string; the result is named df to match the df.info() calls
df = pd.read_csv('test.csv', sep=';', dayfirst=True, parse_dates=['Col_Date_1', 'Col_Date_2', 'Col_Date_3'], encoding='latin-1')

Unfortunately, neither group of columns (the first three with integers, the last three with dates) is read in with the right type automatically:

df.info()
----------
col_A                404 non-null object
col_B                404 non-null object
col_C                404 non-null object
Col_Date_1           404 non-null object
Col_Date_2           404 non-null object
Col_Date_3           404 non-null object

I had hoped that at least the date columns would be recognized as dates, but unfortunately they are not. I expected something like:

df.info()
----------
col_A                404 non-null int64
col_B                404 non-null int64
col_C                404 non-null int64
Col_Date_1           404 non-null datetime64[ns]
Col_Date_2           404 non-null datetime64[ns]
Col_Date_3           404 non-null datetime64[ns]

Could someone give me a hint on how to get the data into the right types? In my mind, the result would look like this:

col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;NaN;60;03.02.2020;NaT;06.07.2020
126;8;NaN;03.02.2020;04.03.2020;06.07.2020
NaN;45;NaN;30.01.2020;29.02.2020;29.06.2020
106;83;189;NaT;29.02.2020;29.06.2020
NaN;12;84;30.01.2020;29.02.2020;NaT

|col_A|col_B|col_C|Col_Date_1 |Col_Date_2 |Col_Date_3|
----------------------------------------------------
|57   |NaN  |60   |03.02.2020 |NaT        |06.07.2020|
|126  |8    |NaN  |03.02.2020 |04.03.2020 |06.07.2020|
|NaN  |45   |NaN  |30.01.2020 |29.02.2020 |29.06.2020|
|106  |83   |189  |NaT        |29.02.2020 |29.06.2020|
|NaN  |12   |84   |30.01.2020 |29.02.2020 |NaT       |

Do I have to iterate through all the columns and rows and clean up the "-" entries? I am still very much a beginner in Python and don't know what the best solution is.

I hope you can help me.

Replace your "-" values with NaN while reading (via na_values) and then parse the dates:

from io import StringIO
import pandas as pd

s = """col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-"""

df = pd.read_csv(StringIO(s), sep=';', na_values='-',
                 parse_dates=[3,4,5], dayfirst=True)

   col_A  col_B  col_C Col_Date_1 Col_Date_2 Col_Date_3
0   57.0    NaN   60.0 2020-02-03        NaT 2020-07-06
1  126.0    8.0    NaN 2020-02-03 2020-03-04 2020-07-06
2    NaN   45.0    NaN 2020-01-30 2020-02-29 2020-06-29
3  106.0   83.0  189.0        NaT 2020-02-29 2020-06-29
4    NaN   12.0   84.0 2020-01-30 2020-02-29        NaT
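
Note that the three number columns come back as float64 here (57.0, 126.0, ...), because NaN is a float and a plain int64 column cannot hold missing values. If integer types matter, one possible follow-up is pandas' nullable "Int64" dtype. The sketch below is not part of the original answer; it assumes a reasonably recent pandas version, and the names csv_text, int_cols and date_cols are only illustrative. It also parses the dates with an explicit format instead of dayfirst=True, which is an alternative rather than a requirement.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question ("-" marks a missing value)
csv_text = """col_A;col_B;col_C;Col_Date_1;Col_Date_2;Col_Date_3
57;-;60;03.02.2020;-;06.07.2020
126;8;-;03.02.2020;04.03.2020;06.07.2020
-;45;-;30.01.2020;29.02.2020;29.06.2020
106;83;189;-;29.02.2020;29.06.2020
-;12;84;30.01.2020;29.02.2020;-"""

# Illustrative helper lists (hypothetical names, not from the original post)
int_cols = ["col_A", "col_B", "col_C"]
date_cols = ["Col_Date_1", "Col_Date_2", "Col_Date_3"]

# Treat "-" as missing while reading; the date columns stay plain strings for now
df = pd.read_csv(StringIO(csv_text), sep=";", na_values="-")

# float64 -> nullable integer; missing values become <NA> instead of NaN
df = df.astype({col: "Int64" for col in int_cols})

# Parse day.month.year strings explicitly; missing values become NaT
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format="%d.%m.%Y")

print(df.dtypes)
```

With this, the number columns should report Int64 (capital "I") and the date columns a datetime64 dtype, which is close to the df.info() output the question asks for.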
