
data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm learning pandas in Python.

I'm trying to remove the NaN elements from my data.csv file with data.dropna(), but they are not being removed.

import pandas as pd

data = pd.read_csv('data.csv')

new_data = data.dropna()

print(new_data)

This is the content of data.csv:

Duration          Date  Pulse  Maxpulse  Calories
      60  '2020/12/01'    110       130     409.1
      60  '2020/12/02'    117       145     479.0
      60  '2020/12/03'    103       135     340.0
      45  '2020/12/04'    109       175     282.4
      45  '2020/12/05'    117       148     406.0
      60  '2020/12/06'    102       127     300.0
      60  '2020/12/07'    110       136     374.0
     450  '2020/12/08'    104       134     253.3
      30  '2020/12/09'    109       133     195.1
      60  '2020/12/10'     98       124     269.0
      60  '2020/12/11'    103       147     329.3
      60  '2020/12/12'    100       120     250.7
      60  '2020/12/12'    100       120     250.7
      60  '2020/12/13'    106       128     345.3
      60  '2020/12/14'    104       132     379.3
      60  '2020/12/15'     98       123     275.0
      60  '2020/12/16'     98       120     215.2
      60  '2020/12/17'    100       120     300.0
      45  '2020/12/18'     90       112       NaN
      60  '2020/12/19'    103       123     323.0
      45  '2020/12/20'     97       125     243.0
      60  '2020/12/21'    108       131     364.2
      45           NaN    100       119     282.0
      60  '2020/12/23'    130       101     300.0
      45  '2020/12/24'    105       132     246.0
      60  '2020/12/25'    102       126     334.5
      60    2020/12/26    100       120     250.0
      60  '2020/12/27'     92       118     241.0
      60  '2020/12/28'    103       132       NaN
      60  '2020/12/29'    100       132     280.0
      60  '2020/12/30'    102       129     380.3
      60  '2020/12/31'     92       115     243.0

My guess is that data.csv is written incorrectly?

The data.csv file is written incorrectly. To fix it, you need to separate the fields with commas.

Corrected format of data.csv:

Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,20201226,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
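
To see why the commas matter, here is a minimal sketch using a few rows from the question, read from in-memory strings rather than from data.csv (the shortened samples and the variable names are only for illustration). With the default comma separator, each whitespace-aligned line ends up as one long string in a single column, so the text "NaN" inside it is not a missing value and dropna() has nothing to drop.

```python
import io
import pandas as pd

# Whitespace-aligned text, like the file in the question (shortened sample).
spaced = """Duration          Date  Pulse  Maxpulse  Calories
      60  '2020/12/17'    100       120     300.0
      45  '2020/12/18'     90       112       NaN
      45           NaN    100       119     282.0
"""

# Comma-separated text, like the corrected file (shortened sample).
commas = """Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
45,,100,119,282.0
"""

bad = pd.read_csv(io.StringIO(spaced))
good = pd.read_csv(io.StringIO(commas))

print(bad.shape)           # a single column: each line is one string
print(len(bad.dropna()))   # no rows are dropped
print(good.shape)          # five separate columns
print(len(good.dropna()))  # rows with an empty field are dropped
```

Checking data.shape or data.columns right after read_csv is a quick way to tell whether the file was split into the columns you expect.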

TL;DR: Try this:

new_data = df.fillna(pd.NA).dropna()

or

import numpy as np
new_data = df.fillna(np.nan).dropna()

Is that the real csv file? I don't think so.

The CSV specification [1] does not say how missing values should be written. In my experience, a missing value in a csv file is an empty field between two separators (if the separator is a comma, it looks like ,,).

According to the pandas documentation [2], pandas.read_csv accepts an argument "na_values":

na_values: scalar, str, list-like, or dict, optional

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

If your csv file contains 'NaN', pandas recognizes it and reads it as NaN by default, but you can pass this parameter if your file marks missing values in some other way.
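
As a small illustration of the na_values parameter quoted above, the sketch below reads an in-memory csv in which one missing value is written as "missing". That placeholder string and the sample rows are made up for this example; "NaN" and the empty field are covered by the defaults.

```python
import io
import pandas as pd

# "missing" is a made-up placeholder; "NaN" and the empty field are defaults.
text = """Duration,Pulse,Calories
60,110,409.1
45,90,NaN
60,103,
30,missing,195.1
"""

default = pd.read_csv(io.StringIO(text))
custom = pd.read_csv(io.StringIO(text), na_values=["missing"])

print(default["Pulse"].isna().sum())  # "missing" is kept as a string
print(custom["Pulse"].isna().sum())   # "missing" is now read as NaN
print(len(custom.dropna()))           # only the complete row remains
```

Note that without na_values the Pulse column is read as strings, because one of its values is not numeric.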

You can also check the type of a single cell (where i is the row position and j is the column position):

type(df.iloc[i, j])

Compare the result with:

type(np.nan)  # NumPy NaN

float

type(pd.NA)  # pandas NA

pandas._libs.missing.NAType
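
Comparing types by hand is easy to get wrong, because a NaN cell is an ordinary float. A minimal sketch of the same check using pd.isna, which returns True for both np.nan and pd.NA (the tiny DataFrame here is invented for the example):

```python
import numpy as np
import pandas as pd

# Invented two-row frame with one missing Calories value.
df = pd.DataFrame({"Pulse": [110, 90], "Calories": [409.1, np.nan]})

cell = df.iloc[1, 1]            # row 1, column 1: the missing value
print(type(cell))               # a NumPy float type
print(isinstance(cell, float))  # NaN is still a float
print(pd.isna(cell))            # True for np.nan
print(pd.isna(pd.NA))           # True for pd.NA as well
```

If pd.isna returns False for a cell that displays as NaN, the cell most likely holds text rather than a missing value.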

[1] https://datatracker.ietf.org/doc/html/rfc4180

[2] https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
