Pandas fillna exception for 'NA' string

sample_file.txt

6|test|3|4
5|test||8
9|test|NA|12

Script

import pandas as pd
df = pd.read_csv('sample_file.txt', dtype='str', sep='|', names=['upc_cd', 'chr_typ', 'chr_vl','chr_vl_typ'])
df["chr_vl"].fillna("NOT AVLBL", inplace = True)
print(df)

Current output

  upc_cd chr_typ     chr_vl chr_vl_typ
0      6    test          3          4
1      5    test  NOT AVLBL          8
2      9    test  NOT AVLBL         12

Required output

  upc_cd chr_typ     chr_vl chr_vl_typ
0      6    test          3          4
1      5    test  NOT AVLBL          8
2      9    test         NA         12

Basically, I need NA to stay as it is in the output, while null values are replaced with the specific text 'NOT AVLBL'. I tried the replace method as well, but couldn't get the desired output.

The pandas read_csv function already defines a set of strings that are interpreted as NaN when you load a CSV file. You can either extend that list with other strings or overwrite it completely. In your case you have to overwrite it, because NA is one of the default values pandas uses. To do so, you could try something like:

df = pd.read_csv('sample_file.txt', dtype='str', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl','chr_vl_typ'],
                 na_values=[''], keep_default_na=False)
...

This interprets only the empty string as missing, because keep_default_na is set to False and '' is the only value passed through the na_values argument. If you want to learn more, have a look at the pandas docs.

Pandas read_csv is a bit too clever here. The problem is that many strings are commonly used to mark missing values in CSV files.

According to the official documentation:

... By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

So your dataframe does contain a NaN where the file had NA, and fillna fills it as it normally would.
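To see this for yourself, a minimal sketch along these lines may help. It assumes the three sample rows from the question are held in an in-memory string (the names SAMPLE and COLUMNS are only illustrative) and checks which values of chr_vl were parsed as missing under the default settings:

```python
import io

import pandas as pd

# The sample rows from the question, kept in memory so no file is needed.
SAMPLE = "6|test|3|4\n5|test||8\n9|test|NA|12\n"
COLUMNS = ["upc_cd", "chr_typ", "chr_vl", "chr_vl_typ"]

# Default parsing: the empty field and the literal "NA" are both read as NaN.
df_default = pd.read_csv(io.StringIO(SAMPLE), dtype="str", sep="|", names=COLUMNS)

missing_mask = df_default["chr_vl"].isna().tolist()
print(missing_mask)  # [False, True, True]
```

Both the second and the third row are flagged as missing, which is why fillna replaces both of them.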

To accept only the empty string as NaN, you have to set both na_values to '' and keep_default_na to False:

df = pd.read_csv('sample_file.txt', dtype='str', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl','chr_vl_typ'],
                 na_values='', keep_default_na=False)
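
Put together with the fillna step, the whole thing might look like the sketch below. It reads the sample rows from an in-memory string instead of sample_file.txt (an assumption made only to keep the example self-contained), and it assigns the result of fillna back to the column rather than using inplace=True, which is a stylistic choice here and not part of the original script:

```python
import io

import pandas as pd

# The sample rows from the question, kept in memory so no file is needed.
SAMPLE = "6|test|3|4\n5|test||8\n9|test|NA|12\n"
COLUMNS = ["upc_cd", "chr_typ", "chr_vl", "chr_vl_typ"]

# Treat only the empty string as missing; the literal "NA" stays a plain string.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    dtype="str",
    sep="|",
    names=COLUMNS,
    na_values=[""],
    keep_default_na=False,
)

# Fill the genuinely empty field and assign the result back to the column.
df["chr_vl"] = df["chr_vl"].fillna("NOT AVLBL")

result = df["chr_vl"].tolist()
print(result)  # ['3', 'NOT AVLBL', 'NA']
```

Note that keep_default_na=False applies to every column, so strings such as NULL or nan elsewhere in the file will also be kept as text.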
