Is it legitimate to get NaN for an empty string in a pandas DataFrame?

I am reading a csv.gz file from S3 that has a string column with empty values. I read the file using the pandas.read_csv() method:

pandas.read_csv(io.BytesIO(csv_data['Body'].read()), sep='|', compression='gzip',
                engine='python', error_bad_lines=False, warn_bad_lines=True,
                encoding='iso-8859-1',
                escapechar='\\',
                quoting=1)

I get NaN values in the DataFrame instead of empty strings in the string column. A couple of questions:

i) Does NaN apply to columns whose type is object?

ii) Or is NaN only applied to numbers (integers, floats) and not to strings?

Any help would be appreciated. Thanks. Below are the input and the actual output I am getting.

Input:

    "Obj_ID"|"Value"|"TimeStamp"\n
"ID-1"|"val"| "2020-03-12 00:00:00"
"ID-2"|"v"| "2020-03-12 00:00:00"
"ID-3"|"value-3"| "2020-03-12 00:00:00"
"ID-4"|"value-4"| "2020-03-12 00:00:00"
"ID-5"|""| "2020-03-12 00:00:00"

Actual Output:

     Obj_ID    Value               TimeStamp
0   ID-1      val   "2020-03-12 00:00:00"
1   ID-2        v   "2020-03-12 00:00:00"
2   ID-3  value-3   "2020-03-12 00:00:00"
3   ID-4  value-4   "2020-03-12 00:00:00"
4   ID-5      NaN   "2020-03-12 00:00:00"

Desired output, without manipulating the DataFrame afterwards:

     Obj_ID    Value               TimeStamp
0   ID-1      val   "2020-03-12 00:00:00"
1   ID-2        v   "2020-03-12 00:00:00"
2   ID-3  value-3   "2020-03-12 00:00:00"
3   ID-4  value-4   "2020-03-12 00:00:00"
4   ID-5      ''   "2020-03-12 00:00:00"

From the pandas documentation on read_csv:

na_values : scalar, str, list-like, or dict, optional

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', [...]

This explains why the empty string is interpreted as NaN.
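To see the default behavior in isolation, here is a minimal sketch. It assumes a small in-memory sample (the SAMPLE string below is made up for illustration, modeled on the input in the question) instead of the gzip file from S3. It also suggests an answer to question (i): the missing-value marker can show up in a string column, not only in numeric ones.

```python
import csv
import io

import pandas as pd

# Hypothetical two-column sample modeled on the question's input.
SAMPLE = '"Obj_ID"|"Value"\n"ID-4"|"value-4"\n"ID-5"|""\n'

# Default settings: '' is in the default NA list, so the quoted
# empty field in the last row is parsed as a missing value.
df_default = pd.read_csv(io.StringIO(SAMPLE), sep="|", quoting=csv.QUOTE_ALL)

# Per-row flag telling whether "Value" was parsed as missing.
missing_flags = df_default["Value"].isna().tolist()
print(missing_flags)
```

Note that quoting the empty field ("") does not protect it here; the parser still matches it against the default NA strings.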

keep_default_na : bool, default True

Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows: [...]

If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

So just adding keep_default_na=False as a parameter to read_csv should do what you need.
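A minimal sketch of that fix, again assuming a small in-memory sample (the SAMPLE string is hypothetical) rather than the S3 object from the question. The only change relative to a default read_csv call is keep_default_na=False.

```python
import csv
import io

import pandas as pd

# Hypothetical two-column sample modeled on the question's input.
SAMPLE = '"Obj_ID"|"Value"\n"ID-4"|"value-4"\n"ID-5"|""\n'

# keep_default_na=False with no na_values: no strings are parsed as NaN,
# so the empty field stays an empty string.
df_fixed = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="|",
    quoting=csv.QUOTE_ALL,
    keep_default_na=False,
)

values = df_fixed["Value"].tolist()
print(values)
```

Keep in mind that this switches off all of the default NA strings (such as 'NA' or 'NULL') for every column. If only some of them should still count as missing, they can be passed explicitly through na_values alongside keep_default_na=False.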
