
Distinguish between "" and an empty value when reading a CSV file using Python

My CSV file contains rows such as "","ab,abc",,"abc". Note that the empty value ,, means an unknown value. This is different from "", which means a value has not been set yet. I treat these two cases differently, so I need a way to read both "" and the empty value ,, and tell them apart. I map the data to numbers such that "" becomes 0 and ,, becomes NaN.

I do not have a parsing issue: a field such as "ab,abc" is parsed just fine with comma as the delimiter. The issue is that Python reads both "" and the empty value ,, as the empty string ''. These two values are not the same and should not both collapse into the empty string.

In addition, I need to write the CSV file back so that "" is written as "" (not as ,,) and NaN is written as ,, (an empty value).
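For the writing side, one possible approach is to format each field by hand instead of relying on csv.writer. The following is only a minimal sketch under stated assumptions: the names format_field and format_row are hypothetical helpers, an empty Python string stands for "not set yet", and None or NaN stands for "unknown".

```python
import math

def format_field(value, delimiter=','):
    """Render one value as a CSV field (hypothetical helper).

    Assumption: '' means "not set" and is written as "",
    while None/NaN means "unknown" and is written as nothing.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''                      # unknown -> empty, unquoted field
    text = str(value)
    if text == '':
        return '""'                    # not set -> explicitly quoted empty field
    if any(ch in text for ch in (delimiter, '"', '\r', '\n')):
        # Quote fields containing special characters, doubling inner quotes.
        return '"' + text.replace('"', '""') + '"'
    return text

def format_row(values, delimiter=','):
    """Join formatted fields into one CSV line (without a line terminator)."""
    return delimiter.join(format_field(v, delimiter) for v in values)

print(format_row(['', 'ab,abc', float('nan'), 'abc']))
```

Each line returned by format_row can be written to a file followed by a newline. Fields that need no quoting, such as abc, are left unquoted here.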

I have looked into the csv Dialect parameters such as doublequote, escapechar, quotechar and quoting. They are NOT what I want. They all deal with cases where the delimiter or quote character appears within the data, i.e. "ab,abc", and as I mentioned, parsing special characters is not the issue.

I don't want to use Pandas. The only thing I can think of is a regex, but that adds overhead if I have millions of lines to process.

The behaviour I want is this:

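import numpy as np

mapped = {}
a = '""'  # could also be '' or 'ab,abc'
if a == '""':
    mapped[0] = 0        # quoted empty field: value not set yet
elif a == '':
    mapped[0] = np.nan   # unquoted empty field: unknown value
else:
    mapped[0] = a        # ordinary string, kept as-is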

My csv reader is as follows:

import csv

# newline='' is what the csv module documentation recommends for file objects
with open(filepath, 'r', newline='') as f:
    csvreader = csv.reader(f)
    for row in csvreader:
        print(row)

I want the above behaviour when reading CSV files, though. Currently only two kinds of value are read: '' (the empty string) or an actual string such as 'ab,abc'.

I want three different values to be read: '' (the empty string), '""' (a string of two double quotes), and an actual string such as 'ab,abc'.

Looking through the csv module in the CPython source (search for IN_QUOTED_FIELD), it doesn't keep any internal state that would let you do this. For example, parsing:

"a"b"c"d

gives 'ab"c"d', which might not be what you expect. E.g.:

import csv
from io import StringIO

[row] = csv.reader(StringIO(
    '"a"b"c"d'))

print(row)  # ['ab"c"d']

Specifically, quotes are only handled specially at the beginning of fields. After that, all characters are just added to the field as they are encountered, and nothing allows any special behaviour to be triggered when "un-quoting" a field.
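Since the csv module does not expose whether a field was quoted, one alternative is a small hand-written splitter that records that flag per field. The following is only a sketch under stated assumptions: split_line and map_field are hypothetical names, each record fits on one line (no newlines inside quoted fields), and a doubled quote inside a quoted field stands for one literal quote. It will also be slower than the C-based csv module on millions of lines.

```python
import math

def split_line(line, delimiter=',', quotechar='"'):
    """Split one CSV line into (text, was_quoted) pairs (hypothetical helper)."""
    fields = []
    buf = []
    quoted = False       # did the current field start with a quote?
    in_quotes = False    # are we currently inside the quoted part?
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == quotechar:
                if i + 1 < n and line[i + 1] == quotechar:
                    buf.append(quotechar)   # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False       # closing quote
            else:
                buf.append(ch)
        elif ch == quotechar and not buf and not quoted:
            quoted = True                   # quote at the start of a field
            in_quotes = True
        elif ch == delimiter:
            fields.append((''.join(buf), quoted))
            buf, quoted = [], False
        else:
            buf.append(ch)
        i += 1
    fields.append((''.join(buf), quoted))
    return fields

def map_field(text, was_quoted):
    """Apply the mapping from the question: "" -> 0, empty -> NaN."""
    if text == '':
        return 0 if was_quoted else math.nan
    return text

line = '"","ab,abc",,"abc"'
print([map_field(t, q) for t, q in split_line(line)])
```

When reading from a file, each line would need its trailing newline stripped (for example with rstrip('\r\n')) before being passed to split_line.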

The solution I came up with is this:

Change the input file so that quoted strings use the backslash '\\' as the quote character instead of the double quote. Below is the input file:

col1,col2,col3
"",a,b
\cde \,f,g
,h,i
\j,kl\,mno,p

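Then the double-quoted empty field and the unquoted empty field become distinguishable, because "" is no longer treated as quoting and is read as the literal string '""':

csvreader = csv.reader(f, quotechar='\\')
for row in csvreader:
    print(row)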

That's my best solution so far...
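As a self-contained illustration of the quotechar='\\' idea, the sample file above can be fed through io.StringIO and then mapped as described in the question. This is only a sketch: SAMPLE, to_value, rows and mapped are names introduced here for illustration, and it assumes no real field ever consists of exactly two double quotes.

```python
import csv
import math
from io import StringIO

# Same content as the sample input file, with backslash as the quote character.
SAMPLE = (
    'col1,col2,col3\n'
    '"",a,b\n'
    '\\cde \\,f,g\n'
    ',h,i\n'
    '\\j,kl\\,mno,p\n'
)

def to_value(field):
    """Map a parsed field: '""' -> 0, '' -> NaN, anything else unchanged."""
    if field == '""':
        return 0
    if field == '':
        return math.nan
    return field

rows = list(csv.reader(StringIO(SAMPLE), quotechar='\\'))
mapped = [[to_value(field) for field in row] for row in rows[1:]]  # skip header

for row in mapped:
    print(row)
```

With the double quote no longer acting as the quote character, the first data row keeps '""' as a literal two-character string, while the row starting with a bare comma yields ''.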

