Python Pandas Reading CSV file with Specific Line Terminators

I am trying to create a DataFrame from the sample CSV below, which I've been given, but I get "Error tokenizing data. C error: EOF inside string starting at line 0". I haven't had much practice handling bad lines and would really like to learn the best way to deal with something like this. I have tried many different read_csv options, such as error_bad_lines=False, but none of them worked.

CParserError: Error tokenizing data. C error: EOF inside string starting at line 0

I am guessing that the ," at the end of each line is causing the issue, and that the best approach is to loop through the lines and process each one. With help from elsewhere I came up with the generator below and was hoping I am close. I would also really like to learn how to use a generator and yield for this.

Sample data:

"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","

I have written the code below, which strips the trailing ," and the newline from each line, but I am not sure how to produce a DataFrame from the resulting lines:

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            yield line[:-3]

gen = big_table_generator('../data/test_sun_file.csv')
pd.DataFrame(gen)

I had a similar error and fixed it by passing quoting=csv.QUOTE_NONE to read_csv. Here csv is the standard-library module, so it needs an import csv first.

For example:

df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

There is some background on why this works in the second comment here: https://github.com/pydata/pandas/issues/5500
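Applied to the comma-separated sample from the question, that option might look like the sketch below. It is only an illustration under a few assumptions: the two sample rows are inlined through io.StringIO rather than read from the real file, dtype=str is added so every column stays text, and the names SAMPLE, raw and df are made up for the example. With QUOTE_NONE the quote characters are kept as data, so they are stripped afterwards and the last column, which holds only the stray ", is dropped.

```python
import csv
import io

import pandas as pd

# The two sample rows from the question, inlined for illustration.
SAMPLE = (
    '"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","\n'
    '"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","\n'
)

# QUOTE_NONE makes the parser treat '"' as an ordinary character, so the
# unmatched quote at the end of each line no longer opens a quoted field.
raw = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    quoting=csv.QUOTE_NONE,
    dtype=str,
)

# Drop the last column (just the stray '"') and strip the quotes
# that QUOTE_NONE left around every remaining value.
df = raw.iloc[:, :-1].apply(lambda col: col.str.strip('"'))
```

One caveat with this route: because quotes are no longer special, a comma inside a quoted field would be treated as a separator. The sample rows contain none, but the full file might.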

Here is the solution I came up with. I really wanted to avoid building a list with append and take advantage of a generator instead, but I am not yet comfortable enough working with generators.

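import csv

import numpy as np
import pandas as pd

def parse_file(filename):

    rows = []

    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(filename, newline='') as f:
        # QUOTE_NONE treats '"' as an ordinary character, so the unmatched
        # quote at the end of each line does not break the parser
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # drop the last field (the stray '"'), strip the quotes from the
            # rest, and turn empty strings into NaN
            rows.append([s.strip('"') or np.nan for s in row[:-1]])
    # dtype=object keeps the values as they are instead of inferring types
    return pd.DataFrame(rows, dtype=object)

df = parse_file(filename)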
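For the generator version asked about, one possible sketch is below. It assumes every data line ends with the stray ," shown in the sample and that the rest of the line is ordinary quoted CSV. The names clean_lines, iter_rows, SAMPLE and rows are hypothetical, and the sample rows are inlined through io.StringIO instead of being read from the real file. The first generator removes the bad terminator lazily, and csv.reader, which accepts any iterable of strings, then parses the cleaned lines with normal quote handling.

```python
import csv
import io


def clean_lines(lines):
    """Yield each non-empty line without its newline and trailing ',"'."""
    for line in lines:
        line = line.rstrip('\r\n')
        if line.endswith(',"'):
            line = line[:-2]  # remove the stray terminator
        if line:
            yield line


def iter_rows(lines):
    """Yield one list of unquoted fields per cleaned line."""
    # csv.reader pulls lines from the generator one at a time,
    # so nothing is accumulated in an intermediate list here.
    yield from csv.reader(clean_lines(lines))


# The two sample rows from the question, inlined for illustration.
SAMPLE = (
    '"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","\n'
    '"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","\n'
)

rows = list(iter_rows(io.StringIO(SAMPLE)))
```

With a real file, something like pd.DataFrame(iter_rows(f)) inside a with open(filename, newline='') as f: block should build the frame directly from the generator. Unlike the QUOTE_NONE approach, this keeps normal quote handling, so a comma inside a quoted field would stay within its field.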
