简体   繁体   中英

How to import csv files in Python containing badly formatted quote marks?

I'm trying to load the following test.csv file:

R1C1    R1C2    R1C3
R2C1    R2C2    R2C3
R3C1    "R3C2   R3C3
R4C1    R4C2    R4C3

... Using this Python script:

import csv


 with open("test.csv") as f:
      for row in csv.reader(f, delimiter='\t'):
          print(row)

The result I got was the following:

['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n']

It turns out that when Python finds a field whose first character is a quotation mark and there is no closing quotation mark, it will include all of the following content as part of the same field.

My question: What is the best approach for all rows in the file to be read properly? Please consider I'm using Python 3.8.5 and the script should be able to read huge files (2gb or more), so memory usage and performance issues should be also considered.

Thanks!

Honestly, if you're dealing with that much data, it'd be best to go in and clean it first. And if possible, fix whatever process is producing your bad data in the first place.

I haven't tested with a large file, but you may just be able to replace " characters as you read lines, assuming there's never a case where they're valid characters:

import csv


with open("test.csv") as f:
    line_generator = (line.replace('"', '') for line in f)
    for row in csv.reader(line_generator, delimiter='\t'):
        print(row)

Output:

['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2', 'R3C3']
['R4C1', 'R4C2', 'R4C3']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM