
How to import csv files in Python containing badly formatted quote marks?

I'm trying to load the following test.csv file:

R1C1    R1C2    R1C3
R2C1    R2C2    R2C3
R3C1    "R3C2   R3C3
R4C1    R4C2    R4C3

... using this Python script:

import csv


with open("test.csv") as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row)

The result I got was the following:

['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n']

It turns out that when Python's csv reader finds a field whose first character is a quotation mark and there is no closing quotation mark, it treats everything that follows, line breaks included, as part of the same field.

My question: What is the best way to read every row in the file correctly? Please note that I'm using Python 3.8.5 and the script must be able to read huge files (2 GB or more), so memory usage and performance also matter.

Thanks!

Honestly, if you're dealing with that much data, it'd be best to go in and clean it first. And if possible, fix whatever process is producing your bad data in the first place.
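A one-off cleaning pass could look something like the sketch below. It copies the file line by line, so memory use should stay flat no matter how big the file is. The function name `strip_quotes` and its parameters are made up for illustration, and it assumes a double quote is never legitimate data in your file.

```python
def strip_quotes(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, dropping every double-quote character.

    Hypothetical helper: assumes '"' never appears as valid data.
    Only one line is held in memory at a time.
    """
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(line.replace('"', ""))
```

You would call it once, e.g. `strip_quotes("test.csv", "test_clean.csv")` (the second filename is just an example), and then point your normal csv-reading script at the cleaned copy.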

I haven't tested this with a large file, but you may be able to simply strip the " characters as you read each line, assuming there's never a case where they're valid characters:

import csv


with open("test.csv") as f:
    line_generator = (line.replace('"', '') for line in f)
    for row in csv.reader(line_generator, delimiter='\t'):
        print(row)

Output:

['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2', 'R3C3']
['R4C1', 'R4C2', 'R4C3']
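If the quote characters need to survive, another option that may be worth trying is `quoting=csv.QUOTE_NONE`, which tells the reader not to treat `"` specially at all. The sketch below uses an in-memory `io.StringIO` copy of the sample data (my assumption, so that it runs without the file); with a real file you would pass the open file object instead and loop over the reader rather than building a list.

```python
import csv
import io

# In-memory stand-in for test.csv (tab-delimited, one stray quote on row 3).
sample = (
    'R1C1\tR1C2\tR1C3\n'
    'R2C1\tR2C2\tR2C3\n'
    'R3C1\t"R3C2\tR3C3\n'
    'R4C1\tR4C2\tR4C3\n'
)

# QUOTE_NONE: the reader does no special processing of quote characters,
# so an unmatched '"' cannot swallow the following lines.
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)

# Collecting into a list is only for this tiny sample; iterate directly
# over the reader when the input is large.
rows = list(reader)
for row in rows:
    print(row)
```

Note that with this approach the stray quote stays in the data, so the third row comes back with '"R3C2' as its second field rather than 'R3C2'.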


 