CSV file with random double quotes

I have a CSV file that has a double quote character in some fields. When I parse it with Python, the reader starts ignoring the delimiter between these quotes. For instance:

5695|258|03/21/2012| 15:16:02.000|info|Microsoft-Windows-Defrag|shrink estimation, (C:)|36|"6ybSr: c{q6: |Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
5770|258|03/24/2012| 04:21:02.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|00 00 00 00 d3 03 00 00 ae 03 00 00 00 00 00 00 22 b6 30 df 64 79 c7 f6 e2 6c 1c 00 00 00 00 00 00 00 00 00|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
5843|258|03/27/2012| 07:38:36.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|jbg54t5t"gfb:*&hgfh|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx

As a result, it reads everything between the two double quotes as a single field:

5695|258|03/21/2012| 15:16:02.000|info|Microsoft-Windows-Defrag|shrink estimation, (C:)|36|"6ybSr: c{q6: |Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
                                                                                           ^

5770|258|03/24/2012| 04:21:02.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|00 00 00 00 d3 03 00 00 ae 03 00 00 00 00 00 00 22 b6 30 df 64 79 c7 f6 e2 6c 1c 00 00 00 00 00 00 00 00 00|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx

5843|258|03/27/2012| 07:38:36.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|jbg54t5t"gfb:*&hgfh|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
                                                                                                   ^

(See the carets (^) in the example above.)

How do I get it to ignore the double quote?

CAVEAT: I do not want to read the entire file into RAM and replace the character. The solution must work while iterating through rows from the reader.

The delimiter is the pipe character (|). I read the file using standard CSV techniques and decode each field with a known encoding:

import csv
known_encoding = 'utf-8'  # for mwe, real code fetches for each file

with open(self.current_file.file_path, 'rb') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        row = [s.decode(known_encoding) for s in row]
        # do stuff with data in row

Your CSV file never contains quoted fields, so you can turn quote handling off with the quoting parameter:

csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
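
To see why this helps, here is a minimal Python 3 sketch that parses the same text twice, once with the default quoting and once with csv.QUOTE_NONE. The SAMPLE rows are shortened, made-up stand-ins for the data in the question, and the parse helper is a hypothetical name used only for this illustration.

```python
import csv
import io

# Shortened, made-up rows modelled on the question's data.
# The first and third rows each contain one stray double quote.
SAMPLE = (
    '5695|36|"6ybSr: c{q6: |Application\n'
    '5770|36|00 00 22 b6|Application\n'
    '5843|36|jbg54t5t"gfb:*&hgfh|Application\n'
)


def parse(text, **reader_kwargs):
    """Parse pipe-delimited text and return all rows as lists of fields."""
    return list(csv.reader(io.StringIO(text), delimiter='|', **reader_kwargs))


# Default quoting: a quote at the start of a field opens a quoted field,
# so delimiters and newlines are swallowed until the next quote.
default_rows = parse(SAMPLE)

# QUOTE_NONE: the quote character gets no special treatment.
fixed_rows = parse(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(fixed_rows))
for row in fixed_rows:
    print(row)
```

With quoting disabled, every line yields its own row and the double quotes stay in the field values as ordinary characters.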

You can set quoting to csv.QUOTE_NONE like so:

import csv

with open('my_file', 'r') as f:
    csvreader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csvreader:
        pass  # process each row here
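
The question's code opens the file in binary mode and decodes each field by hand, which is the Python 2 style. Under Python 3, a rough equivalent that still streams row by row might look like the sketch below. It assumes the encoding is known up front, as in the question; iter_rows and the temporary demo file are hypothetical names introduced only for this example.

```python
import csv
import os
import tempfile


def iter_rows(path, encoding='utf-8'):
    """Yield one list of fields per line without loading the whole file."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, 'r', encoding=encoding, newline='') as f:
        reader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


# Demo on a small throwaway file containing stray double quotes.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write('1|a"b|x\n2|"c|y\n')
    demo_path = tmp.name

try:
    rows = list(iter_rows(demo_path))
finally:
    os.remove(demo_path)

print(rows)
```

Because iter_rows is a generator, only the current line is held in memory, which matches the requirement that the fix work while iterating through the reader.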
