Python CSV: commas, single quotes and double quotes inside a column

Question

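I am trying to write a CSV file with DictWriter, but a value like this:

2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine

breaks everything. The header is:

"#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link"

The value above should go in the Name column, but this is what I get when I try to write the row: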

OrderedDict([('\ufeff "#"', '756'), ('Index no.', '613-114-00-6'), 
             ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'),
             # most of the following should be the value to 'Name' 
             # `PageId` should be '122039' and 'Link' should be the 'https...' text
             ('Name', "2,2',2-(hexahydro-1"), ('Page ID', '3'), 
             ('Link', '5-triazine-1'), 
             (None, ['3', '5-triyl)triethanol|1', '3', 
                     '5-tris(2-hydroxyethyl)hexahydro-1', '3', 
                     '5-triazine"', '122039',
                     'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039'])

I have tried every possible combination of the DictWriter parameters

quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar='\\'

and nothing helps.

Minimal, complete, and verifiable example

old.csv

"#","Index no.","EC / List no.","CAS no.","Name","Page ID"
"756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039"

Code:

import csv

with open('old.csv') as f, open('new.csv', 'w') as ff:
    reader = csv.DictReader(f)
    result = csv.DictWriter(ff, fieldnames=reader.fieldnames)
    for line in reader:
        result.writerow(line)

Answer 1

Your old.csv is malformed: it does not escape the " properly (nor does it double it):

"756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039"
----------------------------------------------------^ here is the not escaped "

The line should look like this:

"756","613-114-00-6","225-208-0","4719-04-4","2,2',2\"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039","https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039"
----------------------------------------------------^^ escaped "
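To see why the stray quote shifts every later column, the malformed line can be fed straight to csv.reader. This is only a sketch: the line is copied from old.csv in the question, and the names bad, fixed, bad_fields and good_fields are made up for illustration.

```python
import csv

# The malformed line from old.csv: the " after 2,2',2 is neither doubled nor escaped.
bad = ('"756","613-114-00-6","225-208-0","4719-04-4",'
       '"2,2\',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|'
       '1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039"')

# Default (non-strict) parsing: the stray quote ends the quoted section early,
# so every later comma inside the name is treated as a delimiter.
bad_fields = next(csv.reader([bad]))
print(len(bad_fields), repr(bad_fields[4]))

# strict=True makes the reader raise csv.Error instead of guessing.
try:
    next(csv.reader([bad], strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = exc
print("strict mode:", strict_error)

# The same line with the quote escaped parses into the expected six fields,
# provided the reader is told which escape character is in use.
fixed = bad.replace('2"-', '2\\"-')
good_fields = next(csv.reader([fixed], doublequote=False, escapechar='\\'))
print(len(good_fields), repr(good_fields[4]))
```

The first print shows the Name field cut off at 2,2',2-(hexahydro-1, which matches the broken OrderedDict in the question.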

With doublequote=True, a " inside a field has to be doubled: "tata""tata" means tata"tata. Your source data does neither: the quote is not doubled and not escaped.
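As a small illustration of the two conventions, the sketch below writes the same value once with doubling and once with an escape character, then reads each back with matching settings. It uses in-memory buffers instead of files; the variable names are illustrative only.

```python
import csv
import io

value = 'tata"tata'

# doublequote=True (the default): an embedded quote is written twice.
doubled = io.StringIO()
csv.writer(doubled, quoting=csv.QUOTE_ALL).writerow([value])

# doublequote=False plus an escapechar: the quote is prefixed with a backslash.
escaped = io.StringIO()
csv.writer(escaped, quoting=csv.QUOTE_ALL,
           doublequote=False, escapechar='\\').writerow([value])

print(doubled.getvalue().rstrip('\r\n'))
print(escaped.getvalue().rstrip('\r\n'))

# Each form round-trips only when the reader uses the same convention.
row_doubled = next(csv.reader(io.StringIO(doubled.getvalue())))
row_escaped = next(csv.reader(io.StringIO(escaped.getvalue()),
                              doublequote=False, escapechar='\\'))
print(row_doubled, row_escaped)
```

Whichever convention is chosen, the reader and the writer have to agree on it; a file produced with one and read with the other will not give back the original value.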

This works perfectly:

from collections import OrderedDict

fieldn = ["#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link"]
od = OrderedDict(
    [('#', '756'), ('Index no.', '613-114-00-6'), 
     ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'),
     ('Name', '''2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'''), 
     ('Page ID', '122039'), 
     ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')])

print(od)  # see: Input to writer:

import csv 

# write the ordered dict    
with open("file.txt", "w",newline = "") as f:
    writer = csv.DictWriter(f, quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar= '\\', fieldnames=fieldn)
    writer.writeheader()  # remove if you do not want the header in as well
    writer.writerow(od)

# read it back in and print it
with open ("file.txt") as r:
    reader = csv.DictReader(r, quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar= '\\', fieldnames=fieldn)
    for row in reader:
        print(row)        # see Output after reading in written stuff

Input to the writer:

OrderedDict([('#', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), ('Name', '2,2\',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'), ('Page ID', '122039'), ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')])

Output after reading the written file back in (the header row was written too, hence the two rows of output):

OrderedDict([('#', '#'), ('Index no.', 'Index no.'), ('EC / List no.', 'EC / List no.'), ('CAS no.', 'CAS no.'), ('Name', 'Name'), ('Page ID', 'Page ID'), ('Link', 'Link')])
OrderedDict([('#', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), ('Name', '2,2\',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'), ('Page ID', '122039'), ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')])
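The header shows up as a data row because fieldnames was passed to DictReader explicitly, so the first line of the file is not consumed as the header. A possible variant, sketched here with an in-memory buffer in place of file.txt and a hypothetical options dict, leaves fieldnames out so that DictReader takes the keys from the first line:

```python
import csv
import io

fieldn = ["#", "Index no.", "EC / List no.", "CAS no.", "Name", "Page ID", "Link"]
row = {
    "#": "756", "Index no.": "613-114-00-6", "EC / List no.": "225-208-0",
    "CAS no.": "4719-04-4",
    "Name": "2,2',2\"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|"
            "1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine",
    "Page ID": "122039",
    "Link": "https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039",
}

# Shared formatting options so the writer and the reader cannot drift apart.
options = dict(quotechar='"', doublequote=False, escapechar="\\",
               quoting=csv.QUOTE_ALL)

buffer = io.StringIO()  # stands in for file.txt
writer = csv.DictWriter(buffer, fieldnames=fieldn, **options)
writer.writeheader()
writer.writerow(row)

# No fieldnames argument: the first line is read as the header.
buffer.seek(0)
rows = list(csv.DictReader(buffer, **options))
print(len(rows))
print(rows[0]["Name"])
```

With this variant only the data row comes back, and the Name value still contains both the single and the double quote.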

File contents:

"#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link"
"756","613-114-00-6","225-208-0","4719-04-4","2,2',2\"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039","https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039"

Answer 2

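If only the 5th column contains stray double quotes in its data and the other columns are quoted correctly, as shown, you can use a regular expression to capture the columns and rewrite the CSV: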

bad.csv

"#","Index no.","EC / List no.","CAS no.","Name","Page ID"
"756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039"
"756","613-114-00-6","225-208-0","4719-04-4",""Example"","122039"
"756","613-114-00-6","225-208-0","4719-04-4","Another "example" of bad formatting","122039"

test.py

import re
import csv

with open('bad.csv') as fin:
    with open('good.csv','w',newline='') as fout:
        writer = csv.writer(fout)
        for line in fin:
            items = re.match(r'"(.*?)","(.*?)","(.*?)","(.*?)","(.*)","(.*?)"$',line).groups()
            writer.writerow(items)

good.csv

#,Index no.,EC / List no.,CAS no.,Name,Page ID
756,613-114-00-6,225-208-0,4719-04-4,"2,2',2""-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine",122039
756,613-114-00-6,225-208-0,4719-04-4,"""Example""",122039
756,613-114-00-6,225-208-0,4719-04-4,"Another ""example"" of bad formatting",122039
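One way to check the repair is to read the rewritten rows back with csv.reader and confirm that the Name column comes out with its quotes intact. The sketch below does this in memory on two of the bad lines above; the None check is an addition for lines that do not fit the pattern, and all names in it are illustrative.

```python
import csv
import io
import re

# Same pattern as in test.py: only the fifth group is greedy, so it absorbs
# any stray quotes while the other groups stop at the first '","'.
PATTERN = re.compile(r'"(.*?)","(.*?)","(.*?)","(.*?)","(.*)","(.*?)"$')

bad_lines = [
    '"756","613-114-00-6","225-208-0","4719-04-4",""Example"","122039"',
    '"756","613-114-00-6","225-208-0","4719-04-4","Another "example" of bad formatting","122039"',
]

repaired = io.StringIO()  # stands in for good.csv
writer = csv.writer(repaired)
for line in bad_lines:
    match = PATTERN.match(line)
    if match is None:
        # Fail loudly rather than silently dropping a line that does not fit.
        raise ValueError(f"unexpected line format: {line!r}")
    writer.writerow(match.groups())

# Read the repaired data back and pull out the Name column.
repaired.seek(0)
names = [row[4] for row in csv.reader(repaired)]
print(names)
```

This approach relies on exactly one column being damaged; if two columns could contain stray quotes, the pattern can no longer tell where one field ends and the next begins.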
