Python - Split a row into columns - CSV data

Question

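I am trying to read data from a CSV file and split each row into its columns.

However, my regular expression fails when a column itself contains commas.

For example: a,b,c,"d,e, g,",f

The result I want is: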

a    b    c    "d,e, g,"    f

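That is 5 columns.

This is the regular expression I use to split the string on commas:

,(?=(?:"[^"]?(?:[^"])*))|,(?=[^"]+(?:,)|,+|$)

It works for other rows, but it fails on this one.

What I am looking for is this: when I read data from a CSV file into a dataframe / rdd with pyspark, I want to load and keep all the columns without any errors.

Thanks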

Answer 1

This is easier with the help of the newer regex module:

import regex as re

string = 'a,b,c,"d,e, g,",f'
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,')

parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f']

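It supports the (*SKIP)(*FAIL) mechanism, which in this example ignores everything between double quotes, so only commas outside quotes act as split points.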
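If installing the regex module is not an option, the same idea can be approximated with the standard re module, which has no (*SKIP)(*FAIL). The sketch below is an illustration, not part of the original answer: the function name split_outside_quotes is hypothetical, and it assumes quoted fields contain no escaped quotes. The pattern consumes each quoted run whole and captures only the commas that are left over.

```python
import re

# Either a complete quoted run (consumed, never a split point)
# or a bare comma, which is captured in group 1.
_PATTERN = re.compile(r'"[^"]*"|(,)')

def split_outside_quotes(line):
    """Split on commas that are not inside double quotes (hypothetical helper)."""
    parts = []
    start = 0
    for match in _PATTERN.finditer(line):
        if match.group(1) is not None:  # a comma outside any quoted run
            parts.append(line[start:match.start()])
            start = match.end()
    parts.append(line[start:])  # the last field has no trailing comma
    return parts

print(split_outside_quotes('a,b,c,"d,e, g,",f'))
# ['a', 'b', 'c', '"d,e, g,"', 'f']
```

As with the regex version, the surrounding quotes are kept in the field; strip them afterwards if they are not wanted.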

If your data contains escaped double quotes, you can use:

import regex as re

string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''
rx = re.compile(r'".*?(?<!\\)"(*SKIP)(*FAIL)|,')

parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f', ' this', ' one', ' with "escaped "double",quotes:""']

See a demo of the latter on regex101.com.

With this answer at nearly 50 points, I feel I should provide the csv approach as well:

import csv

string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''

# just make up an iterable, normally a file would go here
for row in csv.reader([string]):
    print(row)
    # ['a', 'b', 'c', 'd,e, g,', 'f', ' this', ' one', ' with "escaped "double"', 'quotes:""']
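When quotes inside a quoted field really are escaped with a backslash, the csv module can be told about it through its escapechar parameter. The following is a small sketch under that assumption; the sample line is made up for illustration and is not from the original answer.

```python
import csv

# Raw string, so the backslashes really are part of the data.
line = r'a,"x \"y\", z",b'

# escapechar makes the reader treat \" as a literal quote inside a field.
row = next(csv.reader([line], escapechar='\\'))
print(row)
# ['a', 'x "y", z', 'b']
```

Here the enclosing quotes are removed and the embedded comma stays inside the second field, so the line yields three columns.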

Answer 2

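Try \,(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) .

I used this answer, which explains how to match everything that is not inside quotes while ignoring escaped quotes, and http://regexr.com/ for testing.

Note that, as the other answers to this question state, there are better ways to parse CSV than using regular expressions.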

Answer 3

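You cannot easily parse CSV files with regular expressions.

My toolkit of choice for working with CSV from the Unix command line is csvkit, which you can get from https://csvkit.readthedocs.io. It also has a Python library.

The Python documentation for the standard csv library is at: https://docs.python.org/2/library/csv.html

There is an extensive discussion of parsing CSV with regular expressions here: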

https://softwareengineering.stackexchange.com/questions/166454/can-the-csv-format-be-defined-by-a-regex

This is a well-trodden path and the libraries are good, so you should not roll your own code.
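As a rough illustration of the library route, here is a minimal sketch that feeds a few lines through the standard csv module and checks that every row yields the same number of columns. The sample data and the use of io.StringIO in place of a real file are assumptions made for the example.

```python
import csv
import io

# Stand-in for an open file; the second row is invented for illustration.
data = 'a,b,c,"d,e, g,",f\n1,2,3,"x,y",4\n'

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(len(row), row)
# 5 ['a', 'b', 'c', 'd,e, g,', 'f']
# 5 ['1', '2', '3', 'x,y', '4']
```

For the pyspark case in the question, the DataFrame CSV reader has quote and escape options that serve the same purpose; check the documentation for the Spark version in use.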

Solution 1
3 Accepted 2016-08-09 16:10:38

Solution 2
3 2016-08-09 16:11:44

Solution 3
3 2016-08-09 16:13:31
