How to change weird CSV delimiter?
I have a CSV file that I can't open in Excel.
The CSV delimiter is |~|, and at the end of a row it is |~~|.
I have some sample data:
Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|
Where the header part is: Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|
And the data/row part is: International Business|~|MB|~|MB|~|ED|~~|
I need to find out how to convert this CSV file into a normal comma-separated values file using a Python script.
You can use the built-in csv module together with the str.split() method:
import csv

content = """Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|"""
# Or read its content from a file

# newline='' lets the csv module control the line endings
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    lines = content.split('|~~|')
    for line in lines:
        if not line:
            # The trailing '|~~|' leaves an empty last piece; skip it
            continue
        csv_row = line.split('|~|')
        writer.writerow(csv_row)

It will output a file named output.csv:

Education,Name_Dutch,Name_English,Faculty
International Business,MB,MB,ED
When dealing with a CSV file, I would prefer using the csv module instead of doing .replace('|~|', ',') because the csv module has built-in support for special characters such as , (a field that contains a comma is quoted automatically).
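To illustrate the point about special characters, here is a minimal sketch. The row below is a made-up value (it is not part of the question's sample data) whose first field contains a comma; the variable names are only for illustration.

```python
import csv
import io

# Hypothetical row whose first field contains a comma (not from the original data).
raw = "Business, International|~|MB|~|MB|~|ED"

# Naive replacement: the embedded comma becomes indistinguishable from a separator.
naive = raw.replace("|~|", ",")

# csv.writer quotes the field that contains a comma, so the row keeps 4 columns.
buffer = io.StringIO()
csv.writer(buffer).writerow(raw.split("|~|"))
quoted = buffer.getvalue()

print(naive)            # Business, International,MB,MB,ED
print(quoted.rstrip())  # "Business, International",MB,MB,ED
```

Reading the naive line back with csv.reader gives five fields instead of four, while the quoted line round-trips to the original four.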
The custom delimiters you mention seem to be unique enough, so you can just do a string replace on them. Then just write out the file. The reading and writing section of the Python tutorial has all the details you need:
https://docs.python.org/2/tutorial/inputoutput.html
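A minimal sketch of that replace-and-write approach might look like this. The function names and file arguments are hypothetical, and it assumes no field contains a comma, a double quote, or a newline, since plain replacement does no quoting.

```python
def replace_delimiters(text):
    """Turn '|~~|' into newlines and '|~|' into commas."""
    # Handle the row terminator first, then the field delimiter.
    return text.replace('|~~|', '\n').replace('|~|', ',')


def convert_file(in_name, out_name):
    """Read the whole input file, replace the delimiters, write the result."""
    with open(in_name, newline='') as infile:
        text = infile.read()
    with open(out_name, 'w', newline='') as outfile:
        outfile.write(replace_delimiters(text))


sample = ('Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|'
          'International Business|~|MB|~|MB|~|ED|~~|')
converted = replace_delimiters(sample)
```

Calling convert_file('weird.csv', 'fixed.csv') (both names are placeholders) would apply the same replacement to a file on disk.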
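import csv

in_name = 'your_input_name.csv'
outname = 'your_output_name.csv'

with open(in_name, newline='') as csvfile:
    # Treat '~' as the delimiter and '|' as the quote character
    csvreader = csv.reader(csvfile, delimiter='~', quotechar='|')
    with open(outname, "w", newline='') as outfile:
        csvwriter = csv.writer(outfile, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
        for row in csvreader:
            line = []
            for item in row:
                if item != "":
                    # The very first field has no opening '|', so its
                    # closing '|' is kept as text; strip it
                    line.append(item.strip('|'))
                elif line:
                    # An empty item marks '~~' (end of a row);
                    # repeated empty items are skipped
                    csvwriter.writerow(line)
                    line = []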
As csv.reader doesn't recognize "~~" as end of line, it converts it to "", so for csv.writer we repeatedly build up a part of the list (obtained from csv.reader) until "" is reached.
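The grouping step can be looked at in isolation. This is only a sketch: split_on_empty is a hypothetical helper name, and the flat list is written by hand to resemble the kind of list described above rather than captured from an actual csv.reader run.

```python
def split_on_empty(items):
    """Yield sub-lists of items, cutting at every empty string."""
    row = []
    for item in items:
        if item != "":
            row.append(item)
        elif row:
            # An empty string marks the end of a row; repeated empties are skipped.
            yield row
            row = []
    if row:
        # Emit a final row that was not followed by an empty marker.
        yield row


# Hand-written flat list with "" where '~~' appeared in the input.
flat = ["Education", "Name_Dutch", "Name_English", "Faculty", "",
        "International Business", "MB", "MB", "ED", ""]
rows = list(split_on_empty(flat))
```

One limitation of this idea is that a genuinely empty field in the data is indistinguishable from an end-of-row marker, so it would cut the row short.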
If the file is small, you can simply read its entire contents into memory, replace all the weird delimiters found, and then write a new version of it back out.
However, if the file is large or you just want to conserve memory, it's also possible to read the file incrementally, a single character at a time, and accomplish what needs to be done.
The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called."
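As a quick illustration of that sentence, anything that hands out one string per step will do, not just a file object. The sketch below uses a made-up generator, sample_lines, with hard-coded lines. Note that the quoted wording comes from the Python 2 documentation; in Python 3 the iterator method is spelled __next__().

```python
import csv


def sample_lines():
    """Yield CSV text one line at a time, as a file object would."""
    yield "Education,Name_Dutch,Name_English,Faculty\n"
    yield "International Business,MB,MB,ED\n"


# csv.reader pulls strings from the generator and parses each into a list.
rows = list(csv.reader(sample_lines()))
```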
This means the "object" can be a generator function or a generator expression. In the code below I've implemented a simple FSM (Finite State Machine) to parse the oddly formatted file and yield each line of output it detects. It may seem like a lot of code, but it operates very simply, so it should be relatively easy to understand how it works:
import csv

def weird_file_reader(filename):
    """Generator that opens and produces "lines" read from the file while
    translating the sequences of '|~|' to ',' and '|~~|' to '\n' (newlines).
    """
    state = 0  # number of characters of a possible delimiter seen so far
    line = []
    # Text mode so that each character read is a str (Python 3)
    with open(filename, 'r', newline='') as weird_file:
        while True:
            ch = weird_file.read(1)  # read one character
            if not ch:  # end-of-file?
                if line:  # partial line read?
                    yield ''.join(line)
                break
            if state == 0:  # ordinary text
                if ch == '|':
                    state = 1
                else:
                    line.append(ch)
                    #state = 0  # unnecessary
            elif state == 1:  # seen '|'
                if ch == '~':
                    state = 2
                else:
                    line.append('|'+ch)
                    state = 0
            elif state == 2:  # seen '|~'
                if ch == '|':
                    line.append(',')
                    state = 0
                elif ch == '~':
                    state = 3
                else:
                    line.append('|~'+ch)
                    state = 0
            elif state == 3:  # seen '|~~'
                if ch == '|':
                    line.append('\n')
                    yield ''.join(line)
                    line = []
                    state = 0
                else:
                    line.append('|~~'+ch)
                    state = 0
            else:
                raise RuntimeError("Can't happen")

# newline='' lets the csv module control the line endings
with open('fixed.csv', 'w', newline='') as outfile:
    reader = csv.reader((line for line in weird_file_reader('weird.csv')))
    writer = csv.writer(outfile)
    writer.writerows(reader)

print('done')
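For comparison, here is a different sketch of the incremental idea. It is not the FSM above: it reads fixed-size chunks and splits on the delimiters with str.split. The name iter_weird_rows and its parameters are hypothetical, and the sample is fed from an in-memory io.StringIO rather than a real file. Because it yields lists of fields, csv.writer can quote any field that happens to contain a comma.

```python
import csv
import io


def iter_weird_rows(stream, chunk_size=4096, field_sep='|~|', row_sep='|~~|'):
    """Yield each row of the stream as a list of fields, reading in chunks."""
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Pieces before the last row separator are complete rows; the last
        # piece may be incomplete, so it stays in the buffer.
        *complete, buffer = buffer.split(row_sep)
        for row in complete:
            yield row.split(field_sep)
    if buffer.strip():
        # Trailing data that was not followed by a row separator.
        yield buffer.split(field_sep)


sample = io.StringIO(
    'Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|'
    'International Business|~|MB|~|MB|~|ED|~~|'
)
out = io.StringIO()
# A tiny chunk size forces delimiters to straddle chunk boundaries.
csv.writer(out).writerows(iter_weird_rows(sample, chunk_size=7))
```

With a real file, the stream would be something like open('weird.csv', newline=''), and the writer would target a file opened with newline='' as well. Memory use is bounded by the chunk size plus the longest row rather than by the whole file.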
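Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.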