How to change weird CSV delimiter?

I have a CSV file that I can't open in Excel.

The CSV delimiter is |~| and at the end of a row it is |~~|.

I have some sample data:

Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|

Where the header part is: Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|

And the data/row part is: International Business|~|MB|~|MB|~|ED|~~|

I need to find out how to convert this CSV file into a normal comma-separated file using a Python script.

You can use the built-in csv module together with str.split():

import csv

content = """Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|International Business|~|MB|~|MB|~|ED|~~|"""

# Or read its content from a file

# newline='' lets the csv module control the line endings
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in content.split('|~~|'):
        line = line.strip()
        if not line:
            # Skip the empty piece left after the final |~~|
            continue
        writer.writerow(line.split('|~|'))

This writes a file named output.csv:

Education,Name_Dutch,Name_English,Faculty
International Business,MB,MB,ED

When dealing with a CSV file, I prefer using the csv module instead of doing .replace('|~|', ',') because the csv module has built-in support for special characters such as the comma (,).
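To illustrate the point about special characters, here is a minimal sketch. The record below is a made-up example (it is not part of the question's data) in which the first field contains a comma.

```python
import csv
import io

# Hypothetical record whose first field contains a comma.
record = "Business, International|~|MB|~|MB|~|ED"
fields = record.split('|~|')

# Naive replacement: the embedded comma now looks like a field separator.
naive = record.replace('|~|', ',')

# csv.writer quotes the field that contains the comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
quoted = buffer.getvalue()

print(naive)   # Business, International,MB,MB,ED
print(quoted)  # "Business, International",MB,MB,ED
```

Reading the quoted line back with csv.reader gives the original four fields, while the naive line would be read as five.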

The custom delimiters you mention seem to be unique enough that you can just do a string replace on them and then write out the file. The reading and writing section of the Python tutorial has all the details you need: https://docs.python.org/2/tutorial/inputoutput.html
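A minimal sketch of that replace-based idea follows. The helper name convert_text and the file names in the commented usage are assumptions made for illustration, and the sketch assumes no field contains a comma or a quote character.

```python
def convert_text(text):
    """Replace the custom delimiters with a newline and a comma."""
    # '|~~|' ends a row, '|~|' separates fields.
    return text.replace('|~~|', '\n').replace('|~|', ',')

sample = ("Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|"
          "International Business|~|MB|~|MB|~|ED|~~|")
print(convert_text(sample))

# Possible file-based usage (hypothetical file names):
# with open('weird.csv') as src, open('fixed.csv', 'w') as dst:
#     dst.write(convert_text(src.read()))
```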

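import csv

in_name = 'your_input_name.csv'
outname = 'your_output_name.csv'

with open(in_name, newline='') as csvfile:
    # Treat '~' as the delimiter and '|' as the quote character
    csvreader = csv.reader(csvfile, delimiter='~', quotechar='|')
    with open(outname, "w", newline='') as outfile:
        csvwriter = csv.writer(outfile, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
        for row in csvreader:
            line = []
            for item in row:
                # The first field has no opening '|', so its closing '|' is kept
                # as data; remove it along with any stray line breaks
                item = item.strip('|\r\n')
                if item:
                    line.append(item)
                elif line:
                    # An empty item marks the "~~" row terminator
                    csvwriter.writerow(line)
                    line = []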

Because csv.reader does not recognize "~~" as the end of a line, it returns an empty string ("") at that position. For csv.writer we therefore collect the items obtained from csv.reader into a list until "" is reached, and then write that list as one row.
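To see what csv.reader actually produces with these settings, the sample data from the question can be parsed from an in-memory string. This is only an illustrative sketch; io.StringIO stands in for the input file.

```python
import csv
import io

sample = ("Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|"
          "International Business|~|MB|~|MB|~|ED|~~|")

# Parse with '~' as the delimiter and '|' as the quote character.
row = next(csv.reader(io.StringIO(sample), delimiter='~', quotechar='|'))

for item in row:
    print(repr(item))
```

Each "~~" shows up as an empty string between the fields of two records. Note that the very first field comes back as 'Education|', because it has no opening '|' and so is not treated as a quoted field.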

If the file is small, you can simply read its entire contents into memory, replace all the weird delimiters found, and then write a new version of it back out.

However, if the file is large or you just want to conserve memory, it's also possible to read the file incrementally, a single character at a time, and accomplish what needs to be done.

The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called."

This means the "object" can be a generator function or a generator expression. In the code below I've implemented a simple FSM (Finite State Machine) to parse the oddly formatted file and yield each line of output it detects. It may seem like a lot of code, but it operates very simply, so it should be relatively easy to understand how it works:

import csv

def weird_file_reader(filename):
    """Generator that opens and produces "lines" read from the file while
       translating the sequences of '|~|' to ',' and '|~~|' to newlines.
    """
    state = 0  # number of delimiter characters matched so far
    line = []
    # Text mode, so that each character read is a str (Python 3)
    with open(filename, 'r', newline='') as weird_file:
        while True:
            ch = weird_file.read(1)  # read one character
            if not ch:  # end-of-file?
                if line:  # partial line read?
                    yield ''.join(line)
                break
            if state == 0:  # nothing matched yet
                if ch == '|':
                    state = 1
                else:
                    line.append(ch)
            elif state == 1:  # seen '|'
                if ch == '~':
                    state = 2
                else:
                    line.append('|'+ch)
                    state = 0
            elif state == 2:  # seen '|~'
                if ch == '|':  # '|~|' is a field delimiter
                    line.append(',')
                    state = 0
                elif ch == '~':
                    state = 3
                else:
                    line.append('|~'+ch)
                    state = 0
            elif state == 3:  # seen '|~~'
                if ch == '|':  # '|~~|' is the end of a row
                    line.append('\n')
                    yield ''.join(line)
                    line = []
                    state = 0
                else:
                    line.append('|~~'+ch)
                    state = 0
            else:
                raise RuntimeError("Can't happen")

# newline='' lets the csv module control the line endings
with open('fixed.csv', 'w', newline='') as outfile:
    reader = csv.reader(weird_file_reader('weird.csv'))
    writer = csv.writer(outfile)
    writer.writerows(reader)

print('done')
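As a variation on the incremental idea, the file can also be read in fixed-size chunks and split on the row terminator, which avoids one read call per character. This is only a sketch under stated assumptions: the name iter_records and the chunk_size parameter are hypothetical, and io.StringIO stands in for a file opened in text mode.

```python
import csv
import io

def iter_records(stream, chunk_size=4096):
    """Yield lists of fields from a text stream that uses '|~|' between
    fields and '|~~|' at the end of each record."""
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last terminator is complete; the tail
        # stays in the buffer until more data arrives.
        *complete, buffer = buffer.split('|~~|')
        for record in complete:
            record = record.strip('\r\n')
            if record:
                yield record.split('|~|')
    # Emit a final record that lacks a terminator, if there is one.
    tail = buffer.strip('\r\n')
    if tail:
        yield tail.split('|~|')

sample = io.StringIO(
    "Education|~|Name_Dutch|~|Name_English|~|Faculty|~~|"
    "International Business|~|MB|~|MB|~|ED|~~|"
)
out = io.StringIO()
# A tiny chunk size forces terminators to straddle chunk boundaries.
csv.writer(out).writerows(iter_records(sample, chunk_size=7))
print(out.getvalue())
```

Because the fields are passed to csv.writer as lists rather than as pre-joined text, a field that happens to contain a comma is quoted in the output.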
