Python tab-separated file parsing problems

From MySQL I am generating a tab-separated output file using outfile. I then use Python to load the TSV and process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain \t tabs, \n newlines, \r carriage returns, etc. The csv.reader keeps breaking the rows on all newline characters, not just the \n newline characters outside of my quoted fields.

Settings:

import csv

# Python 2 code as used in the question; note the 'U' in the file mode.
with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        pass  # do something

Example:

In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what MySQL outputs for a NULL value.

"4256996"   "test@gmail.com"    "Y  "   "98230\r"   "2012-07-10T12:00:00"   "some  location"    \N  \N  "false" "aaa"   "another-field" "true"  1

The resulting output:

['4256996', 'test@gmail.com', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

Is there a way to get csv.reader to read this input data properly, or is this some sort of limitation of the csv.reader object?

Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.

You need to open your file in binary mode only. By adding 'U' (universal newline mode), you are instead instructing Python to replace any \r (and any \r\n) with \n before csv.reader ever sees the data.

with open('/path/to/file.tsv', 'rb') as f:

Once you read the file as plain binary data, your sample input works:

>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
...         lineterminator='\n',
...         quoting=csv.QUOTE_ALL
...     )
>>> next(reader)
['4256996', 'test@gmail.com', 'Y  ', '98230\r', '2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']
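
The session above is Python 2, where the csv module expects files opened in binary mode. On Python 3 the module instead expects a text-mode file opened with newline='', which likewise keeps Python from translating line endings before the reader sees them. A minimal sketch of the same fix under that assumption follows; the helper name read_tsv, the temporary file, and the UTF-8 encoding are illustrative choices, not part of the original question or answer:

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="utf-8"):
    """Return all rows of a tab-separated file (hypothetical helper)."""
    # newline='' disables newline translation, so a '\r' or '\n' inside a
    # quoted field reaches csv.reader unchanged.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f, delimiter="\t"))


# Sample row modelled on the question: a tab inside the third field, a real
# carriage return inside the fourth, and a '\r\n' record terminator.
sample = (
    b'"4256996"\t"test@gmail.com"\t"Y\t"\t"98230\r"\t"2012-07-10T12:00:00"'
    b'\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
)

# Write the sample to a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(sample)

try:
    rows = read_tsv(tmp.name)
finally:
    os.remove(tmp.name)

print(len(rows), len(rows[0]))  # number of rows, number of fields
print(repr(rows[0][3]))         # the field that held the carriage return
```

The unquoted \N markers come back as the two-character string '\\N', so mapping them to None is left to the caller.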

To illustrate, with the U mode set Python reads a line of the data incorrectly, because the carriage return inside the quoted field is translated to a newline and ends the line early:

>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
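
The same translation can be observed on Python 3, where text mode applies universal-newline handling by default unless newline='' is passed. The sketch below is a rough Python 3 counterpart of the comparison above, not the original answer's code; the shortened two-field sample and the temporary file are assumptions made to keep it small:

```python
import os
import tempfile

# Shortened sample: a carriage return inside a quoted field, '\r\n' at the end.
data = b'"98230\r"\t"next field"\r\n'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)

try:
    # Default text mode (newline=None): '\r' and '\r\n' are both turned into '\n'.
    with open(tmp.name, encoding="utf-8") as f:
        translated = f.read()
    # newline='': line endings are passed through untouched.
    with open(tmp.name, newline="", encoding="utf-8") as f:
        untouched = f.read()
finally:
    os.remove(tmp.name)

print(repr(translated))
print(repr(untouched))
```

Only the second read preserves the carriage return inside the quoted field, which is why the csv documentation asks for newline='' when opening files for the reader.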
