Python tab-separated file parsing problems
From MySQL I am generating a tab-separated output file using outfile. I then use Python to load the TSV and process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain \t tabs, \n newlines, \r carriage returns, etc. The csv.reader keeps breaking the rows on all newline characters, not just the \n newline characters outside of my quoted fields.
import csv

with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        pass  # do something
In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what MySQL outputs for a NULL value.
"4256996" "test@gmail.com" "Y " "98230\r" "2012-07-10T12:00:00" "some location" \N \N "false" "aaa" "another-field" "true" 1
The resulting output:
['4256996', 'test@gmail.com', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']
Is there a way to get csv.reader to read this input data properly, or is this some sort of limitation of the csv.reader object?
Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.
You need to open your file in binary mode only. By adding 'U' (universal newline mode) you are instead instructing Python to replace any \r with \n, including the ones inside your quoted fields, before csv.reader ever sees the data.
with open('/path/to/file.tsv', 'rb') as f:
Once you read the file as plain binary data, your sample input works:
>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"test@gmail.com"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"test@gmail.com"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
... lineterminator='\n',
... quoting=csv.QUOTE_ALL
... )
>>> next(reader)
['4256996', 'test@gmail.com', 'Y ', '98230\r', '2012-07-10T12:00:00', 'some location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']
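The '\\N' entries in that row are MySQL's NULL markers, which csv.reader passes through as ordinary strings. If you want them as None, a minimal sketch follows; nulls_to_none and its null_marker parameter are hypothetical names, and it assumes no real field value is exactly \N.

```python
def nulls_to_none(row, null_marker='\\N'):
    # Replace MySQL's NULL marker with None; every other field is kept as-is.
    return [None if field == null_marker else field for field in row]


# A shortened, illustrative slice of the parsed row.
row = ['some location', '\\N', '\\N', 'false']
converted = nulls_to_none(row)
```

You could apply this to each row as it comes out of the reader loop.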
To illustrate, with the U mode set, Python reads the data incorrectly:
>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
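The sessions above are Python 2 (note the 0L and the byte string passed to BytesIO). On Python 3, csv.reader works on text rather than bytes, and the csv module documentation advises opening the file with newline='' so that newline characters inside quoted fields are not translated. A rough sketch of that variant, assuming Python 3 and a UTF-8 file; SAMPLE, read_tsv and demo are hypothetical names, and SAMPLE is a shortened row made up for illustration:

```python
import csv
import os
import tempfile

# Illustrative row: the third field ends with a carriage return inside the quotes.
SAMPLE = b'"4256996"\t"test@gmail.com"\t"98230\r"\t\\N\t"true"\t1\r\n'


def read_tsv(path):
    # newline='' switches off newline translation, so a '\r' inside a
    # quoted field reaches csv.reader unchanged.
    with open(path, 'r', newline='', encoding='utf-8') as f:
        return list(csv.reader(f, delimiter='\t'))


def demo():
    # Write the sample bytes to a temporary file, parse it, then clean up.
    fd, path = tempfile.mkstemp(suffix='.tsv')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(SAMPLE)
        return read_tsv(path)
    finally:
        os.remove(path)


rows = demo()
```

Here rows should hold a single record whose third field still ends in '\r', which mirrors what binary mode gives you on Python 2.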