How to split a TSV file into a smaller TSV file based on row values
I have a TSV file, in.txt, which I would like to split into a smaller TSV file called out.txt.
I would like to copy into out.txt only the rows of in.txt that contain the string value My String Value in column 6.
import csv
# r is textmode
# rb is binary mode
# binary mode is faster
with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
    tsvIn = csv.reader(tsvIn, delimiter='\t')
    tsvOut = csv.writer(tsvOut)
    for row in tsvIn:
        if "My String Value" in row:
            tsvOut.writerows(row)
My output looks like this:
D,r,a,m,a
1,9,6,1,-,0,4,-,1,3
H,y,u,n, ,M,o,k, ,Y,o,o
B,e,o,m,-,s,e,o,n, ,L,e,e
M,u,-,r,y,o,n,g, ,C,h,o,i,",", ,J,i,n, ,K,y,u, ,K,i,m,",", ,J,e,o,n,g,-,s,u,k, ,M,o,o,n,",", ,A,e,-,j,a, ,S,e,o
A, ,p,u,b,l,i,c, ,a,c,c,o,u,n,t,a,n,t,',s, ,s,a,l,a,r,y, ,i,s, ,f,a,r, ,t,o,o, ,s,m,a,l,l, ,f,o,r, ,h,i,m, ,t,o, ,e,v,e,n, ,g,e,t, ,a, ,c,a,v,i,t,y, ,f,i,x,e,d,",", ,l,e,t, ,a,l,o,n,e, ,s,u,p,p,o,r,t, ,h,i,s, ,f,a,m,i,l,y,., ,H,o,w,e,v,e,r,",", ,h,e, ,m,u,s,t, ,s,o,m,e,h,o,w, ,p,r,o,v,i,d,e, ,f,o,r, ,h,i,s, ,s,e,n,i,l,e,",", ,s,h,e,l,l,-,s,h,o,c,k,e,d, ,m,o,t,h,e,r,",", ,h,i,s, ,.,.,.
K,o,r,e,a,n,",", ,E,n,g,l,i,s,h
S,o,u,t,h, ,K,o,r,e,a
It should look like this, with tab-separated values:
Drama Hyn Mok Yoo A public accountant's salary is far to small for him...etc
There are a few things wrong with your code. Let's go through it line by line.
import csv
Import the module csv. OK.
with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
Using an auto-closed binary read handle tsvIn for in.txt and an auto-closed text write handle tsvOut for out.txt, do... (Note: you probably want to use mode wb instead of mode w; see this post.)
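The binary-mode advice above fits Python 2. On Python 3 the csv module works with text-mode files, and the documentation recommends opening them with newline='' so the module controls line endings itself. A minimal sketch of that round trip, assuming Python 3 and UTF-8 data; the sample rows and the temporary file name are made up for illustration:

```python
import csv
import os
import tempfile

# Hypothetical sample data; the real contents of in.txt are not shown.
rows = [["a", "b"], ["c", "d"]]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.tsv")

# Python 3: text mode plus newline='' lets the csv module handle line endings.
with open(path, "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read the file back the same way to check that the rows survive unchanged.
with open(path, "r", newline="", encoding="utf-8") as handle:
    read_back = list(csv.reader(handle, delimiter="\t"))
```

Without newline='' the output can pick up extra blank lines on platforms that translate line endings, which is the same problem the wb mode avoided on Python 2.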
tsvIn = csv.reader(tsvIn, delimiter='\t')
Let tsvIn be the result of calling the function reader in module csv with arguments tsvIn and delimiter='\t'. OK.
tsvOut = csv.writer(tsvOut)
Let tsvOut be the result of calling the function writer in module csv with argument tsvOut. You probably want to add another argument, delimiter='\t', too; otherwise the writer separates fields with commas.
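The effect of the delimiter argument is easy to check in memory. The sketch below writes one row with the default writer and one with delimiter='\t'; the row values are illustrative, and io.StringIO stands in for the real output file:

```python
import csv
import io

row = ["Drama", "1961-04-13", "South Korea"]  # illustrative values

# Default writer: fields are separated by commas.
default_buf = io.StringIO()
csv.writer(default_buf).writerow(row)

# Writer with an explicit tab delimiter: fields are separated by tabs.
tab_buf = io.StringIO()
csv.writer(tab_buf, delimiter="\t").writerow(row)

default_line = default_buf.getvalue()
tab_line = tab_buf.getvalue()
```

Here default_line is comma-separated and tab_line is tab-separated; both end with the writer's default line terminator, '\r\n'.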
for row in tsvIn:
For each element of tsvIn, as row, do...
if "My String Value" in row:
If the string "My String Value" is present anywhere in row. You mentioned that you want to keep only those rows whose sixth element is equal to the string, so you should use something like this instead:
if len(row) >= 6 and row[5] == "My String Value":
This means: run the body only if the length of row is at least 6 and the sixth element of row is equal to "My String Value". The length check avoids an IndexError on rows with fewer than six columns.
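To see how the two conditions differ, here is a small sketch with made-up rows: one with the target in column 6, one with the target in column 1 only, and one that is too short to have a sixth column.

```python
target = "My String Value"

# Hypothetical rows, already split into columns as csv.reader would return them.
rows = [
    ["a", "b", "c", "d", "e", "My String Value"],  # target in column 6
    ["My String Value", "b", "c", "d", "e", "f"],  # target in column 1 only
    ["a", "b"],                                    # fewer than six columns
]

# Membership test: matches the target in any column.
loose = [r for r in rows if target in r]

# Positional test: matches only when the sixth column equals the target.
strict = [r for r in rows if len(r) >= 6 and r[5] == target]
```

The membership test keeps the first two rows, while the positional test keeps only the first.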
tsvOut.writerows(row)
Call the method writerows of object tsvOut with argument row. Remember that in Python a string is just a sequence of characters, and a character is a single-element string. Thus, a character is itself a sequence. According to the docs, row is a list of strings, each representing one column of the row. The writerows method, however, expects a list of rows, that is, a list of lists of strings. Because strings are sequences, writerows treats each element of row as a row in its own right, and each character of that string as a separate field. The result is the messy, character-by-character output you are seeing. You should try this instead:
tsvOut.writerow(row)
The writerow method expects a single row as its argument, not a list of rows, so this yields the expected result.
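The difference between the two methods can be reproduced in memory. This sketch passes the same row, a list of two strings with illustrative values, to writerows and to writerow, with io.StringIO standing in for out.txt:

```python
import csv
import io

row = ["Drama", "Korean, English"]  # one row with two columns

# writerows treats every string in the list as its own row,
# and every character of that string as a separate field.
wrong = io.StringIO()
csv.writer(wrong).writerows(row)

# writerow writes the list as a single row with two fields.
right = io.StringIO()
csv.writer(right, delimiter="\t").writerow(row)

wrong_lines = wrong.getvalue().splitlines()
right_lines = right.getvalue().splitlines()
```

The first entry of wrong_lines is D,r,a,m,a, the same pattern as the output in the question, while right_lines holds a single tab-separated line.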
Try this:
import csv
# 'r' opens the input in text mode, 'w' opens the output in text mode
with open('in.txt', 'r') as tsvIn, open('out.txt', 'w') as tsvOut:
    reader = csv.reader(tsvIn, delimiter='\t')
    writer = csv.writer(tsvOut, delimiter='\t')
    for row in reader:
        # keep only rows whose sixth column equals the target string
        if len(row) >= 6 and row[5] == "My String Value":
            writer.writerow(row)
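If the same filter is needed for other files or columns, it could be wrapped in a function. The following is a sketch for Python 3 only, not part of the original answers: the name filter_tsv, its parameters, and the UTF-8 encoding are assumptions made for illustration.

```python
import csv


def filter_tsv(src_path, dst_path, value, column=5):
    """Copy rows whose `column` (0-based) equals `value`; return the count written.

    Hypothetical helper: assumes Python 3 and UTF-8 encoded input.
    """
    written = 0
    # newline='' lets the csv module manage line endings on both files.
    with open(src_path, "r", newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            # Skip rows that are too short to have the requested column.
            if len(row) > column and row[column] == value:
                writer.writerow(row)
                written += 1
    return written
```

A call such as filter_tsv('in.txt', 'out.txt', 'My String Value') would then cover the case in the question, with column=5 selecting the sixth column.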