
How to split a tsv file into a smaller tsv file based on row values

I have a tsv file, in.txt, which I would like to split into a smaller tsv file called out.txt.

I would like to copy into out.txt only the rows of in.txt that contain the string value My String Value in column 6.

import csv

# r is textmode
# rb is binary mode
# binary mode is faster

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
    tsvIn = csv.reader(tsvIn, delimiter='\t')
    tsvOut = csv.writer(tsvOut)

    for row in tsvIn:
        if "My String Value" in row:
            tsvOut.writerows(row)

My output looks like this.

D,r,a,m,a

1,9,6,1,-,0,4,-,1,3
H,y,u,n, ,M,o,k, ,Y,o,o
B,e,o,m,-,s,e,o,n, ,L,e,e
M,u,-,r,y,o,n,g, ,C,h,o,i,",", ,J,i,n, ,K,y,u, ,K,i,m,",", ,J,e,o,n,g,-,s,u,k, ,M,o,o,n,",", ,A,e,-,j,a, ,S,e,o

A, ,p,u,b,l,i,c, ,a,c,c,o,u,n,t,a,n,t,',s, ,s,a,l,a,r,y, ,i,s, ,f,a,r, ,t,o,o, ,s,m,a,l,l, ,f,o,r, ,h,i,m, ,t,o, ,e,v,e,n, ,g,e,t, ,a, ,c,a,v,i,t,y, ,f,i,x,e,d,",", ,l,e,t, ,a,l,o,n,e, ,s,u,p,p,o,r,t, ,h,i,s, ,f,a,m,i,l,y,., ,H,o,w,e,v,e,r,",", ,h,e, ,m,u,s,t, ,s,o,m,e,h,o,w, ,p,r,o,v,i,d,e, ,f,o,r, ,h,i,s, ,s,e,n,i,l,e,",", ,s,h,e,l,l,-,s,h,o,c,k,e,d, ,m,o,t,h,e,r,",", ,h,i,s, ,.,.,.

K,o,r,e,a,n,",", ,E,n,g,l,i,s,h

S,o,u,t,h, ,K,o,r,e,a

It should look like this, with tab-separated values:

Drama     Hyn Mok Yoo     A public accountant's salary is far to small for him...etc

There are a few things wrong with your code. Let's look at it line by line.

import csv

Import module csv. OK.

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:

With auto-closed binary read handle tsvIn from in.txt, and text write handle tsvOut from out.txt, do... (Note: on Python 2 you probably want to use mode wb instead of mode w; see this post. On Python 3, the csv module expects both files to be opened in text mode with newline=''.)

    tsvIn = csv.reader(tsvIn, delimiter='\t')

Let tsvIn be the result of calling function reader in module csv with arguments tsvIn and delimiter='\t'. OK.

    tsvOut = csv.writer(tsvOut)

Let tsvOut be the result of calling function writer in module csv with argument tsvOut. You probably want to add another argument, delimiter='\t', too; otherwise the writer uses its default delimiter, a comma.

    for row in tsvIn:

For each element in tsvIn as row, do...

        if "My String Value" in row:

If string "My String Value" is present in row. This test is true when any column of the row equals the string. You mentioned that you wanted only those rows whose sixth element is equal to the string, so you should use something like this instead...

        if len(row) >= 6 and row[5] == "My String Value":

This means: if the length of row is at least 6, and the sixth element of row is equal to "My String Value", do...
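To make the difference between the two tests concrete, here is a small sketch. The sample rows are made up for illustration and do not come from the question's file.

```python
# Hypothetical sample rows; only the second one has the target in column 6 (index 5).
rows = [
    ["My String Value", "a", "b", "c", "d", "other"],
    ["x", "a", "b", "c", "d", "My String Value"],
    ["too", "short"],
]
target = "My String Value"

# Membership test: true when ANY column equals the target.
any_column = [r for r in rows if target in r]

# Index test: true only when the sixth column equals the target.
# The length check keeps short rows from raising IndexError.
sixth_column = [r for r in rows if len(r) >= 6 and r[5] == target]

print(len(any_column))    # 2
print(len(sixth_column))  # 1
```

The membership test also keeps the first row, where the string sits in column 1, which is not what the question asks for.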

            tsvOut.writerows(row)

Call method writerows of object tsvOut with argument row. Remember that in Python a string is just a sequence of characters, and a character is a single-element string. Thus, a character is itself a sequence. According to the docs, row is a list of strings, each representing a column of the row. The writerows method, however, expects a list of rows, that is, a list of lists of strings. Because strings are sequences, writerows can interpret each element of row as a row (when it is actually a string) and each character of that string as a field (since characters are strings too). All this means that you get a messy, character-by-character output. You should try this instead...

            tsvOut.writerow(row)

Method writerow expects a single row as an argument, not a list of rows, so this will yield the expected result.
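The difference between writerows and writerow can be reproduced without any files. The following is a minimal sketch that writes into in-memory buffers (io.StringIO, Python 3). The two-field sample row is an assumption loosely based on the output shown in the question.

```python
import csv
import io

# Hypothetical two-column row, loosely based on the question's sample output.
row = ["Drama", "1961-04-13"]

# writerows treats each string in the row as its own row,
# so every character becomes a separate field.
wrong = io.StringIO()
csv.writer(wrong, delimiter="\t", lineterminator="\n").writerows(row)

# writerow writes the list as one record, one field per string.
right = io.StringIO()
csv.writer(right, delimiter="\t", lineterminator="\n").writerow(row)

print(repr(wrong.getvalue()))  # characters separated by tabs, one line per field
print(repr(right.getvalue()))  # 'Drama\t1961-04-13\n'
```

The first buffer ends up with one line per original field and a tab between every character, which is the same pattern as the comma-separated mess in the question.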

Try this:

import csv

# 'r' opens the input in text mode ('rb' would be binary mode)

with open('in.txt','r') as tsvIn, open('out.txt', 'w') as tsvOut:
    reader = csv.reader(tsvIn, delimiter='\t')
    writer = csv.writer(tsvOut, delimiter='\t')

    # Note: this keeps rows that contain the string in any column
    for row in reader:
        if "My String Value" in row:
            writer.writerow(row)
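Putting the points from both answers together, the filter could be wrapped in a small reusable function. This is only a sketch under some assumptions: it targets Python 3, and the name filter_tsv and its parameters are hypothetical, not part of the csv module.

```python
import csv


def filter_tsv(src_path, dst_path, column_index, value):
    """Copy rows whose field at column_index equals value (hypothetical helper).

    Returns the number of rows written.
    """
    written = 0
    # Python 3: the csv docs recommend newline='' for files given to reader/writer.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            # Skip rows too short to have the requested column.
            if len(row) > column_index and row[column_index] == value:
                writer.writerow(row)
                written += 1
    return written
```

For the question's case the call would be filter_tsv('in.txt', 'out.txt', 5, 'My String Value'), since column 6 is index 5.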
