How to split a tsv file into a smaller tsv file based on row values

Question

I have a tsv file, in.txt, which I would like to split into a smaller tsv file called out.txt.

I would like to copy into out.txt only the rows of in.txt that contain the string value My String Value in column 6.

import csv

# r is textmode
# rb is binary mode
# binary mode is faster

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
    tsvIn = csv.reader(tsvIn, delimiter='\t')
    tsvOut = csv.writer(tsvOut)

    for row in tsvIn:
        if "My String Value" in row:
            tsvOut.writerows(row)

My output looks like this.

D,r,a,m,a

1,9,6,1,-,0,4,-,1,3
H,y,u,n, ,M,o,k, ,Y,o,o
B,e,o,m,-,s,e,o,n, ,L,e,e
M,u,-,r,y,o,n,g, ,C,h,o,i,",", ,J,i,n, ,K,y,u, ,K,i,m,",", ,J,e,o,n,g,-,s,u,k, ,M,o,o,n,",", ,A,e,-,j,a, ,S,e,o

A, ,p,u,b,l,i,c, ,a,c,c,o,u,n,t,a,n,t,',s, ,s,a,l,a,r,y, ,i,s, ,f,a,r, ,t,o,o, ,s,m,a,l,l, ,f,o,r, ,h,i,m, ,t,o, ,e,v,e,n, ,g,e,t, ,a, ,c,a,v,i,t,y, ,f,i,x,e,d,",", ,l,e,t, ,a,l,o,n,e, ,s,u,p,p,o,r,t, ,h,i,s, ,f,a,m,i,l,y,., ,H,o,w,e,v,e,r,",", ,h,e, ,m,u,s,t, ,s,o,m,e,h,o,w, ,p,r,o,v,i,d,e, ,f,o,r, ,h,i,s, ,s,e,n,i,l,e,",", ,s,h,e,l,l,-,s,h,o,c,k,e,d, ,m,o,t,h,e,r,",", ,h,i,s, ,.,.,.

K,o,r,e,a,n,",", ,E,n,g,l,i,s,h

S,o,u,t,h, ,K,o,r,e,a

It should look like this, with tab-separated values:

Drama     Hyn Mok Yoo     A public accountant's salary is far to small for him...etc

Answer 1

There are a few things wrong with your code. Let's look at it line by line.

import csv

Import the csv module. OK.

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:

Open in.txt for binary reading as tsvIn and out.txt for text writing as tsvOut; both handles are closed automatically when the block ends. (Note: you probably want to use mode wb instead of mode w; see this post.) That advice applies to Python 2. In Python 3 the csv module expects text-mode files, ideally opened with newline=''.

    tsvIn = csv.reader(tsvIn, delimiter='\t')

Let tsvIn be the result of calling the reader function of the csv module with the arguments tsvIn and delimiter='\t'. OK.

    tsvOut = csv.writer(tsvOut)

Let tsvOut be the result of calling the writer function of the csv module with the argument tsvOut. You probably want to add another argument, delimiter='\t', too; otherwise the output is comma-separated.
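To make the effect of the missing delimiter argument concrete, here is a minimal Python 3 sketch that writes the same row with and without delimiter='\t'. The sample row and the helper name render are made up for illustration; only csv.writer, writerow and io.StringIO come from the standard library.

```python
import csv
import io

# Hypothetical sample row, loosely modelled on the data in the question.
row = ["Drama", "1961-04-13", "Korean"]

def render(row, **writer_kwargs):
    """Write one row to an in-memory buffer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_kwargs).writerow(row)
    return buffer.getvalue()

default_output = render(row)              # csv.writer defaults to commas
tab_output = render(row, delimiter="\t")  # explicit tab delimiter

print(repr(default_output))
print(repr(tab_output))
```

The line ending in both cases is the writer's default, '\r\n'.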

    for row in tsvIn:

For each element of tsvIn, bound to the name row, do...

        if "My String Value" in row:

If the string "My String Value" is present in row, that is, if any column of the row equals it. You mentioned that you want only those rows whose sixth element is equal to the string, so you should use something like this instead:

        if len(row) >= 6 and row[5] == "My String Value":

This means: if row has at least 6 elements, and the sixth element of row (index 5) is equal to "My String Value", do...
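The difference between the two conditions can be seen on a few made-up rows. This is only a sketch: the function names and the sample rows are hypothetical, and it assumes "column 6" means an exact match on index 5.

```python
TARGET = "My String Value"

def matches_any_column(row):
    # Original condition: true if ANY column equals the target.
    return TARGET in row

def matches_sixth_column(row):
    # Corrected condition: true only if column 6 (index 5) equals the target.
    # The length check avoids an IndexError on short rows.
    return len(row) >= 6 and row[5] == TARGET

rows = [
    ["a", "b", "c", "d", "e", "My String Value", "g"],  # target in column 6
    ["My String Value", "b", "c", "d", "e", "f", "g"],  # target in column 1
    ["a", "b"],                                         # short row
]

any_results = [matches_any_column(r) for r in rows]
sixth_results = [matches_sixth_column(r) for r in rows]
print(any_results)
print(sixth_results)
```

Only the first row passes the corrected check, while the original condition also accepts the second row.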

            tsvOut.writerows(row)

Call the writerows method of tsvOut with the argument row. Remember that in Python a string is a sequence of characters, and a character is itself a one-element string, so a character is also a sequence. According to the docs, row is a list of strings, each representing one column of the row. The writerows method, however, expects a list of rows, that is, a list of lists of strings. When you pass it a single row, it treats each element of row (actually a string) as a row of its own, and each character of that string as a separate field. The result is the messy, character-by-character output you are seeing. You should try this instead:

            tsvOut.writerow(row)

The writerow method expects a single row as its argument, not a list of rows, so this yields the expected result.
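The behaviour described above is easy to reproduce in memory. The following Python 3 sketch uses a made-up two-column row and io.StringIO buffers in place of real files.

```python
import csv
import io

# Hypothetical two-column row.
row = ["Drama", "Korean"]

# writerows(row): each string in the row is treated as a row of its own,
# and each character of that string becomes a separate field.
wrong = io.StringIO()
csv.writer(wrong).writerows(row)

# writerow(row): the list is written as one row, one field per string.
right = io.StringIO()
csv.writer(right, delimiter="\t").writerow(row)

wrong_text = wrong.getvalue()
right_text = right.getvalue()
print(repr(wrong_text))
print(repr(right_text))
```

The first buffer shows the same character-by-character pattern as the output in the question; the second holds a single tab-separated line.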

Answer 2

Try this:

import csv

# 'r' and 'w' open the files in text mode,
# which is what the csv module expects in Python 3.

with open('in.txt', 'r') as tsvIn, open('out.txt', 'w') as tsvOut:
    reader = csv.reader(tsvIn, delimiter='\t')
    writer = csv.writer(tsvOut, delimiter='\t')

    # Keep every row in which any column equals the string exactly.
    for row in reader:
        if "My String Value" in row:
            writer.writerow(row)
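Combining the column-6 check from Answer 1 with this reader/writer structure gives something like the sketch below. It assumes Python 3 and UTF-8 input. The function name filter_tsv and its parameters are hypothetical, chosen only for illustration, and lineterminator="\n" is a choice made here to avoid '\r\n' line endings in the output.

```python
import csv

def filter_tsv(in_path, out_path, value, column=5):
    """Copy rows whose field at index `column` equals `value`.

    Returns the number of rows written. `column=5` is the sixth column.
    """
    kept = 0
    # newline="" lets the csv module handle line endings itself (Python 3).
    with open(in_path, newline="", encoding="utf-8") as tsv_in, \
         open(out_path, "w", newline="", encoding="utf-8") as tsv_out:
        reader = csv.reader(tsv_in, delimiter="\t")
        writer = csv.writer(tsv_out, delimiter="\t", lineterminator="\n")
        for row in reader:
            # Skip rows that are too short to have the requested column.
            if len(row) > column and row[column] == value:
                writer.writerow(row)  # one row per call, not writerows
                kept += 1
    return kept
```

With the files from the question, the call would be filter_tsv('in.txt', 'out.txt', 'My String Value').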

Answer 1 (2, accepted): 2016-03-18 20:23:43

Answer 2 (1): 2016-03-18 20:12:22
