
Converting tsv to tsv in python

I have a TSV file (tab-separated) and would like to filter out a lot of data using Python before I import it into a PostgreSQL database. My problem is that I can't find a way to keep the format of the original file, which is mandatory because otherwise the import process won't work. The web suggested that I should use the csv library, but no matter what delimiter I use, I always end up with files in a different format than the original: files that contain a comma after every character, files that contain a tab after every character, or files that have all data in one row. Here is my code:

import csv
import glob

# create a list of all tsv-files in one directory
liste = glob.glob("/some_directory/*.tsv")

# go thru all the files
for item in liste:
    # open the tsv-file for reading and a file for writing
    with open(item, 'r') as tsvin, open('/some_directory/new.tsv', 'w') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        # I am not sure if I have to enter a delimiter here for the outfile. If I enter "delimiter='\t'" like for the in-file, the outfile ends up with a tab after every character
        writer = csv.writer(csvout)

        # go thru all lines of the input tsv
        for row in tsvin:
            # do some filtering
            if 'some_substring1' in row[4] or 'some_substring2' in row[4]:
                # do some more filtering
                if 'some_substring1' in str(row[9]) or 'some_substring1' in str(row[9]):
                    # now I get lost...
                    writer.writerow(row)

Do you have any idea what I am doing wrong? The final file has to have a tab between every field and some kind of line break at the end.

Somehow you are passing a string to writer.writerow(), not a list as expected.

Remember that strings are iterable; each iteration returns a single character from the string. writerow() simply iterates over its argument, writing each item separated by the delimiter character (by default a comma). So if you pass a string to writerow(), it will write each character from the string separated by the delimiter.
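To make this concrete, here is a small sketch that writes one row into an in-memory buffer, once with a list and once with a bare string. The helper name `write_one_row` and the sample values are made up for illustration; only `csv.writer`, `writerow()` and `io.StringIO` come from the standard library.

```python
import csv
import io


def write_one_row(row):
    """Write a single row with a tab delimiter and return the raw text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(row)
    return buffer.getvalue()


as_list = write_one_row(["abc", "def"])  # a list of fields
as_string = write_one_row("abc")         # a bare string, iterated per character

print(repr(as_list))    # 'abc\tdef\n'
print(repr(as_string))  # 'a\tb\tc\n'
```

The second result has a tab after every character, which matches the symptom described in the question.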

How is it that row is a string? It could be that the delimiter for the input file is incorrect: perhaps the file does not use tabs but has fixed-width fields, using runs of spaces as the delimiter.
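One rough way to test that guess is to count the separators in a raw line before handing the file to the csv module. The function name `describe_separators` and both sample lines below are hypothetical; with a real file you would pass in its first line instead.

```python
import re


def describe_separators(line):
    """Count tab characters and runs of two or more spaces in one line."""
    return {
        "tabs": line.count("\t"),
        "space_runs": len(re.findall(r" {2,}", line)),
    }


tab_line = "id\tname\tcity\n"
fixed_width_line = "id    name      city\n"

print(describe_separators(tab_line))          # {'tabs': 2, 'space_runs': 0}
print(describe_separators(fixed_width_line))  # {'tabs': 0, 'space_runs': 2}
```

A line with no tabs but several runs of spaces suggests the file is fixed-width rather than tab-separated, in which case delimiter='\t' would not split it into fields.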

You can check whether the reader is correctly parsing your file by printing out the value of row:

for row in tsvin:
    print(row)
    ...

If the file is being correctly parsed, expect to see that row is a list, and that each element of the list corresponds to a column/field from the file.

If it is not parsing correctly, then you might see that row is a string, or that it's a list but the fields are empty and/or out of place.

It would be helpful if you added a sample of your input file to the question.
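Assuming the input really is tab-separated and row arrives as a list, the filtering step could look roughly like the sketch below. The function `filter_tsv`, its parameters and the in-memory sample data are illustrative assumptions, not the asker's actual code; the substrings and column index 4 are taken from the question.

```python
import csv
import io


def filter_tsv(infile, outfile, substrings, column=4):
    """Copy rows whose `column` contains any of `substrings`, keeping tabs."""
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    kept = 0
    for row in reader:
        # Skip rows that are too short instead of raising IndexError.
        if len(row) > column and any(s in row[column] for s in substrings):
            writer.writerow(row)  # row is a list, so one tab between fields
            kept += 1
    return kept


# In-memory stand-ins for the real input and output files.
source = io.StringIO("a\tb\tc\td\tfoo some_substring1\nA\tB\tC\tD\tother\n")
target = io.StringIO()

kept = filter_tsv(source, target, ["some_substring1", "some_substring2"])
print(kept, repr(target.getvalue()))
```

With real files, the csv module documentation recommends opening both with newline='' so the module controls line endings itself. Note also that the question's loop reopens the same output path in 'w' mode for every input file, so each pass would overwrite the previous result.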
