Python: how to read a TSV file, clean it, and save it as a new file?

I want to remove all punctuation from column 4 of my TSV file and then save the entire file. This is my code:

import csv
import string

exclude = set(string.punctuation)

with open("test1") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        line[4] = ''.join(ch for ch in line[4] if ch not in exclude)
    tsvfile.close()

The code above runs fine, but the file is not saved with the changes I made. How can I save the changes to the old file?

You are not writing any changes. You are only modifying the fifth element of each row in memory and then doing nothing with it. If you want to change the original file, you can write to a tempfile and then use shutil.move to replace the original file with the updated temp file:

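import csv
import string
from shutil import move
from tempfile import NamedTemporaryFile

exclude = string.punctuation

# Write the cleaned rows to a temp file in the same directory, then swap it in.
# Mode "w" opens the temp file as text, which csv.writer requires on Python 3.
with open("test1") as tsvfile, NamedTemporaryFile("w", dir=".", delete=False) as t:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    temp = csv.writer(t, delimiter="\t")
    for row in tsvreader:
        # strip() only removes punctuation at the start and end of the field
        row[4] = row[4].strip(exclude)
        temp.writerow(row)

# Replace the original file with the updated temp file.
move(t.name, "test1")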
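On Python 3, the same temp-file idea could be sketched roughly as follows. The function name clean_column_in_place and its parameters are hypothetical, not part of the original answer, and this version removes punctuation anywhere in the field rather than only at its ends. It assumes every row has at least column + 1 fields.

```python
import csv
import os
import string
from tempfile import NamedTemporaryFile

# Translation table that deletes every ASCII punctuation character (Python 3).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_column_in_place(path, column):
    """Rewrite the TSV file at `path`, removing punctuation from one column.

    `column` is a zero-based index; a row shorter than that raises IndexError.
    """
    # Create the temp file next to the original so the final rename stays
    # on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as src, NamedTemporaryFile(
        "w", dir=directory, delete=False, newline=""
    ) as tmp:
        writer = csv.writer(tmp, delimiter="\t")
        for row in csv.reader(src, delimiter="\t"):
            row[column] = row[column].translate(PUNCT_TABLE)
            writer.writerow(row)
    # Swap the cleaned file in only after it has been fully written and closed.
    os.replace(tmp.name, path)
```

Calling clean_column_in_place("test1", 4) would then update the fifth column of the file from the question. Writing to a temp file first means the original is left untouched if something fails halfway through.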

If you want to create a new file instead of updating the original, you just need to open a new file for writing and write each cleaned row to it:

with open("test1") as tsvfile, open("out", "w") as t:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    temp = csv.writer(t, delimiter="\t")
    for row in tsvreader:
        row[4] = row[4].strip(exclude)
        temp.writerow(row)

To strip punctuation from the start and end of the field, str.strip(exclude) is sufficient. If you want to remove it from anywhere in the field, you can go back to ''.join([ch for ch in line[4] if ch not in exclude]), but in that case str.translate is the better tool (Python 2 form shown):

row[4] = row[4].translate(None, exclude)
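The difference between stripping the ends and removing punctuation everywhere can be seen on a small made-up string. This is only an illustration; the sample text is not from the question's data.

```python
import string

exclude = string.punctuation
text = "...it's here!!"  # made-up sample value

# strip() only trims characters from both ends; the apostrophe survives.
stripped = text.strip(exclude)

# The join removes every punctuation character, wherever it appears.
removed_everywhere = "".join(ch for ch in text if ch not in exclude)

print(stripped)            # it's here
print(removed_everywhere)  # its here
```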

If you want to replace each punctuation character with a space instead of deleting it:

from string import maketrans
tbl = maketrans(exclude," "*len(exclude))

....
row[4] = row[4].translate(tbl) 
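string.maketrans and the two-argument translate(None, ...) call exist only on Python 2. On Python 3, the equivalent tables can be built with str.maketrans, as in this small sketch. The variable names and the sample string are made up for illustration.

```python
import string

# Third argument of str.maketrans lists the characters to delete.
delete_table = str.maketrans("", "", string.punctuation)

# Two equal-length strings map each punctuation character to a space.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

sample = "Hello, world! (test)"  # made-up sample value
removed = sample.translate(delete_table)
spaced = sample.translate(space_table)

print(removed)  # Hello world test
```

Note that the space-mapping version keeps the string the same length, so adjacent punctuation and spaces turn into runs of spaces.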

Lastly, if you actually mean the fourth column, then it would be row[3], not row[4], because list indices start at 0.

You say that you want a new file, so you will need to open a second file for writing and write the cleaned rows to it:

import csv
import string

exclude = string.punctuation

# The output file must be opened in write mode ("w"), otherwise writerow() fails.
with open("test1") as tsvfile, open("out.csv", "w") as outfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    tsvwriter = csv.writer(outfile, delimiter="\t")
    for row in tsvreader:
        # Python 2 form: delete every character listed in exclude
        row[4] = row[4].translate(None, exclude)
        tsvwriter.writerow(row)

This uses str.translate() to remove all unwanted punctuation characters from the column. The above is for Python 2. For Python 3, use this instead:

row[4] = row[4].translate({ord(c): None for c in string.punctuation})
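Putting the Python 3 form together, a complete version that writes a new file might look like the sketch below. The function name clean_tsv, its parameters, and the choice to skip rows that are too short are assumptions made for illustration, not part of the original answer.

```python
import csv
import string

# Python 3 translation table: each punctuation code point maps to None (delete).
PUNCT = {ord(c): None for c in string.punctuation}


def clean_tsv(src_path, dst_path, column):
    """Copy a TSV file, removing punctuation from one zero-based column."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src, delimiter="\t"):
            if len(row) > column:  # leave rows that lack this column untouched
                row[column] = row[column].translate(PUNCT)
            writer.writerow(row)
```

For the file in the question, this would be called as clean_tsv("test1", "out.tsv", 4), or with 3 as the last argument if the fourth column is meant.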
