
How to parse a text file in Python

I have a task: I have a directory that contains many text files. Each file has many lines, and each line consists of tab-delimited fields. I have to exclude some of the lines from these files by comparing the value in the first field with the values in another text file. The 'bad' lines (those that match) have to be copied to a new 'bad' file, and the 'good' lines (those that do not match) have to be copied to a new 'good' file. At the end I should have many new files ('good' and 'bad'). In other words, the script should parse each file in the directory, compare the first field of each line with the values in the other file, and copy the line into the corresponding new file depending on whether it matches. I wrote this:

import csv
import sys
import os

prefix = 'dna'
goodFiles = []
badFiles = []

fileList = os.listdir(sys.argv[1])

for f in fileList:
    absFile = os.path.join(os.path.abspath(sys.argv[1]), f )
    newBadF = "BADFile" + "_" + f
    badFile = open(newBadF,'w')
    newGoodF = "GOODFile" + "_" + f
    goodFile = open(newGoodF,'w')
    resultList = open(sys.argv[2], 'rb')
    convertList = list(resultList)
    with open(absFile, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            for field in convertList:
                if row[0].lower() == field.strip():
                    badFile.writelines('"%s"\n' % row)
                    next
                else:
                    goodFile.writelines('"%s"\n' % row)
                    next

My script does not work :) That is, it produces files where each line is a list like this: "['342', '343', '344', '345', '346', '347', '348', '349', '350']", whereas the original files have a different format: no commas, and no '[' or ']'. My question: how can I fix it and get new files in the same format as the original ones? Thanks

You can use a csv.writer in the same way you are using csv.reader if you would like to keep the same delimiter:

bad_writer = csv.writer(badFile, delimiter='\t')
good_writer = csv.writer(goodFile, delimiter='\t')
...
if row[0].lower() == field.strip():
    bad_writer.writerow(row)
else:
    good_writer.writerow(row)

etc.
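For context, the snippet above could be fitted into a complete per-file routine roughly as follows. This is only a sketch under several assumptions: it targets Python 3 (the question's 'rb' mode suggests Python 2), the helper names load_bad_keys and split_file and the out_dir parameter are made up for illustration, and the exclusion list is loaded into a set so that each row is tested once and written exactly once, rather than once per entry in the list as in the question's nested loop. The comparison itself mirrors the question (lower-cased first field against the stripped list entry).

```python
import csv
import os


def load_bad_keys(list_path):
    """Read one key per line into a set for fast membership tests."""
    with open(list_path) as handle:
        # Same normalisation as the question: strip whitespace only.
        return {line.strip() for line in handle if line.strip()}


def split_file(src_path, bad_keys, out_dir="."):
    """Copy each tab-delimited row of src_path to a GOOD or a BAD file."""
    name = os.path.basename(src_path)
    good_path = os.path.join(out_dir, "GOODFile_" + name)
    bad_path = os.path.join(out_dir, "BADFile_" + name)
    with open(src_path, newline="") as src, \
            open(good_path, "w", newline="") as good, \
            open(bad_path, "w", newline="") as bad:
        reader = csv.reader(src, delimiter="\t")
        good_writer = csv.writer(good, delimiter="\t", lineterminator="\n")
        bad_writer = csv.writer(bad, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:
                continue  # skip blank lines
            # Pick one writer per row, so the row is written exactly once.
            is_bad = row[0].lower() in bad_keys
            (bad_writer if is_bad else good_writer).writerow(row)
    return good_path, bad_path
```

To process a whole directory, split_file could be called for each name returned by os.listdir, joined with the directory path as in the question. Note that csv.writer will add quotes around any field that itself contains a tab or a quote character, so the output matches the input exactly only when the fields are plain.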

When you call

badFile.writelines('"%s"\n' % row)

the % format operator actually turns the row (a list) into its string representation:

>>> _list = [1,2,3]
>>> str(_list)
'[1, 2, 3]'
>>> 
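The same effect can be seen with a row like the one in the question. The short sketch below (the values are taken from the question's sample output, the variable names are illustrative) contrasts the formatted list with a plain tab join, which is another way to get back the original layout when the fields contain no tabs or quotes:

```python
row = ["342", "343", "344"]

# What the question's code writes: the list's repr wrapped in double quotes.
as_repr = '"%s"\n' % row

# Joining the fields with tabs reproduces the original line layout.
as_line = "\t".join(row) + "\n"

print(repr(as_repr))
print(repr(as_line))
```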
