
How to remove lines with duplicated substrings from a txt file in Python?

I am trying to remove lines from a .txt file that contain duplicated substrings. Let's say I have lines like this:

aaaaaa, something.... 
bbbbbb, something different.. 
cccccc, some other text.. 
cccccc, again different text.. 
dddddd, again some other text..
eeeeee, some other text... 
etc..

I want to filter out all the lines that start with the same substring (the first N characters), so that only one line (the first one) starting with it remains. I want to copy the remaining lines to a new txt file.

So in the example above, the first three lines would be copied, the fourth would be skipped, and the rest would be copied.

I want to copy the whole lines, not only the substring that I am checking.

This is what I have written, based on what I have found:

lines_seen = set()
outfile = open(outfile, "w")

for line in open(infile, "r"):
    string_to_compare = line[0:N] #save the substring into a variable
    if line.startswith(string_to_compare) not in lines_seen:
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

The code above actually copies all the lines from infile into outfile, so no filtering is done.

Can anyone tell me where the mistake is, or how to make it work, please?

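If you are only interested in the first 60 characters (N = 60), you should store only that slice in your set ( lines_seen.add(string_to_compare) ), and your check should be changed to if string_to_compare not in lines_seen: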
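To spell out why the original check never filters anything: line.startswith(string_to_compare) always returns True, because string_to_compare is itself the start of line, and True is never a member of a set that only holds strings, so every line gets written. Below is a minimal sketch of the suggested fix. The function names first_per_prefix and copy_first_per_prefix are hypothetical, introduced here only for illustration, and n stands in for the N from the question.

```python
def first_per_prefix(lines, n):
    """Yield each line whose first n characters have not been seen before.

    Hypothetical helper name; n plays the role of N in the question.
    """
    prefixes_seen = set()
    for line in lines:
        prefix = line[:n]  # the substring used for comparison
        if prefix not in prefixes_seen:
            # Remember only the prefix, but yield the whole line.
            prefixes_seen.add(prefix)
            yield line


def copy_first_per_prefix(infile, outfile, n):
    """Copy infile to outfile, keeping the first line for each n-character prefix."""
    # The with statement closes both files even if an error occurs.
    with open(infile, "r") as src, open(outfile, "w") as dst:
        dst.writelines(first_per_prefix(src, n))
```

With the sample lines from the question and n = 6, this should keep the first "cccccc" line and skip the second one. Note that a line shorter than n characters is compared by its full content, including the trailing newline.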
