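Python: how to skip the repeated lines in an input file?
I am reading data from a file and want to read each line individually, because the third line of the output must be a combination of the previous two lines. Here is a small example: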
Input:
<www.example.com/apple> <Anything>
<www.example.com/banana> <Anything>
Output:
<www.example.com/apple> <Anything>
<www.example.com/banana> <Anything>
<Apple> <Banana>
If a line is a duplicate or is blank, I don't want to process it; each time I only want to get 2 distinct lines.
Here is part of my actual input:
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
In this case, I would like the output to look like this:
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
<bread> <roll>
Here is my code:
file = open('rdfs.txt')
for id, line in enumerate(file):
    if id % 2 == 0:
        if line.isspace():
            continue
        line1 = line.split()
        sub_line1, rel_line1 = line1[0], line1[1]
        sub_line1 = sub_line1.lstrip("<").rstrip(">")
        print(sub_line1)
    else:
        if line.isspace():
            continue
        line2 = line.split()
        sub_line2, rel_line2 = line2[0], line2[1]
        sub_line2 = sub_line2.lstrip("<").rstrip(">")
        print(sub_line2)
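It works, but I am getting every line. How can I add a check so that, if the second line is equal to the previous one, all lines are skipped until a new, different line is found?
The output I get now: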
http://catalog.data.gov/bread
http://catalog.data.gov/bread
http://catalog.data.gov/roll
http://catalog.data.gov/roll
Thanks!!
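You can declare a set() named lines_seen that holds every line seen so far, then check whether the current line is already in lines_seen before processing it. Add that test to your existing checks.
Your code should look like this: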
file = open('rdfs.txt')
lines_seen = set()  # holds lines already seen
for id, line in enumerate(file):
    if line not in lines_seen:  # not a duplicate
        lines_seen.add(line)
        if id % 2 == 0:
            if line.isspace():
                continue
            line1 = line.split()
            sub_line1, rel_line1 = line1[0], line1[1]
            sub_line1 = sub_line1.lstrip("<").rstrip(">")
            print(sub_line1)
        else:
            if line.isspace():
                continue
            line2 = line.split()
            sub_line2, rel_line2 = line2[0], line2[1]
            sub_line2 = sub_line2.lstrip("<").rstrip(">")
            print(sub_line2)
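The code above prints the de-duplicated subjects but does not yet produce the combined third line the question asks for. One possible way to get there is sketched below, using the same set idea. The names last_segment and combine_pairs are hypothetical helpers made up for this illustration, and the sketch makes a few assumptions: lines are compared after stripping surrounding whitespace, a duplicate is skipped wherever it appears in the file (not only when it directly follows its twin), the combined line uses the last path segment of each subject as it is written, and a leftover line with no partner is dropped.

```python
def last_segment(term):
    """Return the last path segment of a term such as <http://host/bread>."""
    return term.strip("<>").rstrip("/").rsplit("/", 1)[-1]


def combine_pairs(lines):
    """Yield each pair of distinct lines, followed by a combined third line."""
    seen = set()    # stripped lines that were already accepted
    pending = None  # first line of the current pair, if one is waiting
    for raw in lines:
        line = raw.strip()
        if not line or line in seen:
            continue  # skip blank lines and duplicates
        seen.add(line)
        if pending is None:
            pending = line  # wait for a second distinct line
            continue
        yield pending
        yield line
        first = last_segment(pending.split()[0])
        second = last_segment(line.split()[0])
        yield "<{}> <{}>".format(first, second)
        pending = None  # start a new pair


# Sample data taken from the question, with a blank line added.
sample = [
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "\n",
    "<http://catalog.data.gov/roll> <http://dbpedia.org>\n",
    "<http://catalog.data.gov/roll> <http://dbpedia.org>\n",
]

for out in combine_pairs(sample):
    print(out)
```

To run this against the file from the question, passing the open file object should work, for example: with open('rdfs.txt') as f: followed by a loop over combine_pairs(f). The with statement also closes the file, which the original code never does.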