Using text in one file to search for a match in a second file
I'm using Python 2.6 on Linux.
I have two text files. first.txt has a single string of text on each line, so it looks like this:
lorem
ipus
asfd
The second file doesn't quite have the same format. It looks more like this:
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine whether there's a match in the second file. If there isn't a match, I would like to save the missing text to a third file. I would like to ignore case, but that's not strictly necessary. This is why I was looking at regex, but I didn't have much luck.
So I'm opening the files and using readlines() to create a list, then iterating through the lists and printing out the matches.
Here's my code:
first_file=open('first.txt', "r")
first=first_file.readlines()
first_file.close()
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
while i < len(first):
    j=search[i]
    while k < len(second):
        m=compare[k]
        if not j.find(m):
            print m
        i=i+1
        k=k+1
exit()
It's definitely not elegant. Does anyone have suggestions on how to fix this, or a better solution?
My approach is this: read the second file, convert it to lowercase, and then create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, and with extra whitespace removed) is not in the set we created, write it to the third file.
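# Collect every lowercase word from the second file into a set.
with open("second.txt") as second_file:
    second_values = set(second_file.read().lower().split())

with open("first.txt") as first_file:
    with open("third.txt", "wt") as third_file:
        for line in first_file:
            if line.lower().strip() not in second_values:
                # line already ends with a newline, so strip it before adding one.
                third_file.write(line.strip() + "\n")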
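To try the idea without creating any files, the same logic can be wrapped in a function that takes lists of lines. This is only a sketch: the name find_missing and the sample data (taken from the question, with one word capitalised to exercise the case handling) are assumptions for illustration.

```python
def find_missing(first_lines, second_lines):
    """Return entries of first_lines that do not occur as a word in second_lines."""
    # Build a set of every lowercase word found in the second input.
    second_words = set()
    for line in second_lines:
        second_words.update(line.lower().split())

    missing = []
    for line in first_lines:
        entry = line.strip()
        # Skip blank lines; compare case-insensitively.
        if entry and entry.lower() not in second_words:
            missing.append(entry)
    return missing


first_lines = ["lorem\n", "ipus\n", "asfd\n"]
second_lines = ["1231 Lorem\n", "1311 assss 31 1\n"]
missing = find_missing(first_lines, second_lines)
```

Note that this compares each entry against single words, so an entry in first.txt that itself contains spaces would never match.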
set objects are a simple container type that is unordered and cannot contain duplicate values. They are designed to let you quickly add or remove items, or tell whether an item is already in the set.
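A small illustration of that set behaviour, using made-up values borrowed from the question:

```python
# Duplicates are dropped when the set is built.
words = set(["lorem", "ipus", "lorem"])
count = len(words)

words.add("asfd")       # add an item
words.discard("ipus")   # remove an item (no error if it is absent)

has_lorem = "lorem" in words
has_ipus = "ipus" in words
```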
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards; in Python 2.5 they require that you put the line from __future__ import with_statement at the top of your file.
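For comparison, here is a sketch of a with block next to a roughly equivalent try/finally form. The temporary file is only a stand-in so the example does not touch real data.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# With a with statement, the file is closed when the block exits.
with open(path, "w") as handle:
    handle.write("lorem\n")
closed_after_with = handle.closed

# Roughly the same guarantee written by hand with try/finally.
handle2 = open(path, "r")
try:
    content = handle2.read()
finally:
    handle2.close()
closed_after_finally = handle2.closed

os.remove(path)
```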
The in operator does what it sounds like: it tells you whether a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
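A short sketch of in and not in on different container types, with illustrative values. The last line shows that on strings, in tests for a substring, which matters when the second file's lines carry extra fields.

```python
words_list = ["lorem", "ipus", "asfd"]
words_set = set(words_list)

in_list = "ipus" in words_list        # scans the list element by element
in_set = "ipus" in words_set          # hash-based lookup
absent = "dolor" not in words_set     # negated membership test

# On strings, `in` checks for a substring rather than an element.
substring = "lorem" in "1231 lorem"
```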
Assuming that you're looking for the entire line in the second file:
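second_file = open('second.txt', "r")
second = second_file.readlines()
second_file.close()

first_file = open('first.txt', "r")
for line in first_file:
    # Exact comparison: case, whitespace, and the trailing newline must all match.
    if line not in second:
        print(line.rstrip("\n"))
first_file.close()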
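Since the question mentions regular expressions and ignoring case, here is a sketch of a regex-based variant that looks for each entry anywhere in the second file's text rather than requiring an identical line. This is a different technique from the whole-line comparison above, and the function name lines_without_match and the sample data are assumptions for illustration.

```python
import re


def lines_without_match(first_lines, second_text):
    """Return stripped entries of first_lines not found anywhere in second_text."""
    missing = []
    for line in first_lines:
        needle = line.strip()
        if not needle:
            continue  # ignore blank lines
        # re.escape treats the entry as literal text, not as a pattern.
        pattern = re.compile(re.escape(needle), re.IGNORECASE)
        if pattern.search(second_text) is None:
            missing.append(needle)
    return missing


first_lines = ["lorem\n", "ipus\n", "asfd\n"]
second_text = "1231 LOREM\n1311 assss 31 1\n"
missing = lines_without_match(first_lines, second_text)
```

Because this is a substring search, a short entry such as "lor" would also count as matching "lorem"; if that is not wanted, the set-of-words approach is probably a better fit.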