
Compare two files and find matching words in Python

I have two files. The first one contains terms and their frequencies:

table 2
apple 4
pencil 89

The second file is a dictionary:

abroad
apple
bread
...

I want to check whether the first file contains any words from the second file. For example, both the first file and the second file contain "apple". I am new to Python. I tried something, but it does not work. Could you help me? Thank you.

for line in dictionary:
    words = line.split()
    print(words[0])

for line2 in test:
    words2 = line2.split()
    print(words2[0])

Something like this:

# file1 is the dictionary file, file2 is the (word, freq) file
with open("file1") as f1, open("file2") as f2:
    # build a set of words from the dictionary file
    words = set(line.strip() for line in f1)

    # why sets? a set gives O(1) average lookup, so the overall cost is O(N)

    # now loop over each line of the other file (word, freq)
    for line in f2:
        word, freq = line.split()   # fetch word and freq
        if word in words:           # print the word if it is in the dictionary set
            print(word)

Output:

apple
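
One caveat with the loop above: `word, freq = line.split()` raises a `ValueError` on a blank or malformed line. The following is a minimal sketch of a more defensive variant, not the only way to do it. The function name `matching_terms` is hypothetical, and `io.StringIO` objects stand in for the two real files so the example runs on its own.

```python
import io

def matching_terms(dictionary_file, freq_file):
    """Return (word, freq) pairs from freq_file whose word is in dictionary_file."""
    # Build the lookup set once, ignoring empty lines
    words = {line.strip() for line in dictionary_file if line.strip()}
    matches = []
    for line in freq_file:
        parts = line.split()
        if len(parts) != 2:      # skip blank or malformed lines
            continue
        word, freq = parts
        if word in words:
            matches.append((word, int(freq)))
    return matches

# In-memory stand-ins for the two files from the question (assumption)
dictionary_file = io.StringIO("abroad\napple\nbread\n")
freq_file = io.StringIO("table 2\napple 4\n\npencil 89\n")

result = matching_terms(dictionary_file, freq_file)
print(result)
```

With real files, the same function could be called with the objects returned by `open(...)` inside a `with` block.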

This may help you:

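# file1.txt holds "word freq" lines, so keep only the first column
file1 = set(line.split()[0] for line in open('file1.txt') if line.strip())
# file2.txt is the dictionary, one word per line
file2 = set(line.strip() for line in open('file2.txt'))

# the intersection contains the words present in both files
for word in file1 & file2:
    if word:
        print(word)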

Here's what you should do:

  • First, put all the dictionary words somewhere you can easily look them up. If you don't, you would have to read the whole dictionary file every time you want to check a single word from the other file.

  • Second, check whether each word in the other file is among the words you extracted from the dictionary file.

For the first part, you can use either a list or a set. The difference between the two is that a list keeps the order in which you added the items. A set is unordered, so it doesn't matter which word you read first from the dictionary file. A set is also faster for looking up an item, because that is what it is designed for.

To see whether an item is in a set, you can write: item in my_set, which evaluates to either True or False.
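
As a small illustration of the membership test described above, here is a sketch that uses the sample words from the question as in-memory data. The variable names are made up for the example, and in practice the words would be read from the two files.

```python
dictionary_words = ["abroad", "apple", "bread"]   # a list keeps this order
word_set = set(dictionary_words)                  # a set does not keep an order

terms = ["table", "apple", "pencil"]              # first column of the first file

# Keep only the terms that are also in the dictionary set
found = [term for term in terms if term in word_set]

print("apple" in word_set)
print("table" in word_set)
print(found)
```

The `in` test works on the list as well, but on a list it scans the items one by one, which is why the set is the better fit for a large dictionary file.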

I put your first, two-column list in try.txt and the single-column list in try_match.txt:

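# collect the first column (the terms) of try.txt
terms = set()
with open('try.txt') as f:
    for line in f:
        parts = line.split()
        if parts:                # skip blank lines
            terms.add(parts[0])

# print each word of try_match.txt that also appears in try.txt
with open('try_match.txt') as f_match:
    for line in f_match:
        parts = line.split()
        if parts and parts[0] in terms:
            print(parts[0])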
