
Find if lines on one file appear as words in the lines of another file in Python

I have two text files. The first has one word per line and about 10K lines. The second, the corpus file, has about 69K lines, and each line is an individual sentence.

File 1 looks like this:

[File 1 image]

And File 2 looks like this:

[File 2 image]

In file 1, each line is a single word. I need to find out whether the words from file 1 appear as whole words within the sentences of file 2 and, if they do, how many of them appear in the corpus file. I tried the following code, but it returns empty lists. Any clue as to why empty lists are returned?

f=open('Corpus_WX.txt',encoding='utf-8')
for count in range(0,68630):
    g=f.readline()
    words=g.split()
    x=open("Processed_data_edit.txt")
    h=x.readline()
    word=h.split()
    x.close()
    z=list(set(words).intersection(word))
    with open("New_Matches.txt", 'a', encoding='utf-8') as file:
            file.write(str(z))
            file.write("\n")
            file.close()
    count=count+1
    

My logic was to find the common elements first and then compare them with file 1 again to get a count of the matches. Is there a better way to do both steps at once?

If you only need to find out how many words in file2 also occur in file1, read both files and take the size of the intersection of the two sets of words.

with open("file1.txt") as f:
    file1_words = f.read().split() # One word per line; split() drops the trailing newlines that readlines() would keep

with open("file2.txt") as f:
    file2_words = f.read().split() # Read everything and split by whitespace

file1_words = set(file1_words)
file2_words = set(file2_words)

common_words = file1_words.intersection(file2_words)
print(f"File1 and File2 have {len(common_words)} words in common")
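
The loop in the question most likely returns empty lists because it reopens Processed_data_edit.txt on every iteration and calls readline() once, so each sentence is only ever compared against the first word of that file. A minimal sketch of a per-sentence version follows; matches_per_sentence is a hypothetical helper name, and the sample lines are made-up stand-ins for the two files.

```python
def matches_per_sentence(vocabulary_lines, corpus_lines):
    """Return, for each corpus sentence, the sorted vocabulary words it contains."""
    # Build the vocabulary once; strip() removes the trailing newline on each line.
    vocabulary = {line.strip() for line in vocabulary_lines if line.strip()}
    results = []
    for sentence in corpus_lines:
        # split() with no arguments splits on any run of whitespace.
        found = vocabulary.intersection(sentence.split())
        results.append(sorted(found))
    return results


# Made-up sample data standing in for the word list and the corpus.
vocabulary_lines = ["cat\n", "dog\n", "fish\n"]
corpus_lines = ["the cat sat\n", "no match here\n", "dog and cat\n"]
print(matches_per_sentence(vocabulary_lines, corpus_lines))
# [['cat'], [], ['cat', 'dog']]
```

With the real files, the two arguments can be the open file objects themselves (for example, both opened with encoding='utf-8'), since iterating over a file yields its lines one at a time.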

If you want to count the occurrences of each word from file1 in file2, you'll need to write some more code.

First, read the second file and count the occurrences of each word. You could use collections.Counter for this, but it's pretty easy to write your own code if you're learning:

with open("file2.txt") as f:
    file2_words = f.read().split() # Read everything, then split by whitespace

file2_wordcount = dict() # Empty dictionary

for word in file2_words:
    old_count = file2_wordcount.get(word, 0) # Get the count from the dict. Or 0 if it doesn't exist
    file2_wordcount[word] = old_count + 1 # Set the new count
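
For comparison, the same counting step can be sketched with collections.Counter, as mentioned above. The helper name count_words and the sample lines are assumptions for illustration; a Counter returns 0 for a word it has never seen, so no default value is needed on lookup.

```python
from collections import Counter


def count_words(lines):
    """Count every whitespace-separated word across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of each word in the list.
        counts.update(line.split())
    return counts


# Made-up sample lines standing in for the second file.
sample_lines = ["the cat sat\n", "the dog sat\n"]
file2_wordcount = count_words(sample_lines)
print(file2_wordcount["the"])   # 2
print(file2_wordcount["bird"])  # 0, missing words count as zero
```

Passing an open file object instead of the sample list counts the words line by line without reading the whole file into memory first.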

At the end of this block, we have a dictionary file2_wordcount which maps each word to its count in the second file. Next, we need to read the words from the first file and find out how many times they occur in the other file.

# Now, read the lines from file 1
with open("file1.txt") as f:
    file1_words = f.read().split() # One word per line; split() drops the trailing newlines that readlines() would keep

# Convert it into a set to remove duplicates
file1_words = set(file1_words)

for word in file1_words:
    count = file2_wordcount.get(word, 0) # Get the count from the dict. Or 0 if it doesn't exist
    print(word, count) # Print them both

Or, to get the total count, use the sum() function:

total_common_count = sum(file2_wordcount.get(word, 0) for word in file1_words)
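
To make the difference between the two results concrete, here is a small worked example on made-up in-memory data rather than the real files: the number of distinct shared words versus the total number of occurrences.

```python
from collections import Counter

# Made-up stand-ins: file 1 holds one word per line, file 2 holds sentences.
file1_words = {"cat", "dog", "fish"}
file2_text = "the cat saw the dog\nthe cat ran\n"

# Word -> number of occurrences in file 2.
file2_wordcount = Counter(file2_text.split())

# Distinct words that appear in both files.
common_words = file1_words.intersection(file2_wordcount)

# Total occurrences in file 2 of any word from file 1.
total_common_count = sum(file2_wordcount.get(word, 0) for word in file1_words)

print(len(common_words))   # 2  (cat, dog)
print(total_common_count)  # 3  (cat twice, dog once)
```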

