
How to compare contents of two large text files in Python?

Datasets: Two large text files, one for training and one for testing, in which all words are already tokenized. A sample of the data looks like this: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place. "

Question: How can I replace every word in the test data not seen in training with the word "unk" in Python?

So far, I have built a dictionary with the following code to count the frequency of each word in the file:

# open the text file and assign it to a variable named "readfile"
readfile= open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-train.txt','r')

writefile=open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-trainReplaced.txt','w')

# Create an empty dictionary 
d = dict()

# Loop through each line of the file
for line in readfile:

    # Split the line into words 
    words = line.split(" ") 

    # Iterate over each word in line 
    for word in words: 
        # Check if the word is already in dictionary 
        if word in d:

        # Increment count of word by 1 
            d[word] = d[word] + 1
        else: 
            # Add the word to dictionary with count 1 
            d[word] = 1

#replace all words occurring in the training data once with the token<unk>.

for key in list(d.keys()): 
    line= d[key] 
    if (line==1):
        line="<unk>"
        writefile.write(str(d))
    else:
        writefile.write(str(d))

#close the file that we have created and we wrote the new data in that
writefile.close()

Honestly, the above code doesn't work with writefile.write(str(d)), which I want to use to write the result to the new text file. With print(key, ":", line) it works and shows the frequency of each word, but only in the console, and no new file is created. If you also know the reason for this, please let me know.

First off, your task is to replace the words in the test file that are not seen in the train file. Your code never mentions the test file. You have to:

  • Read the train file, gather what words are there. This is mostly okay; but you need to .strip() your line or the last word in each line will end with a newline. Also, it would make more sense to use set instead of dict if you don't need to know the count (and you don't, you just want to know if it's there or not). Sets are cool because you don't have to care if an element is in already or not; you just toss it in. If you absolutely need to know the count, using collections.Counter is easier than doing it yourself.

  • Read the test file, and write to replacement file, as you are replacing the words in each line. Something like:

    with open("test", "rt") as reader:
        with open("replacement", "wt") as writer:
            for line in reader:
                writer.write(replaced_line(line.strip()) + "\n")

  • Make sense, which your last block does not :P Instead of checking whether each word from the test file has been seen and replacing the unseen ones, you are iterating over the words you have seen in the train file, and writing <unk> if you've seen them exactly once. This does something, but nothing close to what it should.

    Instead, split the line you got from the test file and iterate over its words; if a word is not in the seen set ( word not in seen , literally), replace it with <unk>; either way, add the result to the output sentence. You can do it in a loop, but here's a comprehension that does it:

     new_line = ' '.join(word if word in seen else '<unk>' for word in line.split(' '))
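Putting the steps above together, here is a minimal sketch of the whole procedure. The function names (build_vocabulary, replace_unseen, replace_file) and the UTF-8 encoding are assumptions made for illustration and are not part of the original post. It also uses line.split() with no argument, which splits on any run of whitespace and so makes the explicit .strip() unnecessary; that is a small deviation from the split(' ') shown above.

```python
UNK = "<unk>"  # placeholder token for words never seen in training


def build_vocabulary(train_path):
    """Return the set of all whitespace-separated tokens in the train file."""
    seen = set()
    with open(train_path, "rt", encoding="utf-8") as reader:
        for line in reader:
            # split() with no argument also drops the trailing newline
            seen.update(line.split())
    return seen


def replace_unseen(line, seen, unk=UNK):
    """Keep words that are in `seen`; replace every other word with `unk`."""
    return " ".join(word if word in seen else unk for word in line.split())


def replace_file(test_path, out_path, seen):
    """Write a copy of the test file with unseen words replaced, line by line."""
    with open(test_path, "rt", encoding="utf-8") as reader, \
            open(out_path, "wt", encoding="utf-8") as writer:
        for line in reader:
            writer.write(replace_unseen(line, seen) + "\n")
```

A call would look like seen = build_vocabulary("brown-train.txt") followed by replace_file("brown-test.txt", "brown-test-replaced.txt", seen); the two test file names are hypothetical. Both files are read one line at a time, so only the vocabulary set has to fit in memory, which matters for large inputs.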

