How to compare two large text files in Python?

Datasets: I have two text datasets (large text files for training and testing, each containing 30,000 sentences). A sample of the data looks like this: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place. "

Question: How can I replace every word in the test data that does not appear in the training data with the word "unk" in Python?

My solution: Should I use nested for-loops to compare every word of the training data with every word of the test data, together with an if-statement that replaces any test word not found in the training data with "unk"?

# open the training file and assign it to the variable "readfile1"
readfile1= open('train.txt','r')
# create a new empty text file and assign it to the variable "writefile";
# this file is now ready to be written to
writefile=open('test.txt','w')
for word1 in readfile1:
    for word2 in readfile2:
        if (word1!=word2):
            word2='unk'
writefile.close()

Please try the following:

  1. Convert your training set into a dict with the word as the key and its count as the value, e.g.:
    {"Hello":1,
    "World":2}
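  2. For every word in the test set, try to look the word up in the dict; if it is not there, replace it with 'unk'.
     def fun(testset, word_dict):
         # testset is a list of words; word_dict is the dict built from the training set
         newtestset = []
         for word in testset:
             try:
                 word_dict[word]            # succeeds only if the word was seen in training
                 newtestset.append(word)
             except KeyError:
                 newtestset.append('unk')   # word not seen in training
         return newtestset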
  3. To generate the dict from the full contents of the text file:
def freq(text):

    out_dict = {}
    # break the string into a list of words
    word_list = text.split()

    # count each word in a single pass over the list
    for word in word_list:
        out_dict[word] = out_dict.get(word, 0) + 1
    return out_dict
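
Putting the steps above together, here is a minimal end-to-end sketch rather than a definitive implementation. It assumes words are separated by whitespace and that both files are UTF-8 encoded. The function names `build_vocab` and `replace_unseen` and the `out_path` argument are hypothetical and do not come from the question. Because only membership matters here (the counts are never used), the sketch swaps the dict for a plain `set`. Lookups in a set take constant time on average, so no nested loop over the training data is needed.

```python
def build_vocab(train_path):
    """Collect every distinct whitespace-separated word in the training file."""
    vocab = set()
    with open(train_path, "r", encoding="utf-8") as train_file:
        # Iterating over a file object yields lines, not words,
        # so each line has to be split before it is added.
        for line in train_file:
            vocab.update(line.split())
    return vocab


def replace_unseen(train_path, test_path, out_path, unk="unk"):
    """Write a copy of the test file in which unseen words become `unk`."""
    vocab = build_vocab(train_path)
    with open(test_path, "r", encoding="utf-8") as test_file, \
         open(out_path, "w", encoding="utf-8") as out_file:
        for line in test_file:
            # Keep a word if it was seen in training, otherwise substitute `unk`.
            words = [w if w in vocab else unk for w in line.split()]
            out_file.write(" ".join(words) + "\n")
```

For example, `replace_unseen('train.txt', 'test.txt', 'test_unk.txt')` would leave `test.txt` untouched and write the result to a separate file (the name `test_unk.txt` is just a placeholder). Writing to a new file avoids the problem in the question's code, where opening `test.txt` in `'w'` mode empties it before it can be read.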
