How to compare two large text files in Python?

Question

Datasets: I have two text datasets (large text files for training and testing, each containing 30,000 sentences). Part of the data looks like this: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place. "

Question: How can I replace every word in the test data that was not seen in training with the word "unk" in Python?

My solution: Should I use nested for-loops to compare every word of the training data with every word of the test data, plus an if-statement that replaces any test word not found in the training data with "unk"?

# open the training text file and assign it to the variable "readfile1"
readfile1 = open('train.txt', 'r')
# create a new empty text file and assign it to the variable "writefile";
# this file is now ready to be written to
writefile = open('test.txt', 'w')
for word1 in readfile1:
    for word2 in readfile2:
        if (word1!=word2):
            word2='unk'
writefile.close()

Answer 1

Please try the following:

Convert your training set into a dict with the word as the key and its count as the value. E.g.:

    {"Hello":1,
    "World":2}

For every word in the test set, try to look the word up in the dict; if it is not there, replace it with 'unk'.

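    def fun(testset, word_dict):
        # testset is the test text as one string; word_dict is the
        # dict built from the training set
        newwords = []
        for word in testset.split():
            # keep words seen in training, replace the rest with 'unk'
            if word in word_dict:
                newwords.append(word)
            else:
                newwords.append('unk')
        return ' '.join(newwords)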

To generate the dict for the whole text file:

from collections import Counter

def freq(text):
    # break the string into a list of words
    words = text.split()

    # count every unique word in a single pass; calling list.count()
    # once per unique word would rescan the whole list each time
    return dict(Counter(words))
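
Putting the two steps together, here is a minimal end-to-end sketch. It assumes words are separated by whitespace, as in the sample sentence in the question. It uses a set rather than a dict, since only membership matters for the replacement and the counts are not needed. The function names and the file paths passed to `unk_file` are hypothetical, not from the question.

```python
def build_vocab(lines):
    """Collect every whitespace-separated token seen in the training lines."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def replace_unseen(lines, vocab, unk="unk"):
    """Yield each line with tokens missing from vocab replaced by unk.

    Note: tokens are re-joined with single spaces, so the original
    spacing inside a line is not preserved.
    """
    for line in lines:
        yield " ".join(tok if tok in vocab else unk for tok in line.split())


def unk_file(train_path, test_path, out_path):
    """Read both files line by line and write the result to a new file."""
    with open(train_path, encoding="utf-8") as train:
        vocab = build_vocab(train)
    with open(test_path, encoding="utf-8") as test, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in replace_unseen(test, vocab):
            out.write(line + "\n")
```

A call such as `unk_file('train.txt', 'test.txt', 'test_unk.txt')` writes the result to a separate file, so the original test data is left untouched. Each set lookup takes roughly constant time, so the work grows with the total number of words in the two files rather than with their product, as it would with nested loops.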
