Datasets: I have two text datasets, one large text file for training and one for testing, each containing 30,000 sentences. A sample of the data looks like this: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place. "
Question: How can I replace every word in the test data that does not appear in the training data with the word "unk" in Python?
My solution: Should I use nested for-loops to compare every word of the training data with every word of the test data, together with an if-statement that replaces any test word not found in the training data with "unk"?
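# Open the training file and read all of its words into a list.
with open('train.txt', 'r') as readfile1:
    train_words = readfile1.read().split()
# Open the test file for reading and create a new, empty output file assigned
# to the variable "writefile" (the name 'test_unk.txt' is only a placeholder).
with open('test.txt', 'r') as readfile2, open('test_unk.txt', 'w') as writefile:
    for line in readfile2:
        new_words = []
        for word2 in line.split():
            # Inner loop: compare this test word with every training word.
            replacement = 'unk'
            for word1 in train_words:
                if word1 == word2:
                    replacement = word2
                    break
            new_words.append(replacement)
        writefile.write(' '.join(new_words) + '\n')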
Please try the following:
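# Example dictionary mapping each training word to its count.
word_dict = {"Hello": 1,
             "World": 2}

def fun(testset):
    # Replace every word of testset that is not a key of word_dict with 'unk'.
    newtestset = []
    for word in testset.split():
        try:
            count = word_dict[word]  # raises KeyError for an unseen word
            newtestset.append(word)
        except KeyError:
            newtestset.append('unk')
    return ' '.join(newtestset)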
def freq(text):
    out_dict = {}
    # Break the string into a list of words.
    str_list = text.split()
    # Count each word in a single pass over the list.
    for word in str_list:
        out_dict[word] = out_dict.get(word, 0) + 1
    return out_dict
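If the word counts are not needed, a set of the training words is enough for the lookup. The following is a minimal sketch of that idea, not the answer's exact method. The helper names build_vocab and replace_unseen are hypothetical, and the two short strings stand in for the contents of the real train and test files.

```python
def build_vocab(train_text):
    # Set of unique training words; membership tests are O(1) on average,
    # so no nested loop over the training data is needed.
    return set(train_text.split())

def replace_unseen(test_text, vocab, unk='unk'):
    # Work line by line so that sentence boundaries are preserved.
    out_lines = []
    for line in test_text.splitlines():
        words = [w if w in vocab else unk for w in line.split()]
        out_lines.append(' '.join(words))
    return '\n'.join(out_lines)

# Tiny stand-ins for the real files (assumed data, for illustration only).
train_text = "the jury said friday\nno evidence"
test_text = "the jury found no evidence\nsaid monday"

vocab = build_vocab(train_text)
result = replace_unseen(test_text, vocab)
print(result)
```

With the real data, train_text and test_text would come from reading 'train.txt' and 'test.txt', and result would be written to a new output file so that the original test file is not overwritten.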