How do I compare two large text files in Python?

Question

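Dataset: I have two different text datasets (large text files for training and testing, each containing 30,000 sentences). A sample of the data: "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced 'no evidence' that any irregularities took place."

Problem: How do I replace every word in the test data that was not seen in the training data with the word "unk" in Python?

My solution: Should I use nested for loops to compare every word of the training data with every word of the test data, with an if statement that replaces any test word not found in the training data with "unk"?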

# open the text file and assign it to a variable named "readfile1"
readfile1 = open('train.txt', 'r')
# create a new empty text file and assign it to a variable named
# "writefile"; this file is now ready to be written to
writefile = open('test.txt', 'w')
for word1 in readfile1:
    for word2 in readfile2:
        if (word1 != word2):
            word2 = 'unk'
writefile.close()

Answer 1

Try the following approach:

Convert your training set into a dictionary with words as keys and counts as values. For example:

    {"Hello":1,
    "World":2}

For each word in the test set, try to look it up in the dict; if it is not there, replace it with "unk".

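    def fun(testset):
        # testset: a string of whitespace-separated words
        # word_dict: the dictionary built from the training set
        new_words = []
        for word in testset.split():
            try:
                word_dict[word]  # raises KeyError for unseen words
                new_words.append(word)
            except KeyError:
                new_words.append('unk')
        return ' '.join(new_words)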
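
As a rough sketch of how the lookup above might be applied to whole files, the following reads the training file once to collect its vocabulary and then rewrites the test file line by line. The function names `build_vocab` and `replace_unseen` and the output file name are assumptions made for illustration, not part of the original answer. A `set` is used in place of the dict because only membership matters here, and tokenization is assumed to be plain whitespace splitting.

```python
def build_vocab(train_path):
    """Collect every whitespace-separated token in the training file."""
    vocab = set()
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            vocab.update(line.split())
    return vocab


def replace_unseen(test_path, out_path, vocab, unk="unk"):
    """Write a copy of the test file with unseen tokens replaced by `unk`."""
    with open(test_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # keep tokens seen in training, replace everything else
            tokens = [tok if tok in vocab else unk for tok in line.split()]
            dst.write(" ".join(tokens) + "\n")
```

A call might look like `replace_unseen("test.txt", "test_unk.txt", build_vocab("train.txt"))`, where `test_unk.txt` is a hypothetical output name. Each set lookup takes constant time on average, so this makes one pass over each file instead of comparing every training word with every test word in nested loops.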

To generate the dictionary from the text of a txt file:

from collections import Counter

def freq(text):
    # break the string into a list of words
    words = text.split()

    # count each unique word in a single pass
    # (calling list.count() per word rescans the whole list every time)
    return dict(Counter(words))
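
For files too large to read comfortably as a single string, the same counts could be accumulated one line at a time. This is only a sketch under that assumption; `freq_from_file` is a hypothetical helper name and does not appear in the original answer.

```python
from collections import Counter


def freq_from_file(path):
    """Count whitespace-separated tokens in a file, one line at a time."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return dict(counts)
```

The result of `freq_from_file("train.txt")` could then serve as the `word_dict` that `fun` looks words up in.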

Solution 1
0 2019-10-01 15:07:27
