Extract lines from a text file using words from another text file as keys

Question

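I am working on NLP and need to do some preprocessing on my data. I have two input files and have to produce an output file that is the intersection of the two, with the first file acting as the keys.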

File 1 contains a list of words:

auxiliary
from
the
poetry
to

File 2:

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.0990 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86690.2 0.276 0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.3826 0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

and 0.26818 0.14346 -0.27877 0.016257 0.11384 0.69923 -0.51332 -0.47368 -0.33075 -0.13834 0.2702 0.30938 -0.45012 -0.4127 -0.09932 0.038085 0.029749 0.10076 -0.25058 -0.51818 0.34558 0.44922 0.48791 -0.080866 -0.10121 -1.3777 -0.10866 -0.23201 0.01 -0.52244 0.3302 0.33707 -0.35601 0.32431 0.12041 0.3512 -0.069043 0.36885 0.25168 -0.24517 0.25381 0.1367 -0.31178 -0.6321 -0.25028 -0.38097

The output I want in the new file (file 3) should be:

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.0990 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.3826 0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

The following code runs without errors, but the output file I get is empty:

f1 = open('input_key.txt', 'r')
f2 = open('input_file.txt', 'r')
f3 = open('output_file.txt', 'w')

for word in f1.readlines():
    for line in f2.readlines():
        if word is line.strip().split()[0]:     
            f3.write(line)

f1.close()
f2.close()
f3.close()

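I cannot figure out what is going wrong here. Any help is appreciated. There are no extra blank lines in file 2 and file 3; I only added them to make the question easier to read.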

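Update
Thanks to the comments, I now know that the if statement evaluates to false. Is there a way to overcome this, or an alternative way to accomplish my task?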

Answer 1

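The keyword is is the identity operator: it checks whether two operands are the very same object.

== is the equality operator: it checks whether two operands have the same value.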

if word is line.strip().split()[0]:

Change it to

if word == line.strip().split()[0]:
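
A minimal sketch of the difference, using made-up values (the names raw_key, key, line and token are illustrative and not from the question's files). It also shows that a word read with readlines() still carries its trailing newline, so it has to be stripped before == can match. What is returns for two equal strings is an implementation detail; in CPython it is typically False for strings built at runtime.

```python
# Illustrative values only; not taken from the question's files.
raw_key = "the\n"                 # a line as returned by readlines()
line = "the 0.418 0.24968\n"

key = raw_key.strip()             # drop the trailing newline
token = line.strip().split()[0]   # first whitespace-separated field

equal_without_strip = (raw_key == token)  # False: "the\n" != "the"
equal_with_strip = (key == token)         # True: same characters
same_object = (key is token)              # identity check, not a value check

print(equal_without_strip, equal_with_strip)
```

Only the == comparison on the stripped key is a reliable way to match a word against the first field of a line.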

Answer 2

This does what you want:

f1 = open('input_key.txt', 'r')
f2 = open('input_file.txt', 'r')
f3 = open('output_file.txt', 'a')

for word in f1.readlines():
    for line in f2.readlines():
        if line != '\n' and word.strip() == line.strip().split()[0]:
            f3.write(line)
    f2.seek(0)

f1.close()
f2.close()
f3.close()

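readlines() leaves the file cursor at the end of f2, so you need to reset the position with f2.seek(0) at the end of each iteration of the outer loop.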
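
A small sketch of why the rewind matters. It uses io.StringIO as a stand-in for the opened input file, which is an assumption made purely so the example is self-contained; the sample lines are invented.

```python
import io

# io.StringIO stands in for the opened input file (illustrative data).
f2 = io.StringIO("the 0.1 0.2\nto 0.3 0.4\n")

first_pass = f2.readlines()   # reads everything; the cursor is now at the end
second_pass = f2.readlines()  # nothing left to read, so this is empty
f2.seek(0)                    # rewind to the start of the file
third_pass = f2.readlines()   # the full content again

print(len(first_pass), len(second_pass), len(third_pass))
```

Without the seek(0), every key after the first one is compared against an empty list of lines.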

I also open output_file.txt in a (append) mode; you can delete output_file.txt at the start of the script so that it is cleared on every run:

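import os
# Remove the old output only if it exists, so the first run does not fail
if os.path.exists("output_file.txt"):
    os.remove("output_file.txt")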

I would also use == instead of is; is tests whether two objects are the same object, not whether they are equal.

Edit: see wiesion's answer below, which uses list comprehensions, for some tips on writing more concise code.

Answer 3

I copied your files and wrote the code the way I would do it:

with open("words.txt", "r") as word_file:
    words = [word.strip() for word in word_file.read().splitlines() if word.strip()]

with open("feed.txt", "r") as feed_file:
    lines = [line.strip() for line in feed_file.read().splitlines() if line.strip()]

with open('result.txt', 'w') as result_file:
    result_file.write("\n".join([line for line in lines if line.split()[0] in words]))

Of course, I use a lot of list comprehensions here to avoid all the nested loops.

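If your word and input files are large, you should avoid reading the whole file into memory in the comprehensions (thanks to @Bayko for the reminder) and switch to: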

words = []
with open("words.txt", "r") as word_file:
    # This reads the words file line by line instead of reading the entire file
    for word in word_file:
        word = word.strip()
        if word:
            words.append(word)

with open('result.txt', 'w') as result_file:
    with open("feed.txt", "r") as feed_file:
        # This reads the input file line by line instead of reading the entire file
        for line in feed_file:
            line = line.strip()
            if not line:
                continue
            if line.split()[0] in words:
                result_file.write(line + "\n")
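
As a possible refinement that is not part of the original answer, the keys can be kept in a set instead of a list, so each membership test does not scan the whole word list. The sketch below assumes a hypothetical helper named filter_lines and uses small in-memory lists in place of the two files; passing open file objects instead would read them line by line in the same way.

```python
def filter_lines(key_lines, feed_lines):
    """Yield the feed lines whose first field is one of the keys.

    Hypothetical helper for illustration; both arguments can be any
    iterable of lines, including open file objects.
    """
    # A set gives average constant-time membership tests.
    keys = {k.strip() for k in key_lines if k.strip()}
    for line in feed_lines:
        fields = line.split()
        # Blank lines produce an empty list and are skipped.
        if fields and fields[0] in keys:
            yield line.strip()


# Small in-memory stand-ins for the two files (invented data).
sample_keys = ["from\n", "the\n", "\n", "to\n"]
sample_feed = [
    "the 0.418 0.24968\n",
    "\n",
    "and 0.26818 0.14346\n",
    "to 0.68047 -0.039263\n",
]
matches = list(filter_lines(sample_keys, sample_feed))
print("\n".join(matches))
```

The output keeps the order of the feed file, which matches the behaviour of the line-by-line version above.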

Also, when I ran your code locally:

    if word is line.strip().split()[0]:
IndexError: list index out of range

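This error is caused by the empty lines. More importantly, you get stuck on f2.readlines(): you never call f2.seek(0) to reset the position, and == is not the same as is (see @Atterson's answer). With these issues fixed, your code looks like: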

f1 = open('words.txt', 'r')
f2 = open('feed.txt', 'r')
f3 = open('result.txt', 'w')

for word in f1.readlines():
    word = word.strip()
    if not word:
        continue
    for line in f2.readlines():
        line = line.strip()
        if not line:
            continue
        if word == line.split()[0]:
            f3.write(line + "\n")
    f2.seek(0)

f1.close()
f2.close()
f3.close()

With both scripts, my result.txt looks like

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.0990 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.3826 0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

Answer 4

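Your second file contains three blank lines. After splitting a line, you need to check that the resulting list contains any elements before indexing it. That solves the problem. Alternatively, catch the IndexError and you should be fine.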
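
Both suggestions can be sketched as follows. The function names first_field_checked and first_field_caught are hypothetical, chosen only for this illustration, and the sample lines are invented.

```python
def first_field_checked(line):
    """Return the first field, or None for a blank line (explicit check)."""
    fields = line.split()
    return fields[0] if fields else None


def first_field_caught(line):
    """Same result, obtained by catching the IndexError instead."""
    try:
        return line.split()[0]
    except IndexError:
        # "\n".split() is an empty list, so indexing it raises IndexError.
        return None


for sample in ["to 0.68047 -0.039263\n", "\n"]:
    print(repr(first_field_checked(sample)), repr(first_field_caught(sample)))
```

Either variant lets the loop skip blank lines instead of stopping with an exception.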
