How do I check whether a txt file contains keywords from another txt file?

Question

I have an input txt file, for example:

The quick brown fox jumps over the lazy dog
The quick brown fox
A beautiful dog

I have the keywords saved in a txt file, for example:

fox dog ...

I want to check whether each line of the input file contains any of these keywords. I know how to check for the keywords one by one:

with open("input.txt") as f:
    a_file = f.read().splitlines()
b_file = []
for line in a_file:

    if "dog" in line:
        b_file.append("dog")
    elif "fox" in line:
        b_file.append("fox")
    else:
        b_file.append("Not found")
with open('output.txt', 'w') as f:
    f.write('\n'.join(b_file) + '\n')

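But how do I check for keywords that are stored in another file? PS: I need to check specific lines rather than the whole content of the file; for example, the result should be: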

fox dog
fox
dog

Answer 1

You should load both files: one holds the keywords to look for, the other the content to search. For example, given files named keywords.txt and content.txt, open both:

with open("keywords.txt") as f1, open("content.txt") as f2:
    keywords = f1.read()
    content = f2.read()
# keywords: fox dog
# content: The quick brown fox jumps over the lazy dog\nThe quick brown fox\nA beautiful dog

If you only want to check whether the content contains any of the keywords, you can do this:

keywords = [line.split() for line in keywords.split("\n")]
keywords = sum(keywords, [])
# keywords: ['fox', 'dog']

content = [line.split() for line in content.split("\n")]
content = sum(content, [])
# content: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'The', 'quick', 'brown', 'fox', 'A', 'beautiful', 'dog']

# check intersection of 2 sets, if there is some words overlap
# ==> keywords appear in the content
if set(keywords)&set(content):
    print(True)
else:
    print(False)
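
The check above only says whether any keyword occurs anywhere in the content. For the per-line result the question asks for, a minimal sketch might look like the following. The helper name keywords_per_line is hypothetical, the sample lines are hard-coded from the question rather than read from files, and matching is assumed to be on whitespace-separated words (case-sensitive, no punctuation handling).

```python
def keywords_per_line(lines, keywords):
    """For each line, return the keywords it contains, or 'Not found'."""
    results = []
    for line in lines:
        words = set(line.split())
        # keep the order in which the keywords were listed
        found = [k for k in keywords if k in words]
        results.append(" ".join(found) if found else "Not found")
    return results


keywords = "fox dog".split()
lines = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox",
    "A beautiful dog",
]
print("\n".join(keywords_per_line(lines, keywords)))
```

To work with the real files, lines could come from f.read().splitlines() and keywords from f.read().split(), as in the snippets above.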

Answer 2

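Although you changed some of the requirements, it seems you want to:

Read a list of keywords from a file, where the keywords are on one line, separated by spaces
Find the lines in a text document that contain any of those keywords, and output the line number (index) of each such line together with the exact keywords it contains

This script does that: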

with open('keywords.txt') as f:
    keywords = f.read().split()

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [k for k in keywords if k in line]:
            o.write(f'{n+1}: {matches}\n')

With a keywords.txt like:

fox dog

and a document.txt like this:

the quick brown fox
jumped over the lazy dog
on a beautiful dog day afternoon, you foxy dog
there is nothing on FOX
and sometimes you're in a foxhole with a dog

it writes the following to output.txt:

1: ['fox']
2: ['dog']
3: ['fox', 'dog']
5: ['fox', 'dog']

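If you don't want partial matches (such as foxhole), you care about the order in which the words are found, you may also want to know about duplicates, and you want matching to ignore case: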

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [w for w in line.split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')

Finally, perhaps your document.txt has punctuation on line 6:

I watch "FOX", but although I search doggedly, I can't find a thing, you foxy dog!

Then this script:

import re
import string

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        # re.escape keeps characters such as ']' and '\' literal inside the character class
        if matches := [w for w in re.sub('[' + re.escape(string.punctuation) + ']', '', line).split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')

writes this to output.txt:

1: ['fox']
2: ['dog']
3: ['dog', 'dog']
4: ['FOX']
5: ['dog']
6: ['FOX', 'dog']
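
To try the punctuation-stripping variant without creating any files, the same logic can be wrapped in a function and fed a list of strings. This is only a sketch: the names find_matches and PUNCTUATION are hypothetical, and a set is used for the keyword lookup as a small convenience.

```python
import re
import string

# character class matching any single ASCII punctuation character
PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def find_matches(lines, keywords):
    """Return (line number, matched words) for every line with a match."""
    wanted = {k.lower() for k in keywords}
    results = []
    for n, line in enumerate(lines, start=1):
        # strip punctuation, then compare whole words case-insensitively
        words = PUNCTUATION.sub("", line).split()
        matches = [w for w in words if w.lower() in wanted]
        if matches:
            results.append((n, matches))
    return results


document = [
    "the quick brown fox",
    "there is nothing on FOX",
    'I watch "FOX", but although I search doggedly, you foxy dog!',
]
for n, matches in find_matches(document, ["fox", "dog"]):
    print(f"{n}: {matches}")
```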

Answer 3

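For anyone who is new to Python, I'd like to extend Grismar's multi-variant answer with two goals:

Explain the language constructs used
Extract all the matching variants into functions and an enum

1. Language constructs

[expr for var in generator] is a list comprehension, used to build a list
i, var in enumerate(list) uses enumerate to have both an index and an iteration variable inside the loop
var := expr is the walrus operator (assignment expression) introduced in Python 3.8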
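
As a quick illustration of the three constructs listed above, here is a small sketch on a made-up word list (the variable names are chosen only for this example):

```python
words = ["fox", "dog", "foxhole"]

# list comprehension: build a new list from an iterable
lengths = [len(w) for w in words]

# enumerate: pairs each item with its index
indexed = [(i, w) for i, w in enumerate(words)]

# walrus operator (Python 3.8+): assign and test in a single expression
if long_words := [w for w in words if len(w) > 3]:
    print(long_words)
```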

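2. Extracting the matching variants

An Enum (class) defines the 3 suggested matching modes. We can then use this mode for both:

(a) reading the keywords prepared for matching, using the extracted function keywords_from
(b) finding matches for those keywords, using the extracted function match_keywords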

from enum import Enum

class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'

# Usage: keywords = keywords_from('keywords.txt', KeywordMatch.LOWER)
def keywords_from(filename, mode):
    with open(filename) as f:
        if mode == KeywordMatch.LOWER:
            keywords = [k.lower() for k in f.read().split()]
        else:
            keywords = f.read().split()
    return keywords


import re
import string

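# Usage: if match_keywords(line, KeywordMatch.LOWER):
def match_keywords(line, mode):
    # relies on the global `keywords` list built by keywords_from
    if mode == KeywordMatch.LOWER:
        # whole words, compared in lower case
        matches = [w for w in line.split() if w.lower() in keywords]
    elif mode == KeywordMatch.PARTIAL:
        # as above, but with punctuation stripped from the line first
        matches = [w for w in re.sub('[' + re.escape(string.punctuation) + ']', '', line).split() if w.lower() in keywords]
    else:
        # substring match: also finds 'fox' inside 'foxhole'
        matches = [k for k in keywords if k in line]
    return matches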


if __name__ == "__main__":
    mode = KeywordMatch.LOWER

    keywords = keywords_from('keywords.txt', mode)
    
    with open('document.txt') as f, open('output.txt', 'w') as o:
        for n, line in enumerate(f):
            matches = match_keywords(line, mode)
            # can also test or debug-print matches
            if matches:
                o.write(f'{n+1}: {matches}\n')

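Notes:

Despite all the modularization, the keywords list is still a global variable (not that clean)
The walrus operator was removed and matches was split out, so the matches can be tested or debugged before they are written to the file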
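
One possible way around the global keywords list mentioned in the notes is to pass the keywords in as a parameter. The sketch below keeps the three branches of match_keywords as they are and only changes the signature; the extra keywords parameter is an assumption of this sketch, not part of the answer's code.

```python
import re
import string
from enum import Enum


class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'


def match_keywords(line, keywords, mode):
    """Same branches as before, but keywords is an explicit argument."""
    if mode == KeywordMatch.LOWER:
        # whole words, compared in lower case
        return [w for w in line.split() if w.lower() in keywords]
    if mode == KeywordMatch.PARTIAL:
        # strip punctuation first, then compare whole words in lower case
        cleaned = re.sub('[' + re.escape(string.punctuation) + ']', '', line)
        return [w for w in cleaned.split() if w.lower() in keywords]
    # substring match: also finds 'fox' inside 'foxhole'
    return [k for k in keywords if k in line]


print(match_keywords("there is nothing on FOX", ["fox", "dog"], KeywordMatch.LOWER))
```

With this signature the function can be called on literal strings and lists, which makes each mode easy to check in isolation.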

See also:

Real Python: How to Use Generators and yield in Python
Real Python: Python enumerate(): Simplify Loops With Counters
Real Python: Assignment Expressions: The Walrus Operator
":=" syntax and assignment expressions: what and why?
