
How to check whether a txt file contains keywords from another txt file?

I have an input txt file such as:

The quick brown fox jumps over the lazy dog
The quick brown fox
A beautiful dog

I have the keywords saved in a txt file, for example:

fox dog ...

I want to check whether each line of the input file contains these keywords. I know how to check the keywords one by one:

with open("input.txt") as f:
    a_file = f.read().splitlines()
b_file = []
for line in a_file:

    if "dog" in line:
        b_file.append("dog")
    elif "fox" in line:
        b_file.append("fox")
    else:
        b_file.append("Not found")
with open('output.txt', 'w') as f:
    f.write('\n'.join(b_file) + '\n')

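But how do I check for them when they are stored in another file? P.S. I need to check each specific line, not the contents of the file as a whole. For example, the result should be: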

fox dog
fox
dog

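You should load both files: one holds the keywords to look for, the other the content to search. For example, given two files named keywords.txt and content.txt, open them both: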

with open("keywords.txt") as f1, open("content.txt") as f2:
    keywords = f1.read()
    content = f2.read()
# keywords: fox dog
# content: The quick brown fox jumps over the lazy dog\nThe quick brown fox\nA beautiful dog

If you only want to check whether the content contains any of the keywords, do the following:

keywords = [line.split() for line in keywords.split("\n")]
keywords = sum(keywords, [])
# keywords: ['fox', 'dog']

content = [line.split() for line in content.split("\n")]
content = sum(content, [])
# content: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'The', 'quick', 'brown', 'fox', 'A', 'beautiful', 'dog']

# check intersection of 2 sets, if there is some words overlap
# ==> keywords appear in the content
if set(keywords)&set(content):
    print(True)
else:
    print(False)
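
The check above only tells you whether any keyword appears anywhere in the content. For the per-line result the question asks for, something like the following might work. This is a minimal sketch, not part of the original answer: `keywords_per_line` is a hypothetical helper name, the data is held in memory instead of being read from files, and a "match" is assumed to mean a whole whitespace-separated word.

```python
def keywords_per_line(keywords, lines):
    """Return one output string per line: the keywords found, or 'Not found'."""
    results = []
    for line in lines:
        words = set(line.split())  # whole words on this line
        found = [k for k in keywords if k in words]  # keeps keyword order
        results.append(" ".join(found) if found else "Not found")
    return results


# Sample data taken from the question
keywords = "fox dog".split()
lines = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox",
    "A beautiful dog",
]

print("\n".join(keywords_per_line(keywords, lines)))
# fox dog
# fox
# dog
```

Unlike the if/elif chain in the question, this reports every keyword found on a line, not just the first one tested.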

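Although you changed some of the requirements, it seems you want this:

  • Read a list of keywords from a file, where the keywords are on one line, separated by spaces
  • Find the lines in a text document that contain any of these keywords, and output the line number (index) of each such line together with the exact keywords it contains

This script does that: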

with open('keywords.txt') as f:
    keywords = f.read().split()

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [k for k in keywords if k in line]:
            o.write(f'{n+1}: {matches}\n')

With a keywords.txt like:

fox dog

and a document.txt like:

the quick brown fox
jumped over the lazy dog
on a beautiful dog day afternoon, you foxy dog
there is nothing on FOX
and sometimes you're in a foxhole with a dog

it will write this to output.txt:

1: ['fox']
2: ['dog']
3: ['fox', 'dog']
5: ['fox', 'dog']
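
To experiment with this substring behavior without creating any files, the same logic can be wrapped in a small generator. This is only a sketch: `find_matches` is a hypothetical name, and the document is an in-memory list holding the five sample lines.

```python
def find_matches(keywords, lines):
    """Yield (line_number, matches) using plain substring tests."""
    for n, line in enumerate(lines, start=1):
        matches = [k for k in keywords if k in line]
        if matches:
            yield n, matches


document = [
    "the quick brown fox",
    "jumped over the lazy dog",
    "on a beautiful dog day afternoon, you foxy dog",
    "there is nothing on FOX",
    "and sometimes you're in a foxhole with a dog",
]

result = list(find_matches(["fox", "dog"], document))
for n, matches in result:
    print(f"{n}: {matches}")
# 1: ['fox']
# 2: ['dog']
# 3: ['fox', 'dog']
# 5: ['fox', 'dog']
```

Note that `k in line` is a substring test, so "foxy" and "foxhole" count as "fox", while the upper-case "FOX" on line 4 does not match at all.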

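If you don't want partial matches (like foxhole), you care about the order in which the words were found, you may also want to know about duplicates, and you want case not to matter: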

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [w for w in line.split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')
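
As a rough illustration of what this whole-word, case-insensitive variant produces, here is the same logic applied to the five sample lines held in memory. The helper name `whole_word_matches` and the use of a set for the keyword lookup are assumptions of this sketch, not part of the script above.

```python
def whole_word_matches(keywords, lines):
    """Yield (line_number, matches) for case-insensitive whole-word matches."""
    wanted = {k.lower() for k in keywords}  # set lookup, case folded once
    for n, line in enumerate(lines, start=1):
        matches = [w for w in line.split() if w.lower() in wanted]
        if matches:
            yield n, matches


document = [
    "the quick brown fox",
    "jumped over the lazy dog",
    "on a beautiful dog day afternoon, you foxy dog",
    "there is nothing on FOX",
    "and sometimes you're in a foxhole with a dog",
]

result = list(whole_word_matches(["fox", "dog"], document))
for n, matches in result:
    print(f"{n}: {matches}")
# 1: ['fox']
# 2: ['dog']
# 3: ['dog', 'dog']
# 4: ['FOX']
# 5: ['dog']
```

The matches are the words as they appear in the line, in order and with duplicates. A word with punctuation attached, such as "dog!", would not match, because `split()` only separates on whitespace.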

Finally, perhaps your document.txt has punctuation on line 6:

I watch "FOX", but although I search doggedly, I can't find a thing, you foxy dog!

Then this script:

import re
import string

with open('keywords.txt') as f:
    keywords = [k.lower() for k in f.read().split()]

with open('document.txt') as f, open('output.txt', 'w') as o:
    for n, line in enumerate(f):
        if matches := [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]:
            o.write(f'{n+1}: {matches}\n')

writes this to output.txt:

1: ['fox']
2: ['dog']
3: ['dog', 'dog']
4: ['FOX']
5: ['dog']
6: ['FOX', 'dog']
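
An alternative to building a regular expression from string.punctuation is to delete the punctuation with str.translate, which avoids any question of regex escaping. This is a sketch of a swapped-in technique, not what the script above does; the helper name `matches_ignoring_punctuation` is made up for the example.

```python
import string

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def matches_ignoring_punctuation(keywords, line):
    """Return the words of `line` that equal a keyword, ignoring case and punctuation."""
    wanted = {k.lower() for k in keywords}
    return [w for w in line.translate(_STRIP_PUNCT).split() if w.lower() in wanted]


line6 = 'I watch "FOX", but although I search doggedly, I can\'t find a thing, you foxy dog!'
print(matches_ignoring_punctuation(["fox", "dog"], line6))
# ['FOX', 'dog']
```

Deleting punctuation also turns "can't" into "cant", which is harmless here but worth keeping in mind if a keyword itself contains an apostrophe or hyphen.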

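For everyone new to Python, I would like to extend Grismar's multi-variant answer with two goals:

  1. Explain the language constructs used
  2. Extract all matching variants into functions and an enum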

1. Language constructs
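
The scripts in the answer above lean on a handful of constructs: the with statement, enumerate, a list comprehension, the walrus operator (Python 3.8+) and f-strings. The annotated sketch below shows them working together. It is an illustration, not the original explanation, and it uses io.StringIO as a stand-in for a real file so that it runs without touching the disk.

```python
import io

keywords = ["fox", "dog"]
text = "the quick brown fox\nnothing here\na lazy dog\n"
output = []

# `with` closes the file-like object automatically when the block ends.
with io.StringIO(text) as f:
    # Iterating over a file yields one line at a time;
    # enumerate() pairs each line with an index starting at 0.
    for n, line in enumerate(f):
        # The list comprehension collects the keywords contained in the line.
        # The walrus operator := assigns that list to `matches` and lets
        # `if` test it in the same expression; an empty list is falsy.
        if matches := [k for k in keywords if k in line]:
            # An f-string evaluates the expressions inside {}.
            output.append(f"{n+1}: {matches}")

print("\n".join(output))
# 1: ['fox']
# 3: ['dog']
```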

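2. Extracting the matching variants

An Enum (class) defines the 3 suggested matching modes. We can then use this mode for both:

  • (a) reading the keywords prepared for matching, using the extracted function keywords_from
  • (b) finding matches for those keywords, using the extracted function match_keywords

from enum import Enum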

class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'

# Usage: keywords = keywords_from('keywords.txt', KeywordMatch.LOWER)
def keywords_from(filename, mode):
    with open(filename) as f:
        if mode == KeywordMatch.LOWER:
            keywords = [k.lower() for k in f.read().split()]
        else:
            keywords = f.read().split()
    return keywords


import re
import string

# Usage: if match_keywords(line, KeywordMatch.LOWER):
def match_keywords(line, mode):
    if mode == KeywordMatch.LOWER:
        matches = [w for w in line.split() if w.lower() in keywords]
    elif mode == KeywordMatch.PARTIAL:
        matches = [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]
    else:
        matches = [k for k in keywords if k in line]
    return matches


if __name__ == "__main__":
    mode = KeywordMatch.LOWER

    keywords = keywords_from('keywords.txt', mode)
    
    with open('document.txt') as f, open('output.txt', 'w') as o:
        for n, line in enumerate(f):
            matches = match_keywords(line, mode)
            # can also test or debug-print matches
            if matches:
                o.write(f'{n+1}: {matches}\n')

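Notes:

  • Despite all the modularization, the keywords list is still a global variable (not so clean)
  • The walrus operator was removed and matches is assigned on its own line, so the matches can be tested or debugged before they are written to the file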
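
One way to address the first note is to pass the keywords in as an argument. The sketch below is an assumption about how that could look, not part of the original answer: the three-argument signature is hypothetical, the branches mirror match_keywords as written above, and re.escape is added so the punctuation characters are safe inside the character class.

```python
import re
import string
from enum import Enum


class KeywordMatch(Enum):
    EXACT = 'exact'
    LOWER = 'lower'
    PARTIAL = 'partial'


def match_keywords(line, keywords, mode):
    """Same branches as above, but `keywords` is a parameter instead of a global."""
    if mode == KeywordMatch.LOWER:
        return [w for w in line.split() if w.lower() in keywords]
    if mode == KeywordMatch.PARTIAL:
        # re.escape (an addition in this sketch) protects characters like ] and ^
        stripped = re.sub('[' + re.escape(string.punctuation) + ']', '', line)
        return [w for w in stripped.split() if w.lower() in keywords]
    return [k for k in keywords if k in line]


keywords = ['fox', 'dog']
print(match_keywords('there is nothing on FOX', keywords, KeywordMatch.LOWER))
# ['FOX']
print(match_keywords('you foxy dog!', keywords, KeywordMatch.PARTIAL))
# ['dog']
print(match_keywords('in a foxhole with a dog', keywords, KeywordMatch.EXACT))
# ['fox', 'dog']
```

With the keywords passed explicitly, the function can be tested on its own and reused with different keyword lists in the same program.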

