
Find a word within a text file that contains certain characters and is a specific length

I am trying to find words in a text file that are 7 letters long and contain the letters a, b, c, e and r. So far I have this:

import re

file = open("dictionary.txt","r")
text = file.readlines()
file.close()


keyword = re.compile(r'\w{7}')

for line in text:
    result = keyword.search (line)
    if result:
       print (result.group())

Can anyone help me?

You need to match not only the word characters but also the word boundaries:

keyword = re.compile(r'\b\w{7}\b')

The \b anchor matches at the start or end of a word, so the pattern only matches words of exactly 7 characters.
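To see the difference the anchors make, here is a minimal sketch comparing the two patterns on a nine-letter word and a seven-letter word. The names unbounded and bounded and the sample words are only for illustration:

```python
import re

unbounded = re.compile(r'\w{7}')      # the pattern from the question
bounded = re.compile(r'\b\w{7}\b')    # the same pattern with word boundaries

# "celebrate" has nine letters; "bracket" has seven.
print(unbounded.search("celebrate").group())  # matches the first seven characters of the longer word
print(bounded.search("celebrate"))            # None: there is no seven-character word here
print(bounded.search("bracket").group())      # the whole seven-letter word
```

Without the boundaries, any word of seven or more characters yields a match, which is why the original code printed fragments of longer words.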

It would also be more efficient to loop through the file line by line instead of reading it all into memory in one go:

import re

keyword = re.compile(r'\b\w{7}\b')

with open("dictionary.txt","r") as dictionary:    
    for line in dictionary:
        for result in keyword.findall(line):
            print(result)

Using keyword.findall() gives us a list of all matches on the line.
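As a small illustration of how findall() differs from search(), the sketch below runs both on a made-up line holding two seven-letter words. The sample line and the variable names are assumptions for the example:

```python
import re

keyword = re.compile(r'\b\w{7}\b')

# Made-up sample line containing two seven-letter words.
line = "a bracket and a cabinet sit on the shelf\n"

first_only = keyword.search(line).group()   # search() stops at the first match
all_matches = keyword.findall(line)         # findall() returns every match as a list of strings

print(first_only)
print(all_matches)
```

If the dictionary file holds exactly one word per line the two behave the same, so findall() mainly matters when a line can contain several words.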

To check whether a match contains at least one of the required characters, I personally would just use a set intersection test:

import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

with open("dictionary.txt","r") as dictionary:    
    for line in dictionary:
        # Keep only the words that share at least one letter with `required`.
        results = [word for word in keyword.findall(line) if required.intersection(word)]
        for result in results:
            print(result)
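The intersection test accepts a word that contains any one of the letters. If the goal is words containing all five letters, which is how the question can also be read, a subset test is one option. This is a minimal sketch under that assumption; the helper name words_with_all_required and the sample words are hypothetical:

```python
import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

def words_with_all_required(line):
    """Return the 7-character words on `line` that contain every required letter."""
    # issubset() is true only when each of a, b, c, e, r occurs somewhere in the word.
    return [word for word in keyword.findall(line) if required.issubset(word)]

# Made-up sample: only "bracket" is seven letters long and has all five letters.
sample = "bracket cabinet braces celebrate"
print(words_with_all_required(sample))
```

The comparison is case-sensitive, so a dictionary with capitalised entries would need word.lower() before the test.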

\b(?=\w{0,6}?[abcer])\w{7}\b

That's the regular expression you want. It works by taking the basic form for a word of exactly seven letters (\b\w{7}\b) and adding a lookahead: a zero-width assertion that looks forward and tries to find one of your required letters. A breakdown:

\b            A word boundary
(?=           Look ahead and find...
    \w        A word character (A-Za-z0-9_)
    {0,6}     Repeated 0 to 6 times
    ?         Lazily (not necessary, but marginally more efficient).
    [abcer]   Followed by one of a, b, c, e, or r
)             Go back to where we were before (just after the word boundary)
\w            And match a word character
{7}           Exactly seven times.
\b            Then one more word boundary.
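The breakdown above can be tried out with a short sketch. The first pattern is the one from this answer; the second, with one lookahead per letter, is a hypothetical variant for the case where all five letters must be present. The sample words are made up:

```python
import re

# Pattern from the answer: a 7-character word with at least one of a, b, c, e, r.
any_letter = re.compile(r'\b(?=\w{0,6}?[abcer])\w{7}\b')

# Hypothetical variant: one lookahead per letter, so all five must appear in the word.
all_letters = re.compile(r'\b(?=\w*a)(?=\w*b)(?=\w*c)(?=\w*e)(?=\w*r)\w{7}\b')

# "skyping" has seven letters but none of the required ones.
sample = "bracket cabinet skyping"
print(any_letter.findall(sample))
print(all_letters.findall(sample))
```

Each lookahead starts from the same position just after the word boundary, so the order of the letters in the word does not matter.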


 