How to Find Words Not Containing Specific Letters?

I'm trying to write some code that uses a regex on a text file. My file contains these words, one per line:

nana
abab
nanac
eded

My goal is to display only the words that contain none of the letters of a given substring.

For example, if my substring is "bn", my output should be only eded, because nana and nanac contain "n" and abab contains "b".

I have written some code, but it only checks the first letter of my substring:

import re

substring = "bn"
def xstring():
    with open("deneme.txt") as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                for letter in substring:
                    if len(re.findall(letter, word)) == 0:
                        print(word)
                        #yield word
xstring()

How do I solve this problem?

Here, we would just want a simple expression such as:

^[^bn]+$

We put b and n in a negated character class, [^bn], which matches any character other than those two. Adding the ^ and $ anchors then makes every string that contains b or n fail to match.

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"^[^bn]+$"

test_str = ("nana\n"
    "abab\n"
    "nanac\n"
    "eded")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.  
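The same idea can be generalized to any set of letters. The following is only a sketch: the function name lines_without is hypothetical, and it makes two assumptions beyond the answer above. It escapes the letters with re.escape, and it adds \n to the negated class, because [^bn]+ on its own also matches newlines and would merge two adjacent matching lines into a single match.

```python
import re

def lines_without(letters, text):
    """Return the lines of `text` that contain none of `letters`."""
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    # '\n' is excluded as well so that a match never spans two lines.
    pattern = re.compile(r"^[^%s\n]+$" % re.escape(letters), re.MULTILINE)
    return pattern.findall(text)

sample = "nana\nabab\nnanac\neded"
print(lines_without("bn", sample))  # ['eded']
```

With the original ^[^bn]+$ pattern, the text "eded\nfofo" would produce one match covering both lines; the variant above returns them as two separate entries.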

RegEx

If this expression isn't what you want, it can be modified on regex101.com.

RegEx Circuit

jex.im can be used to visualize regular expressions.

@Xosrov has the right approach, with a few minor issues and typos. The following version of the same logic works:

import re

def xstring(substring, words):
    regex = re.compile('[%s]' % ''.join(sorted(set(substring))))
    # Excluding words matching regex.pattern
    for word in words:
        if not re.search(regex, word):
            print(word)

words = [
    'nana',
    'abab',
    'nanac',
    'eded',
]

xstring("bn", words)
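One caveat with building the class by string formatting: if the letters include regex metacharacters, the class changes meaning. For instance, "^b" would become [^b], a negated class. A minimal sketch of a guarded variant follows; the name filter_words and the choice to return a list instead of printing are assumptions made for illustration.

```python
import re

def filter_words(letters, words):
    """Return the words that contain none of the characters in `letters`."""
    if not letters:
        # '[]' is not a valid pattern, and with no letters nothing is excluded.
        return list(words)
    # re.escape keeps characters such as '-', '^' or ']' literal inside the brackets.
    forbidden = re.compile("[%s]" % re.escape("".join(sorted(set(letters)))))
    return [word for word in words if not forbidden.search(word)]

print(filter_words("bn", ["nana", "abab", "nanac", "eded"]))  # ['eded']
print(filter_words("^b", ["abab", "eded", "x^y"]))            # ['eded']
```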

If you want to check whether a string contains any of a set of letters, use brackets.
For example, [bn] will match words that contain at least one of those letters.

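import re

substring = "bn"
# Character class matching any one of the unwanted letters.
regex = re.compile('[' + substring + ']')

def xstring():
    with open("deneme.txt") as f:
        for line in f:
            # Keep the line only if none of the letters occurs in it.
            if regex.search(line) is None:
                print(line.rstrip("\n"))

xstring()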

It might not be the most efficient, but you could try using a set intersection. The following code segment prints the value of the string word only if it contains neither of the letters 'b' and 'n':

if not (set(word) & set('bn')):
    print(word)
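As a rough sketch of how this set-based check could be applied to lines read from a file, the generator below uses set.isdisjoint, which expresses the same test without building the intersection. The name words_disjoint_from and the sample list standing in for the file's lines are assumptions.

```python
def words_disjoint_from(letters, lines):
    """Yield each word from `lines` that shares no character with `letters`."""
    forbidden = set(letters)
    for line in lines:
        # split() also drops the trailing newline of each line.
        for word in line.split():
            if forbidden.isdisjoint(word):
                yield word

# A list of strings stands in for an open file object here.
sample_lines = ["nana\n", "abab\n", "nanac\n", "eded\n"]
print(list(words_disjoint_from("bn", sample_lines)))  # ['eded']
```

Because the function accepts any iterable of lines, an open file could be passed in place of sample_lines.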
