
Regex in Python. NOT matches

I'll get straight to the point: I have a string like this (but with thousands of lines):

Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2

and I need to remove the lines that do not match a-z and ąčęėįšųūž plus _ plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be

[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier

But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried

re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)

but no good.

Thanks in advance, and sorry if my English is not very good.

As for making the matching case insensitive, you can use the re.I (or re.IGNORECASE) flag from the re module, for example when compiling your regex:

regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)

As for removing the lines that do not match this regex, you can simply construct a new string consisting of the lines that do match:

new_s = "\n".join(line for line in s.split("\n") if regex.match(line))
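Putting the two steps together on the sample lines from the question gives roughly the following. This is a minimal sketch, assuming Python 3, where a str pattern is matched as Unicode so that re.I also covers the Lithuanian letters; the sample string is retyped here from the question.

```python
import re

# Sample input from the question, one entry per line.
s = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"

# Whole line must be letters, then "_", then an integer; case insensitive.
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)

# Keep only the lines that match the pattern.
new_s = "\n".join(line for line in s.split("\n") if regex.match(line))

print(new_s)
```

Only the 3rd and 4th sample lines survive, which is what the question asks for.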

First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:

abcdefg_nodigitshere

But you can subfilter that this way:

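import re

mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)

# inputs is the multi-line string to process
for line in inputs.splitlines():
    if myreg.match(line):
        pass  # do x: letters + underscore + integer
    elif mydigre.search(line):  # search, because match() only tries at the start of the line
        pass  # do y: ends with _\d+ but contains other characters before it
    else:
        pass  # line doesn't end with _\d+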
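To make the three branches concrete, here is a small sketch that collects the lines into three lists instead of "do x" / "do y". The helper name classify and the sample input are made up for illustration. Note that the `_\d+$` pattern has to be applied with search, since match only tries at the start of the line.

```python
import re

myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)
mydigre = re.compile(r'_\d+$')

def classify(text):
    """Hypothetical helper: split text into three lists of lines."""
    clean, other_with_digits, no_digits = [], [], []
    for line in text.splitlines():
        if myreg.match(line):
            clean.append(line)              # letters + "_" + integer
        elif mydigre.search(line):
            other_with_digits.append(line)  # ends with _<integer>, other chars before it
        else:
            no_digits.append(line)          # does not end with _<integer>
    return clean, other_with_digits, no_digits

inputs = "Ach-emos_2\nAchėmos_18\nabcdefg_nodigitshere"
clean, other_with_digits, no_digits = classify(inputs)
print(clean, other_with_digits, no_digits)
```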

Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.

all_lines = set(inputs.splitlines())
alpha_lines = {line for line in all_lines if myreg.match(line)}
nonalpha_lines = all_lines - alpha_lines
# search, not match: mydigre is only anchored at the end of the line
nonalpha_digi_lines = {line for line in nonalpha_lines if mydigre.search(line)}

Not sure how Python does modifiers, but to edit in place, use something like this (case insensitive):

Edit: note that some of these characters are non-ASCII (UTF-8 encoded in the source). To use the literal representation, your editor and language must support it; otherwise use the \u.. code in the character class (recommended).

s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;

where the regex is: r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is ''
the modifiers are multiline and global.
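In Python, this substitution could probably be written with re.sub, passing the case-insensitive and multiline modifiers as flags; re.sub replaces every match by default, so no separate global modifier is needed. A sketch, assuming Python 3 and the sample lines from the question (words is the variable name used there, cleaned is made up):

```python
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"

# A whole line (plus its newline, if any) that is NOT letters + "_" + integer.
pattern = r'^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'

# re.M makes ^ and $ work per line; re.I makes the character class case insensitive.
cleaned = re.sub(pattern, '', words, flags=re.I | re.M)

print(cleaned, end='')
```

Each kept line retains its own newline, so the result here ends with a trailing "\n" after the last kept line.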

Breakdown (modifiers are global and multiline):

(?i)                              // case insensitive flag
^                                 // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$))   // look ahead, not this form of a line ?
.*                                // ok then select all except newline or eos
(?:\n|$)                          // select newline or end of string
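The breakdown maps fairly directly onto Python's re.VERBOSE mode, which allows whitespace and # comments inside the pattern. The sketch below also writes the Lithuanian letters as \u escapes (supported in re patterns since Python 3.3), which avoids depending on the source file's encoding. The code points are the precomposed lowercase forms of ąčęėįšųūž, and the name NOT_WORD_LINE is made up for illustration.

```python
import re

NOT_WORD_LINE = re.compile(r'''
    ^                       # start of line (needs re.M)
    (?!                     # look ahead: the line must NOT have this form
        [a-z\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e]+
        _\d+                # underscore plus integer
        (?:\n|$)            # up to the end of the line
    )
    .*                      # then take everything except the newline
    (?:\n|$)                # plus the newline, or the end of the string
''', re.I | re.M | re.X)

# Same idea as the sample data, with the non-ASCII letters escaped as well.
text = "Ach-emos_2\nAch\u0117mos_18\n\u0104\u017euolas_4\nSomtehing else_2"
result = NOT_WORD_LINE.sub('', text)
print(result, end='')
```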
