Regular Expression Matching With Underscores

I'm using Python's re package (yes, I am aware that regular expressions are more general, but who knows, there may be other packages) to read some data that includes inequalities: variable names followed by +, -, >, < or =. (It's a system of inequalities.) I need to pick out the variable names.

Up until now, I used

var_pattern = re.compile(r'[a-z|A-Z]+\d*\.?')

which is somewhat 'hacky' as it isn't very general. I didn't mind, but I came across a problem with weird names, as shown below.

My next go was

var_pattern = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')

which should, after at least one initial letter, match just about everything that occurs except for +, -, >, < and =. This works nicely with variable names like 'x23' or 'C2000001.' but not with 'x_w_3_dummy_1'. I would have thought it was because of the underscore, but it seems to work just fine with the variable 'x_b_1_0_0'.

Does anybody have an idea of what might cause this and, more importantly, how to fix it?

As an aside, I also tried

var_pattern = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

but to no avail either.

Your pattern should work just fine for your example, but here it is corrected a little to actually match your intention:

r'[a-zA-Z][a-zA-Z0-9_]*'

This matches 1 initial letter (lowercase or uppercase), followed by 0 or more letters, digits and underscores. Your version had a redundant +, included | in what was allowed for the first character, and allowed . in the rest of the name.
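To make those two points concrete, here is a small sketch (the input strings are made up for illustration). It checks that a | inside brackets is a literal pipe rather than alternation, and that dropping the + does not change what gets matched:

```python
import re

# '|' inside brackets is a literal pipe character, not alternation.
pipe_in_question_class = re.fullmatch(r'[a-z|A-Z]', '|') is not None
pipe_in_fixed_class = re.fullmatch(r'[a-zA-Z]', '|') is not None

# The '+' is redundant: any further letters are consumed by the second
# character class anyway, so both patterns find the same names.
text = 'abc_1 + Xy2'  # illustrative input, not from the question
with_plus = re.findall(r'[a-zA-Z]+[a-zA-Z0-9_]*', text)
without_plus = re.findall(r'[a-zA-Z][a-zA-Z0-9_]*', text)

print(pipe_in_question_class, pipe_in_fixed_class)
print(with_plus == without_plus, without_plus)
```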

A demonstration that this pattern matches all your samples:

>>> import re
>>> names = ('x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0')
>>> var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')
>>> for name in names:
...     print(var_pattern.search(name).group())
... 
x23
C2000001
x_w_3_dummy_1
x_b_1_0_0

The pattern does not match any +, -, >, < or = characters that might follow the variable name:

>>> var_pattern.findall('x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5')
['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
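As for the negated-class attempt in the question, one likely reason it misbehaves is how the brackets are parsed: only the first ^ negates, the later ^ characters are literals, and ^-^ reads as a one-character range, so - itself is never excluded. The sketch below compares it with one possible rewrite that puts - last so it is taken literally; the rewrite is an illustration, not something proposed in the thread.

```python
import re

# Third attempt from the question. After the leading '^' (negation) the class
# holds '+', the range '^-^', '>', '^', '<', '^' and '=', so '-' is allowed.
negated_attempt = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

# Hypothetical rewrite: a single '^' and '-' at the end of the class.
# Note that a negated class like this still matches spaces and newlines.
negated_rewrite = re.compile(r'[a-zA-Z][^+<>=-]*')

attempt_minus = negated_attempt.findall('x_b_1_0_0-5')
rewrite_minus = negated_rewrite.findall('x_b_1_0_0-5')
attempt_plus = negated_attempt.findall('x_w_3_dummy_1+15')

print(attempt_minus)
print(rewrite_minus)
print(attempt_plus)
```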

It should be:

[a-zA-Z_][a-zA-Z0-9_.]*
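A short sketch of what this broader pattern accepts: unlike the letters-digits-underscores version, it also allows a leading underscore and literal dots inside or at the end of a name. The sample lines are assumptions for illustration, and `_slack.1` is a made-up name.

```python
import re

# Pattern from this answer: leading underscore and literal dots are allowed.
broader = re.compile(r'[a-zA-Z_][a-zA-Z0-9_.]*')

# Illustrative input; '_slack.1' is a hypothetical variable name.
samples = '_slack.1>=0\nx_w_3_dummy_1+15\nC2000001.=24'
names = broader.findall(samples)

print(names)
```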

Your question has already been answered, apart from why your original expression didn't work with your underscores. If you have the pattern

r'[a-zA-Z][a-zA-Z0-9_.]*'

then the dot inside the character class is a literal '.', so this is not equivalent to

r'[a-zA-Z].*'

Contrary to what you thought, the bracketed pattern does match both your "x_w_3_dummy_1" and your "x_b_1_0_0", and it stops at the delimiter because +, -, >, < and = are not in the class. Only the unbracketed .* version would also match your subsequent delimiter, as well as anything after it up to the end of the line.
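A quick way to check how the dot behaves in each of these two patterns, as a minimal sketch using one of the sample names from the question with an assumed +15 after it. In Python's re module a dot inside brackets is a literal '.', while a bare .* matches any run of characters up to a newline.

```python
import re

line = 'x_w_3_dummy_1+15'  # sample name from the question plus a delimiter

dot_in_class = re.compile(r'[a-zA-Z][a-zA-Z0-9_.]*')
bare_dot_star = re.compile(r'[a-zA-Z].*')

# The bracketed dot only admits a literal '.', so the match ends before '+'.
in_class_match = dot_in_class.search(line).group()
# The bare '.*' is a wildcard and runs through the delimiter to the line end.
bare_match = bare_dot_star.search(line).group()

print(in_class_match)
print(bare_match)
```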
