Regular Expression Matching With Underscores
I'm using Python's re package (yes, I am aware that regular expressions are more general, but who knows, there may be other packages) to read some data that includes inequalities: variable names followed by +, -, >, < or =. (It's a system of inequalities.) I need to extract the variable names.
Up until now, I used
var_pattern = re.compile(r'[a-z|A-Z]+\d*\.?')
which is somewhat 'hacky' as it isn't very general. I didn't mind, but I came across a problem with unusual names like the ones below.
My next attempt was
var_pattern = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')
which should, after at least one initial letter, match just about everything that occurs except for +, -, >, < and =. This works nicely with variable names like 'x23' or 'C2000001.' but not with 'x_w_3_dummy_1'. I would have thought it might be because of the underscore, but it seems to work just fine with the variable 'x_b_1_0_0'.
Does anybody have an idea of what might cause this and, more importantly, how to fix it?
As an aside, I also tried
var_pattern = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')
but to no avail either.
Your pattern should work just fine for your example, but here it is corrected a little to actually match your intention:
r'[a-zA-Z][a-zA-Z0-9_]*'
This matches 1 initial letter (lowercase or uppercase), followed by 0 or more letters, digits and underscores. Your version had a redundant '+', and it included '|' in what was allowed for the first character and '.' in what was allowed for the rest of the name.
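To see why the '|' matters, here is a small sketch. Inside a character class, '|' is an ordinary character rather than alternation, so the first-character class from the question accepts a literal pipe. The sample string 'a|b' is made up for illustration and does not come from the original data.

```python
import re

# The class from the question: '|' inside [...] is a literal pipe, not "or".
loose = re.compile(r'[a-z|A-Z]+')
# The corrected class: letters only.
strict = re.compile(r'[a-zA-Z]+')

sample = 'a|b'  # hypothetical input, chosen only to expose the difference

loose_matches = loose.findall(sample)
strict_matches = strict.findall(sample)

print(loose_matches)   # ['a|b']
print(strict_matches)  # ['a', 'b']
```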
A demonstration to show this matches all your samples:
>>> import re
>>> names = ('x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0')
>>> var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')
>>> for name in names:
...     print(var_pattern.search(name).group())
...
x23
C2000001
x_w_3_dummy_1
x_b_1_0_0
The pattern does not match any +, -, >, < or = characters that might follow the variable name:
>>> var_pattern.findall('x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5')
['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
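As for the third attempt in the question, one likely reason it did not help (this is an observation about Python's character-class syntax, not something stated in the question): in [^+^-^>^<^=] only the first ^ negates. The later ones are literal carets, and ^-^ is read as a range from ^ to ^, so '-' itself is never excluded. A minimal sketch follows; the cleaned-up class [^-+<>=] is my own suggestion rather than part of the original thread.

```python
import re

# Third attempt from the question: '^-^' is parsed as a one-character range,
# so a minus sign is still allowed inside the name.
attempt = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

# Suggested variant: a single leading ^ negates; '-' placed first is literal.
fixed = re.compile(r'[a-zA-Z][^-+<>=]*')

line = 'x_b_1_0_0-5'  # taken from the sample data used above

attempt_matches = attempt.findall(line)
fixed_matches = fixed.findall(line)

print(attempt_matches)  # ['x_b_1_0_0-5']
print(fixed_matches)    # ['x_b_1_0_0']
```

Note that a negated class like this also accepts spaces and other punctuation, which is why the explicit whitelist [a-zA-Z0-9_] is usually the safer choice.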
It should be:
[a-zA-Z_][a-zA-Z0-9_.]*
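This variant also accepts a leading underscore and keeps dots inside the name. A quick sketch of how it behaves; the name '_tmp' and the sample text are made up for illustration, while 'C2000001.' with its trailing dot is one of the names mentioned in the question.

```python
import re

var_pattern = re.compile(r'[a-zA-Z_][a-zA-Z0-9_.]*')

# Hypothetical input: one inequality per line.
text = 'x23<10\nC2000001.=24\n_tmp+1'

found = var_pattern.findall(text)

print(found)  # ['x23', 'C2000001.', '_tmp']
```

Whether the trailing dot belongs to the name depends on the data format, so treat the dot in the second class as optional.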
Your question has already been answered, apart from why your original expression didn't work with your underscores. If you have the pattern
r'[a-zA-Z][a-zA-Z0-9_.]*'
then the dot can make it look equivalent to
r'[a-zA-Z].*'
but inside a character class the dot is a literal '.', so the two are not the same. Contrary to what you thought, your pattern does match both your "x_w_3_dummy_1" and your "x_b_1_0_0", and it stops at the delimiter that follows. The .* form is the one that would cause trouble: an unescaped dot outside a character class also matches your subsequent delimiter, like your +, -, >, < and =, as well as anything after it on the same line.
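The two patterns can be compared directly. A minimal sketch, reusing one line from the sample data above: in Python's re, a dot inside [...] is a literal '.', while a bare dot matches any character except a newline.

```python
import re

in_class = re.compile(r'[a-zA-Z][a-zA-Z0-9_.]*')  # dot inside a class: literal '.'
bare_dot = re.compile(r'[a-zA-Z].*')              # bare dot: any character but newline

line = 'x_w_3_dummy_1+15'

in_class_match = in_class.search(line).group()
bare_dot_match = bare_dot.search(line).group()

print(in_class_match)  # x_w_3_dummy_1
print(bare_dot_match)  # x_w_3_dummy_1+15
```

If the pattern from the question still appears to fail on a particular name, it may be worth checking the surrounding code or the input itself (for example, stray characters in the data), since the expression alone handles both underscore names.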