Regular expression matching with underscores

Question

I'm using Python's re package (yes, I am aware that regular expressions are more general, but who knows, there may be other packages) to read some data which includes inequalities with variable names followed by +, -, >, < or =. (It's a system of inequalities.) I need to filter out the variable names.

Up until now, I used

var_pattern = re.compile(r'[a-z|A-Z]+\d*\.?')

which is somewhat 'hacky' as it isn't very general. I didn't mind, but I came across a problem with weird names, as shown below.

My next attempt was

var_pattern = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')

which should, after at least one initial letter, match just about everything that occurs except for +, -, >, < and =. This works nicely with variable names like 'x23' or 'C2000001.' but not with 'x_w_3_dummy_1'. I would have thought it might be because of the underscore, but it seems to work just fine with the variable 'x_b_1_0_0'.

Does anybody have an idea of what might cause this and, more importantly, how to fix it?

As an aside, I also tried

var_pattern = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

but to no avail either.

Answer 1

Your pattern should work just fine for your example, but here is a slightly corrected version that actually matches your intention:

r'[a-zA-Z][a-zA-Z0-9_]*'

This matches one initial letter (lowercase or uppercase), followed by zero or more letters, digits and underscores. Your version had a redundant +, allowed a literal | as the first character, and allowed . in the rest of the name.

A demonstration showing that this matches all your samples:
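To make the two differences concrete, here is a minimal sketch comparing the question's second pattern with the corrected one. The variable names (`asker_first`, `fixed_first` and so on) are only illustrative, and 'C2000001.' is the sample from the question with its trailing dot.

```python
import re

# First-character class from the question: inside [...] the | is a literal
# pipe character, not an "or" operator.
asker_first = re.compile(r'[a-z|A-Z]+')
# Corrected first-character class: letters only.
fixed_first = re.compile(r'[a-zA-Z]')

pipe_ok_asker = asker_first.fullmatch('|') is not None
pipe_ok_fixed = fixed_first.fullmatch('|') is not None

# Full patterns: the question's version also allows a literal dot in the name.
asker_full = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')
fixed_full = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')

with_dot_asker = asker_full.match('C2000001.').group()
with_dot_fixed = fixed_full.match('C2000001.').group()

print(pipe_ok_asker, pipe_ok_fixed)    # True False
print(with_dot_asker, with_dot_fixed)  # C2000001. C2000001
```

If the trailing dot is meant to be part of the name, keeping . in the second class is the simpler choice.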

>>> import re
>>> names = ('x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0')
>>> var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')
>>> for name in names:
...     print(var_pattern.search(name).group())
... 
x23
C2000001
x_w_3_dummy_1
x_b_1_0_0

The pattern does not match any +, -, >, < or = characters that might follow the variable name:

>>> var_pattern.findall('x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5')
['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
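For comparison, the negated character class tried in the question can be run against the same input. This is a sketch, not part of the original answer: as far as Python's re parser is concerned, only the first ^ in [^+^-^>^<^=] negates the class, and the sequence ^-^ is read as a range from ^ to ^, so the hyphen itself is never excluded. The `fixed_negated` pattern is a hypothetical corrected variant, not something proposed in the thread.

```python
import re

text = 'x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5'

# Negated class from the question: excludes +, ^, >, < and =, but not -.
asker_negated = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

# Hypothetical corrected variant: a single leading ^, an escaped hyphen,
# and \s so that a match cannot run across a line break.
fixed_negated = re.compile(r'[a-zA-Z][^+\-><=\s]*')

asker_result = asker_negated.findall(text)
fixed_result = fixed_negated.findall(text)

print(asker_result)  # ['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0-5']
print(fixed_result)  # ['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
```

The explicit whitelist used in this answer is usually easier to reason about than a negated class, because it states exactly which characters a name may contain.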

Answer 2

It should be:

[a-zA-Z_][a-zA-Z0-9_.]*
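A short sketch of what this variant accepts: unlike the pattern in Answer 1, it also allows a leading underscore and dots inside the name. The sample strings below are hypothetical and chosen only to show those two cases.

```python
import re

# Pattern from this answer: letter or underscore first, then letters,
# digits, underscores or literal dots.
pattern = re.compile(r'[a-zA-Z_][a-zA-Z0-9_.]*')

# Hypothetical inputs, not taken from the question's data.
samples = ['_tmp1>=0', 'x.y_2<5', 'x_w_3_dummy_1+15']

found = [pattern.match(s).group() for s in samples]
print(found)  # ['_tmp1', 'x.y_2', 'x_w_3_dummy_1']
```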

Answer 3

Your question has already been answered, apart from why your original expression appeared not to work with your underscores. If you have the pattern

r'[a-zA-Z][a-zA-Z0-9_.]*'

then the dot sits inside a character class, where it is a literal dot, so the pattern is not equivalent to

r'[a-zA-Z].*'

Contrary to what you thought, it does match both your "x_w_3_dummy_1" and your "x_b_1_0_0" in full. The class only adds . to the characters allowed in a name. It is the second form, with the dot outside the brackets, that would also match the subsequent delimiter (+, -, >, < or =) as well as anything after it on the same line.
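How the dot behaves inside and outside square brackets can be checked directly. In Python's re module a dot inside a character class is documented as losing its special meaning, so the sketch below compares the two patterns on one line of input. The names `dot_in_class` and `bare_dot` are only illustrative.

```python
import re

line = 'x_w_3_dummy_1+15'

# Dot inside the class: a literal '.', so the match stops at '+'.
dot_in_class = re.compile(r'[a-zA-Z][a-zA-Z0-9_.]*')
# Dot outside any class: any character except a newline, so the match
# runs through the delimiter to the end of the line.
bare_dot = re.compile(r'[a-zA-Z].*')

in_class_match = dot_in_class.match(line).group()
bare_match = bare_dot.match(line).group()

print(in_class_match)  # x_w_3_dummy_1
print(bare_match)      # x_w_3_dummy_1+15
```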

3 answers

Answer 1: 2, accepted, 2013-03-26 12:35:51
Answer 2: 0, 2013-03-26 12:37:27
Answer 3: 0, 2013-03-26 14:35:43
