Python regex for select/extract from nested groups
I am trying to process a string containing CHAR(int) and NCHAR(int) and replace those instances with their ASCII counterparts. An example would be something like this:
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))
Note that I don't want to do anything to VARCHAR(int); only the CHAR(int) and NCHAR(int) parts should change. The above should translate to:
|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns] WHERE xtype=U AND id = OBJECT_ID(EN_Empl)
Note that any "+" on either side of CHAR(int) or NCHAR(int) should be removed. I tried the following:
def conv(m):
    return chr(int(m.group(2)))
print(re.sub(r'([\+ ]?n?char\((.*?)\)[\+ ]?)', conv, str, re.IGNORECASE))
where str is the raw string that must be processed.
Somehow, the VARCHAR(8000) is being picked up. If I tweak the regex, the "=" after xtype goes away, rather than just the spaces and the "+" on either side of a CHAR(int) or NCHAR(int) instance.
Hope someone can help me out of this.
ADDITIONAL SAMPLE STRINGS:
String: "char(124)+(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"
Regex: r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'
Result: "|(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(ENCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"
You have three issues:
1. You need to use flags=re.IGNORECASE and not just re.IGNORECASE in re.sub. That is a keyword argument; passed positionally, the value lands in the count parameter instead.
2. You need to use \b to find the word boundary.
3. Do not use str as a name, since you will overwrite the built-in of the same name.
This works:
import re

tgt = '''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat = r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    # group 2 holds the digits inside the parentheses
    return chr(int(m.group(2)))

print(re.sub(pat, conv, tgt, flags=re.IGNORECASE))
More completely:
import re

tgt = '''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

def conv(m):
    # int() tolerates the optional spaces captured around the digits
    return chr(int(m.group(2)))

pat = r'''
    (                   # group 1
      \b                # word boundary
      n?char            # nchar or char
      \(                # literal left paren
      (\s*\d+\s*)       # group 2: digits, optionally surrounded by spaces
      \)                # literal right paren
      (?:\s*\+\s*)?     # optionally followed by a concatenating '+'
    )'''

print(re.sub(pat, conv, tgt, flags=re.VERBOSE | re.IGNORECASE))
Prints:
|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=U
AND id = OBJECT_ID(EN_Empl)
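The flags point is easy to miss, so here is a minimal sketch of what goes wrong. The signature is re.sub(pattern, repl, string, count=0, flags=0), and re.IGNORECASE has the integer value 2, so a positional re.IGNORECASE amounts to count=2 with no flags at all. The sample text and the names limited and full are made up for illustration.

```python
import re

text = "char(65)+char(66)+char(67)+char(68)"
pat = r'\bn?char\((\d+)\)(?:\s*\+\s*)?'

def conv(m):
    return chr(int(m.group(1)))

# What the positional call effectively does: the flag's integer value
# becomes the replacement limit, and matching stays case-sensitive.
limited = re.sub(pat, conv, text, count=int(re.IGNORECASE))

# What was intended: no limit, case-insensitive matching.
full = re.sub(pat, conv, text, flags=re.IGNORECASE)

print(limited)  # ABchar(67)+char(68)
print(full)     # ABCD
```

This is consistent with the result in the question, where only the first two calls were replaced.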
You can go a long way just by adding the word boundary (\b) assertion, but I'd like to suggest that you (1) use re.VERBOSE to write a regexp someone can understand later; (2) compile the regexp to reduce clutter at the call site; and (3) tighten some of the matching criteria. Like so:
import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""
    (?:\s*\+\s*)?   # optional '+' (with surrounding whitespace) before
    \b              # word boundary
    n?char          # NCHAR or CHAR
    \(              # left paren
    \s*(\d+)\s*     # digits, optionally padded with spaces - group 1
    \)              # right paren
    (?:\s*\+\s*)?   # optional '+' (with surrounding whitespace) after
    """, re.VERBOSE | re.IGNORECASE)

# data is the raw string to process
print(pat.sub(conv, data))
Note that I changed your str to data: str is the name of a heavily used built-in type, and it's a Really Bad Idea to create a variable with the same name.
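A small sketch of why reusing the name str is risky. The function shadowing_demo is hypothetical and exists only to keep the shadowing local, so the built-in stays usable in the rest of the script.

```python
def shadowing_demo():
    # Inside this function, str now refers to a string, not the built-in type.
    str = "CHAR(85)"
    try:
        return str(85)          # tries to call a string object
    except TypeError as exc:
        return "TypeError: %s" % exc

print(shadowing_demo())
print(str(85))  # outside the function the built-in is untouched
```

At module level the same assignment would hide the built-in for every later line in the file, which is harder to spot.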
You only need to use a word boundary \b:
import re

def conv(m):
    return chr(int(m.group(1)))

# data is the raw string to process; \d+ keeps int() from seeing non-digits
print(re.sub(r'\bn?char\((\d+)\)(?:\s*\+\s*)?', conv, data, flags=re.IGNORECASE))
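To see what the word boundary changes, the sketch below runs the same pattern with and without \b over a shortened sample (the string and variable names are illustrative). Without \b, the regex finds CHAR(8000) inside VARCHAR(8000); with \b it does not, because there is no boundary between the R and the C.

```python
import re

sample = "CAST(name AS VARCHAR(8000)) WHERE xtype=CHAR(85)"

# No boundary: the tail of VARCHAR(8000) is matched as well.
without_boundary = re.findall(r'n?char\(\d+\)', sample, flags=re.IGNORECASE)

# With \b: a match may only start where a word starts.
with_boundary = re.findall(r'\bn?char\(\d+\)', sample, flags=re.IGNORECASE)

print(without_boundary)  # ['CHAR(8000)', 'CHAR(85)']
print(with_boundary)     # ['CHAR(85)']
```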