
Python regex for select/extract from nested groups

I am trying to process a string containing CHAR(int) and NCHAR(int) calls and replace each of those instances with its ASCII counterpart. An example would be something like this:

CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))

Note that I don't want to do anything to VARCHAR(int); only the CHAR(int) and NCHAR(int) parts should change. The above should translate to:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns] WHERE xtype=U AND id = OBJECT_ID(EN_Empl)

Note that any "+" on either side of CHAR(int) or NCHAR(int) should be removed. I tried the following:

def conv(m):
    return chr(int(m.group(2)))

print re.sub(r'([\+ ]?n?char\((.*?)\)[\+ ]?)', conv, str, re.IGNORECASE)

where str is the raw string that must be processed.

Somehow, the VARCHAR(8000) is being picked up. If I tweak the regex, the "=" after xtype disappears, rather than just the space and the "+" on either side of a CHAR(int) or NCHAR(int) instance.

Hope someone can pull me out of this.

ADDITIONAL SAMPLE STRINGS:

String: "char(124)+(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

Regex: r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

Result: "|(Select Top 1 cast(name as varchar(8000)) from (Select Top 1 colid,name From [Projects]..[syscolumns] Where id = OBJECT_ID(ENCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108)))"

You have three issues:

  1. You need to use flags=re.IGNORECASE and not just re.IGNORECASE in re.sub. The fourth positional parameter of re.sub is count, so a bare re.IGNORECASE (integer value 2) is read as count=2 and the match stays case-sensitive. flags has to be passed as a keyword argument.
  2. You need to use \b to anchor the match at a word boundary, so that the CHAR inside VARCHAR is not matched.
  3. You should not use str as a variable name, since it shadows the built-in of the same name.
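To make the first point concrete, here is a minimal sketch. The sample text and variable names are made up for illustration, and count is passed by keyword to mimic what the positional call ends up doing:

```python
import re

# Made-up sample: three lowercase calls and one uppercase call.
text = "char(65) char(66) char(67) CHAR(68)"
pattern = r"\bchar\((\d+)\)"

def conv(m):
    return chr(int(m.group(1)))

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# A fourth positional argument therefore lands in `count`, not `flags`.
flag_as_int = int(re.IGNORECASE)  # the flag's integer value

# What the mistaken call amounts to: a replacement limit, still case-sensitive.
mistaken = re.sub(pattern, conv, text, count=flag_as_int)

# The intended call: no limit, case-insensitive matching.
intended = re.sub(pattern, conv, text, flags=re.IGNORECASE)

print(mistaken)
print(intended)
```

This would also explain the result in the question's additional sample, where only the first two occurrences were replaced.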

This works:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print(re.sub(pat, conv, tgt, flags=re.IGNORECASE))

More completely:

import re

tgt='''\
CHAR(124) + (SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=char(85)
AND id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)+NCHAR(69)+NCHAR(109)+NCHAR(112)+NCHAR(108))'''

pat=r'(\bn?char\((\d+)\)(?:\s*\+\s*)?)'

def conv(m):
    return chr(int(m.group(2)))

print(re.sub(r'''
              (                                 # group 1
              \b                                # word boundary
              n?char                            # nchar or char
              \(                                # literal left paren
              (\s*\d+\s*)                       # group 2: digits, optionally padded with whitespace
              \)                                # literal right paren
              (?:\s*\+\s*)?                     # optionally followed by a concatenating '+'
              )                                 '''
            , conv, tgt, flags=re.VERBOSE | re.IGNORECASE))

Prints:

|(SELECT TOP 1 CAST(name AS VARCHAR(8000)) FROM (SELECT TOP 1 colid, name FROM [Projects]..[syscolumns]
WHERE xtype=U
AND id = OBJECT_ID(EN_Empl)
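If the conversion is needed in more than one place, the same idea can be wrapped in a small function. This is only a sketch: the names CHAR_CALL and decode_char_calls are hypothetical, the pattern is a slight variation of the one above (whitespace inside the parentheses is tolerated but kept out of the capture group), and the sample is a shortened version of the question's second string.

```python
import re

# Hypothetical names; compiled once so call sites stay short.
CHAR_CALL = re.compile(
    r"\bn?char\(\s*(\d+)\s*\)(?:\s*\+\s*)?",  # CHAR(n)/NCHAR(n) plus an optional trailing '+'
    re.IGNORECASE,
)

def decode_char_calls(sql):
    """Replace each CHAR(n)/NCHAR(n) call with the character it encodes."""
    return CHAR_CALL.sub(lambda m: chr(int(m.group(1))), sql)

# Shortened sample based on the question's second string.
sample = ("char(124)+(Select Top 1 cast(name as varchar(8000)) from t "
          "Where id = OBJECT_ID(NCHAR(69)+NCHAR(78)+NCHAR(95)))")
decoded = decode_char_calls(sample)
print(decoded)
```

varchar(8000) is left alone because there is no word boundary between the "r" and the "c" inside "varchar".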

You can go a long way just by adding the word boundary (\b) assertion, but I'd like to suggest that you (1) use re.VERBOSE to write a regexp someone can understand later; (2) compile the regexp to reduce clutter at the call site; and (3) tighten some of the matching criteria. Like so:

import re

def conv(m):
    return chr(int(m.group(1)))

pat = re.compile(r"""[+\s]*    # optional whitespace or +
                     \b        # word boundary
                     n?char    # NCHAR or CHAR
                     \(        # left paren
                     ([\d\s]+) # digits or spaces - group 1
                     \)        # right paren
                     [+\s]*    # optional whitespace or +
                  """, re.VERBOSE | re.IGNORECASE)
print(pat.sub(conv, data))  # data is the raw input string

Note that I changed your str to data: str is the name of a heavily used builtin function, and it's a Really Bad Idea to create a variable with the same name.
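One thing worth checking before adopting this pattern: [+\s]* on both sides also consumes plain whitespace, including newlines, even when no "+" is present. The sketch below compares it with a pattern that only strips whitespace around a "+". The sample string and variable names are made up for illustration.

```python
import re

def conv(m):
    return chr(int(m.group(1)))

# Made-up sample with a newline after the first call and a space before the second.
sample = "WHERE xtype=char(85)\nAND id = NCHAR(69)+NCHAR(78)"

# Strips any run of '+' or whitespace on either side of the call.
strip_both = re.compile(r"[+\s]*\bn?char\(([\d\s]+)\)[+\s]*", re.IGNORECASE)

# Strips whitespace only when it surrounds a trailing '+'.
strip_plus = re.compile(r"\bn?char\((\d+)\)(?:\s*\+\s*)?", re.IGNORECASE)

both_result = strip_both.sub(conv, sample)
plus_result = strip_plus.sub(conv, sample)

print(repr(both_result))
print(repr(plus_result))
```

Whether the extra stripping matters depends on the input. In SQL like the question's, losing the newline before AND joins two tokens together.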

You only need to use a word boundary \b:

def conv(m):
    return chr(int(m.group(1)))

print(re.sub(r'\bn?char\(([^)]+)\)(?:\s*\+\s*)?', conv, data, flags=re.IGNORECASE))  # data is the raw input string
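As a quick check of what \b changes, the following sketch lists the matches with and without it. The SQL fragment is a made-up sample.

```python
import re

# Made-up fragment containing one VARCHAR and one CHAR call.
sql = "CAST(name AS VARCHAR(80)) + CHAR(85)"

# Without \b the pattern can start matching in the middle of VARCHAR.
without_boundary = re.findall(r"n?char\(\d+\)", sql, flags=re.IGNORECASE)

# With \b the match must start at the beginning of a word.
with_boundary = re.findall(r"\bn?char\(\d+\)", sql, flags=re.IGNORECASE)

print(without_boundary)
print(with_boundary)
```

The first list includes the CHAR(80) tail of VARCHAR(80), which is the unwanted match described in the question. The second list contains only the standalone CHAR(85).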
