Extract substrings from logical expressions

Let's say I have a string that looks like this:

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'

What I would like to obtain in the end is:

myStr_l1 = '(Txt_l1) or (Txt2_l1)'

and

myStr_l2 = '(Txt_l2) or (Txt2_l2)'

Some properties:

  • all "Txt_" elements of the string start with an uppercase letter

  • the string can contain many more elements (so there could also be Txt3, Txt4, ...)

  • the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)

I found a way to get the first part done by using:

myStr_l1 = re.sub('\(\w+\)','',myStr)

which gives me

'(Txt_l1 ) or (Txt2_l1 )'

However, I don't know how to obtain myStr_l2. My idea was to remove everything between two opening parentheses. But when I do something like this:

re.sub('\(w+\(', '', myStr)

the entire string is returned, while

re.sub('\(.*\(', '', myStr)

removes, of course, far too much and gives me

'Txt2_l2))'

Does anyone have an idea how to get myStr_l2?

When there is an "and" instead of an "or", the strings look slightly different:

myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'

Then I can still use the command from above:

re.sub('\(\w+\)','',myStr2)

which gives:

'(Txt_l1  and Txt2_l1 )'

but I again fail to get myStr2_l2. How would I do this for these kinds of strings?

And how would one then do this for mixed expressions with "and" and "or", e.g. like this:

myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or  (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))' 

re.sub('\(\w+\)','',myStr3)

gives me

'(Txt_l1  and Txt2_l1 ) or  (Txt3_l1  and Txt4_l1 )'

but again: how would I obtain myStr3_l2?

Regular expressions are not powerful enough for arbitrarily nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
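To make the parser idea a bit more concrete, here is a minimal hand-written sketch (not pyparsing) that tokenizes the string and counts parenthesis depth to find the group belonging to each element. It assumes, as stated in the question, that every element starts with an uppercase letter and that operators such as "and"/"or" do not; the names split_levels and render are made up for this illustration.

```python
import re

# Tokens are parentheses or runs of word characters; whitespace is dropped.
TOKEN = re.compile(r"\(|\)|\w+")

def render(tokens):
    """Join tokens with spaces, then tighten the spacing inside parentheses."""
    return " ".join(tokens).replace("( ", "(").replace(" )", ")")

def split_levels(expr):
    """Split an expression made of `Outer (inner)` pairs into two strings.

    Assumption: elements start with an uppercase letter, operators do not.
    """
    tokens = TOKEN.findall(expr)
    level1, level2 = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        is_pair = (tok[0].isupper()
                   and i + 1 < len(tokens)
                   and tokens[i + 1] == "(")
        if is_pair:
            # Walk forward to the matching ")" by counting depth, so the
            # inner group may itself contain nested parentheses.
            depth, j = 1, i + 2
            while j < len(tokens) and depth:
                if tokens[j] == "(":
                    depth += 1
                elif tokens[j] == ")":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unbalanced parentheses")
            level1.append(tok)                    # the outer element
            level2.extend(tokens[i + 2:j - 1])    # what was inside its group
            i = j
        else:
            # Structural parentheses and operators go to both results.
            level1.append(tok)
            level2.append(tok)
            i += 1
    return render(level1), render(level2)

print(split_levels('(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'))
# ('(Txt_l1) or (Txt2_l1)', '(Txt_l2) or (Txt2_l2)')
```

Because whitespace is thrown away during tokenizing, the output spacing is normalized rather than copied from the input. The same function should also cope with the "and" form and the mixed form from the question, as long as the uppercase assumption holds.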

I'm not entirely sure what you want, but I wrote this to strip everything between the parentheses.

import re

mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
    # group 1: the outer element, group 2: the nested element
    mat = re.match(r'\((.*) \((.*)\)\)', line)
    if mat:
        noParens.append(mat.group(1))
        noParens.append(mat.group(2))
print(noParens)  # ['Txt_l1', 'Txt_l2', 'Txt2_l1', 'Txt2_l2']

This takes all the parentheses away and puts your elements in a list. Here's an alternative way of doing it without using regular expressions.

mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'

# drop the operator and both kinds of parentheses, then split on whitespace
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')', '')
mystr = mystr.replace('(', '')
noParens = mystr.split()

print(noParens)
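Both snippets above give a flat list rather than the two strings asked for in the question. If the nesting really is only one level deep, as in all the examples shown, two substitutions might already be enough. This is only a sketch under that assumption: it also assumes every element is made of word characters and is followed by exactly one parenthesized nested element, and the function names level1 and level2 are invented here.

```python
import re

def level1(expr):
    # Remove each innermost "(name)" group plus the whitespace in front of it.
    return re.sub(r"\s*\(\w+\)", "", expr)

def level2(expr):
    # Replace every "outer (inner)" pair by just "inner".
    return re.sub(r"\w+\s*\((\w+)\)", r"\1", expr)

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'

print(level1(myStr))   # (Txt_l1) or (Txt2_l1)
print(level2(myStr))   # (Txt_l2) or (Txt2_l2)
print(level1(myStr2))  # (Txt_l1 and Txt2_l1)
print(level2(myStr2))  # (Txt_l2 and Txt2_l2)
```

Neither pattern relies on the '_l1'/'_l2' suffixes. Note that an element without a nested group, for example 'or (C)', would be misread by level2 as a pair, so for anything deeper or less regular a real parser is the safer route.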
