Python Regular Expression to combine outer text with text between tags
I have the following string (stage 1):
(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)
From there I get to (stage 2):
(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)
And what I ultimately want is (stage 3):
(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
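The way I currently do this is horrible.
I take the stage 1 string, remove all of the <A> tags entirely, and join the remaining text.
Then I take the course numbers from the <A> tags and put them back into the string from the previous step to get to stage 2.
Then I look for each course in the stage 2 string and delete everything to its left and right until I hit "(", ")", "or", or "and".
Is there a way to do all of this with a regular expression or some other method? Thanks.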
import re

x = """(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Remove the tags, the "Undergraduate level" prefix and the "Minimum Grade of X" suffix in one pass.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))
If the format is always fixed and does not vary much, you can use re.sub.
See the demo: https://regex101.com/r/hF7zZ1/2
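The single re.sub call goes straight from stage 1 to stage 3. If the intermediate stage 2 string from the question is also needed, the same idea can be split into two passes. This is only a minimal Python 3 sketch: it assumes "Undergraduate level" and "Minimum Grade of <letters>" are the only filler phrases, and the function names strip_tags and strip_filler are made up for illustration.

```python
import re

def strip_tags(text):
    # Stage 1 -> stage 2: drop anything of the form <...>, keep the text between tags.
    return re.sub(r"<[^>]*>", "", text)

def strip_filler(text):
    # Stage 2 -> stage 3: remove the fixed phrases around each course.
    # Assumption: the grade is a run of capital letters, as in the sample.
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", text)

stage1 = ('(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
          '(Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or '
          'Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D)')
stage2 = strip_tags(stage1)
stage3 = strip_filler(stage2)
print(stage3)  # (PHYS 218) and (MATH 152 or MATH 172)
```

Keeping the two passes separate makes it easier to adjust the filler phrases later without touching the tag handling.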
EDIT:
If the surrounding text varies, try this instead:
import re

x = """(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Keep only the parentheses, the "or"/"and" connectives, and the text enclosed by each <A>...</A> pair.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=<\/A>))", x)))
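One caveat with the findall pattern: "or" and "and" are matched anywhere in the string, so a word such as "Minor" or an HREF value containing "and" would leak fragments into the result. The variant below is a hedged sketch, not part of the original answer. It assumes the connectives always appear as whole lowercase words outside the tags and that the tag name may be written as A or a. The names TOKEN and extract_courses are hypothetical.

```python
import re

# One alternative per kind of token. A whole <A ...>...</A> element is consumed
# at once, so the contents of HREF are never scanned for "and" / "or".
TOKEN = re.compile(r"<[aA]\b[^>]*>([^<]*)</[aA]>|([()])|\b(and|or)\b")

def extract_courses(text):
    parts = []
    for m in TOKEN.finditer(text):
        course, paren, connective = m.groups()
        if course is not None:
            parts.append(course.strip())         # text between the tags
        elif paren:
            parts.append(paren)                  # "(" or ")"
        else:
            parts.append(" " + connective + " ") # normalise spacing around and/or
    return "".join(parts)

sample = ('(Minor level <A HREF="/x and y">PHYS 218</A> Minimum Grade of D) and '
          '(Undergraduate level <a href="b">MATH 152</a> Minimum Grade of D or '
          'Undergraduate level <A HREF="c">MATH 172</A> Minimum Grade of D)')
print(extract_courses(sample))  # (PHYS 218) and (MATH 152 or MATH 172)
```

For markup that is messier than this sample (nested tags, attributes containing ">"), an HTML parser is likely a safer choice than a regular expression.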