
Python Regular Expression to combine outer text with text between tags

I have the following string (stage 1):

(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)

From this I get to (stage 2):

(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)

And what I ultimately want is (stage 3):

(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

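The way I currently do this is horrible.

I take the stage 1 string, remove all the A tags completely, and join the remaining text.

Then I take the course numbers from the A tags and put them back into the string from the previous step, which gets me to stage 2.

Then I find each course in the stage 2 string and delete everything to its left and right until I hit "(", ")", "or", or "and".

Is there any way to do all of this with a regular expression or some other method? Thanks.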

import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""

# Delete every tag (plus trailing whitespace) and the fixed phrases around each course.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))
# (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

If the format is always fixed and won't change much, you can use re.sub.

See the demo:

https://regex101.com/r/hF7zZ1/2
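The single substitution above can also be split into the two stages the question describes, which makes each step easier to check. This is only a sketch for Python 3: the helper names strip_tags and strip_boilerplate are made up for illustration, and it assumes the surrounding phrases are always exactly "Undergraduate level" and "Minimum Grade of" followed by a capital-letter grade.

```python
import re

# Stage 1 string from the question, split across lines for readability.
stage1 = (
    '(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)'
)


def strip_tags(text):
    """Stage 1 -> stage 2: drop every tag, then collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"\s+", " ", without_tags)


def strip_boilerplate(text):
    """Stage 2 -> stage 3: drop the fixed phrases around each course (assumed wording)."""
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", text)


stage2 = strip_tags(stage1)
stage3 = strip_boilerplate(stage2)
print(stage2)
print(stage3)  # (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
```

Keeping the stages separate means the intermediate string is still available if the level or minimum grade is needed later.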

Edit:

If the surrounding text changes, try this instead:

import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""

# Keep only parentheses, the "or"/"and" connectives, and the text inside each <A>...</A>.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=</A>))", x)))
# (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
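One caveat with the findall pattern: it matches "or" and "and" anywhere, including inside longer words or inside an attribute value, so text such as "Major" or an href containing "or" would leak into the result. Below is a hedged Python 3 sketch of a stricter variant, not part of the original answer. It assumes the connectives always appear as whole words and that course names are always the text of an anchor tag; the names TOKEN and extract_requirements are hypothetical.

```python
import re

# Each alternative is one kind of token we want to keep.
TOKEN = re.compile(
    r"""
    (?P<paren>[()])                        # grouping parentheses
    | \b(?P<op>and|or)\b                   # connectives, whole words only
    | <a\b[^>]*>(?P<course>[^<]*)</a>      # text inside an anchor tag
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_requirements(html):
    """Rebuild the prerequisite expression from the tokens found in html."""
    parts = []
    for match in TOKEN.finditer(html):
        if match.group("paren"):
            parts.append(match.group("paren"))
        elif match.group("op"):
            parts.append(" " + match.group("op").lower() + " ")
        else:
            # The anchor alternative consumes the whole tag, so attribute
            # values are never scanned for "and"/"or".
            parts.append(match.group("course").strip())
    return "".join(parts)


sample = (
    '(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)'
)
print(extract_requirements(sample))
# (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
```

If the input is arbitrary HTML rather than this fixed snippet, a real HTML parser is usually a safer choice than any regular expression.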

