Python Regular Expression to combine outer text with text between tags
I have the following string (stage 1):
(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)
From here I get to (stage 2):
(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)
And what I ultimately want is (stage 3):
(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
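Right now, the way I do this is horrible.
I take the stage 1 string, strip out all the <A> tags entirely, and join the remaining text.
Then I take the course numbers from the <A> tags and put them back into the string from the previous step to get to stage 2.
Then, in stage 2, I look for each course and delete everything to its left and right until I hit "(", ")", "or", or "and".
Is there a way to do all of this with a regular expression or some other method? Thanks.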
x="""(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
import re
# Delete every tag (plus trailing whitespace) and the two fixed filler phrases.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))
If the format is always fixed and does not vary much, you can use re.sub.
See the demo.
https://regex101.com/r/hF7zZ1/2
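To make the "fixed format" condition concrete, here is a minimal sketch that wraps the same pattern in a function. The name simplify_prereqs and the "Graduate level" input are made up for illustration; they are not from the question.

```python
import re

# Same pattern as in the answer above, compiled once.
NOISE = re.compile(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+")

def simplify_prereqs(text):
    """Delete tags and the two fixed filler phrases; keep everything else."""
    return NOISE.sub("", text)

stage1 = ('(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
          '(Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or '
          'Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D)')
print(simplify_prereqs(stage1))
# (PHYS 218) and (MATH 152 or MATH 172)

# Hypothetical variation: a phrase the pattern does not list is left in place.
varied = '(Graduate level <A HREF="x">PHYS 601</A> Minimum Grade of C)'
print(simplify_prereqs(varied))
# (Graduate level PHYS 601)
```

The second call shows why this only suits input whose wording never changes: anything not spelled out in the pattern survives into the result.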
Edit:
If the text varies, try this instead:
x="""(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
import re
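# Keep only parentheses, "or"/"and" with their surrounding spaces, and the text inside each <A>...</A>.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=<\/A>))", x)))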
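One caveat with the pattern above: "or" and "and" can also match inside longer words in the surrounding text (for example "for" or "Standing"). A possible variant, sketched below, adds \b word boundaries so that only standalone "or" and "and" are kept. The \b anchors, the name extract_courses, and the sample input are my own assumptions, not part of the original answer.

```python
import re

# Keep: parentheses, standalone "or"/"and" (with surrounding spaces),
# and the text between ">" and "</A>".
KEEP = re.compile(r"\(|\)|\s*\b(?:or|and)\b\s*|(?<=>)[^<]*(?=</A>)")

def extract_courses(text):
    """Join the kept pieces in order; everything else is dropped."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return "".join(KEEP.findall(text))

# Made-up input whose filler text contains "or" and "and" inside words.
sample = ('(Prior credit for <A HREF="blah">MATH 152</A> Standing required or '
          'Honors <A HREF="blah">MATH 172</A> Minimum Grade of D)')
print(extract_courses(sample))
# (MATH 152 or MATH 172)
```

This still assumes the course codes are the only anchor text and that the tags are written exactly as </A>. For messier HTML a real parser would probably be safer.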