
Python Regular Expression to combine outer text with text between tags

I have the following string (stage 1):

(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)

From this I get to (stage 2):

(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)

And what I ultimately want is (stage 3):

(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

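Right now, the way I do this is horrible.

I take the stage 1 string, strip out all the <A> tags entirely, and join the remaining text.

Then I take the course numbers from the <A> tags and put them back into the string from the previous step to get to stage 2.

Then I look up each course in the stage 2 string and delete everything to its left and right until I hit "(", ")", "or", or "and".

Is there a way to do all of this with a regular expression, or with some other approach? Thanks.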

import re
x="""(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Delete every tag (plus trailing whitespace) and the two fixed phrases around each course.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+","",x))
# Prints: (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

If the format is always fixed and will not vary much, you can use re.sub as shown above.

See the demo:

https://regex101.com/r/hF7zZ1/2
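If the intermediate stage 2 string is also needed, the same idea can be split into two passes that mirror the stages in the question. This is only a sketch under the same fixed-format assumption: the helper names `strip_tags` and `strip_boilerplate` are made up for illustration, and the code targets Python 3.

```python
import re

STAGE1 = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""


def strip_tags(text):
    """Stage 1 -> stage 2: drop the tags but keep the text between them."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    # Collapse the double spaces left behind where a tag used to be.
    return re.sub(r"[ \t]{2,}", " ", without_tags)


def strip_boilerplate(text):
    """Stage 2 -> stage 3: remove the fixed phrases around each course code."""
    # Assumes the phrases are exactly as in the sample string.
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", text)


stage2 = strip_tags(STAGE1)
stage3 = strip_boilerplate(stage2)
print(stage2)
print(stage3)
```

Keeping the two passes separate makes it easier to adjust one of them if only the tags or only the surrounding phrases change.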

EDIT:

If the surrounding text varies, try this instead:

import re
x="""(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Keep only the parentheses, the "and"/"or" connectives, and the text inside each <A>...</A>.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=</A>))",x)))
# Prints: (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
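One caveat with the findall approach: `\s*or\s*` and `\s*and\s*` also match inside longer words such as "Honors" or "Standard". A slightly more defensive sketch, assuming the connectives always appear as whole words and that the tag case may vary, adds word boundaries and `re.IGNORECASE`. The function name `extract_requirements` and the second sample string are hypothetical, made up for illustration.

```python
import re

TOKEN = re.compile(
    r"""
      [()]                       # grouping parentheses
    | \s*\b(?:and|or)\b\s*       # whole-word connectives, with surrounding spaces
    | (?<=>)[^<]+(?=</a>)        # text enclosed by an anchor tag
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_requirements(text):
    """Keep only parentheses, and/or, and the text inside <a>...</a> tags."""
    # The pattern has no capturing groups, so findall returns the full matches.
    return "".join(TOKEN.findall(text))


sample = (
    '(Honors level <A HREF="x">PHYS 218</A> Minimum Grade of D) and '
    '(Standard level <a href="y">MATH 152</a> Minimum Grade of C)'
)
print(extract_requirements(sample))
# Prints: (PHYS 218) and (MATH 152)
```

Without the `\b` anchors, the "or" in "Honors" and the "and" in "Standard" would leak into the result.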
