How to remove all occurrences of substrings between two tags from a string?
I am trying to remove all substrings between <pre><code> and </code></pre> in the following string, and also remove the <pre><code> and </code></pre> tags themselves:
txt = '<p>Large pythons were <pre><code> the subject of many </code></pre> a news story </p>\n last year due to the fact that there were at least two deaths <pre><code> directly attributable to them </code></pre>. Tragically, the deaths would not have happened had <pre><code> the owners of these snakes kept them </code></pre> safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, <a href= Albert Petkus and Joseph Dougherty </a>, was recently published in the Bulletin of the Chicago Herpetological Society, and represents the first clear, and accurate, <p> account of the death that occurred July 1993</p>\n'
I wrote the following code, which removes these tags when the substring occurs three times.
def remsubstr(s, first, last):
    # "first and last not in s" only tested `last`; check both markers explicitly.
    if first not in s or last not in s:
        return s
    try:
        # First occurrence: blank the text between the markers, then drop the markers.
        start = s.index(first) + len(first)
        end = s.index(last, start)
        d = (s[:start] + " " + s[end:]).replace('<p>', '').replace('</p>\n', '')
        started = d.index("<pre><code>")
        ended = d.index("</code></pre>") + len("</code></pre>")
        nw = d.replace(d[started:ended], '')
        if first in nw and last in nw:
            # Second occurrence.
            start = nw.index(first) + len(first)
            end = nw.index(last, start)
            d1 = nw[:start] + " " + nw[end:]
            started = d1.index("<pre><code>")
            ended = d1.index("</code></pre>") + len("</code></pre>")
            nw1 = d1.replace(d1[started:ended], '')
            if first in nw1 and last in nw1:
                # Third occurrence; anything beyond this is left untouched.
                start = nw1.index(first) + len(first)
                end = nw1.index(last, start)
                d2 = nw1[:start] + " " + nw1[end:]
                started = d2.index("<pre><code>")
                ended = d2.index("</code></pre>") + len("</code></pre>")
                nw2 = d2.replace(d2[started:ended], '')
                return nw2
            return nw1
        return nw
    except ValueError:
        return ""
I can remove all the required tags using the sample code above:
remsubstr(txt,"<pre><code>", "</code></pre>")
Result:
'Large pythons were a news story last year due to the fact that there were at least two deaths . Tragically, the deaths would not have happened had safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, <a href= Albert Petkus and Joseph Dougherty </a>, was recently published in the Bulletin of the Chicago Herpetological Society, and represents the first clear, and accurate, account of the death that occurred July 1993'
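I have thousands of strings to which this function should be applied, each of which may contain several occurrences of this pattern.
I am looking for help writing code that removes all substrings between the tags and that works for more than three substring/tag instances.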
I would suggest using BeautifulSoup. There you can combine .find_all() and .decompose(). In your case, this should do it:
import bs4
txt = '<p>Large pythons were <pre><code> the subject of many </code></pre> a news story </p>\n last year due to the fact that there were at least two deaths <pre><code> directly attributable to them </code></pre>. Tragically, the deaths would not have happened had <pre><code> the owners of these snakes kept them </code></pre> safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, <a href= Albert Petkus and Joseph Dougherty </a>, was recently published in the Bulletin of the Chicago Herpetological Society, and represents the first clear, and accurate, <p> account of the death that occurred July 1993</p>\n'
soup = bs4.BeautifulSoup(txt, "html.parser")
for tag in soup.find_all('pre'):
    if tag.find('code'):
        tag.decompose()
result = str(soup)
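If adding a third-party dependency is not possible, the same parse-then-drop idea can be approximated with the standard library's html.parser module. The following is only a sketch and not the method from the answer above: the names PreStripper and strip_pre are made up for illustration, and, as a simplifying assumption, it drops every <pre> element without checking whether it contains a <code>. Comments and doctype declarations are not re-emitted either.

```python
from html.parser import HTMLParser


class PreStripper(HTMLParser):
    """Hypothetical helper: re-emit the input, skipping everything inside <pre>."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []  # pieces of the output string
        self.depth = 0   # number of <pre> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "pre":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_pre(html):
    """Return html with all <pre>...</pre> elements (assumption: any <pre>) removed."""
    parser = PreStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_pre("a <pre><code>x</code></pre> b"))
```

Because strip_pre takes and returns a plain string, it can be mapped over a large list of inputs, for example with [strip_pre(s) for s in strings].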
Use Beautiful Soup 4; standard string operations are not well suited to nested markup such as XML documents.
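To illustrate the remark about string operations and nesting, here is a small sketch using the standard re module. It is not taken from either answer; the names PRE_CODE and remove_pre_code are hypothetical. For flat, non-nested blocks a non-greedy pattern removes any number of occurrences, but on nested blocks it stops at the first closing pair and leaves debris behind, which is the case where a parser is the safer choice.

```python
import re

# Non-greedy match from an opening <pre><code> to the nearest </code></pre>.
# re.DOTALL lets "." also match newlines inside a block.
PRE_CODE = re.compile(r"<pre><code>.*?</code></pre>", re.DOTALL)


def remove_pre_code(text):
    """Delete every flat <pre><code>...</code></pre> block, however many there are."""
    return PRE_CODE.sub("", text)


# Four flat occurrences: all of them are removed.
flat = ("a <pre><code>1</code></pre> b <pre><code>2</code></pre> "
        "c <pre><code>3</code></pre> d <pre><code>4</code></pre> e")
print(remove_pre_code(flat))    # a  b  c  d  e

# Nested blocks: the match ends at the first closing pair,
# so the tail of the outer block is left in the output.
nested = "x <pre><code>outer <pre><code>inner</code></pre> tail</code></pre> y"
print(remove_pre_code(nested))  # x  tail</code></pre> y
```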