Replacing the nth substring in a string in Python using regular expressions

Question

I am writing a function to edit many strings in an HTML file at once. The requirements are a bit unusual, though. Here is an example.

My string:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="/uploads/3/3/9/3/3393839/____________________________________________________________________________________________________________________________________________________614162727.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

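The actual string is much longer! I am trying to replace all images that sit inside Wikipedia links with one image, and all images that sit inside another link with a different image.

This is what I have so far: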

wikiPath = r"www.somewebsite.com/myimage.png"

def dePolute(myString):

    newString =""

    # Last index found
    lastIndex = 0


    while True:
        wikiIndex = myString.index('wikipedia',lastIndex)
        picStartIndex = myString.index('<img ', wikiIndex)
        picEndIndex = myString.index('/>', wikiIndex)

        newString = re.sub(r'<img.*?/>','src="' + wikiPath ,myString,1)

    return newString

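This obviously doesn't work, but my idea was to first find the index of the 'wiki' keyword, which appears in all of these links, and then replace the img tag that starts after that index. Unfortunately, I don't know how to make re.sub start from a specific index. I can't do newString = re.sub(specification, newEntry, originalString[wikiIndex:]) because that would return a substring rather than the whole string.

Here is what I want my string to look like after the program finishes running: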

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="www.somewebsite.com/myimage.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

Answer 1

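I would use an HTML parser, such as BeautifulSoup.

The idea is to use a CSS selector to locate img elements that sit inside an a element whose href contains wikipedia. For each such img element, replace the src attribute value: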

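from bs4 import BeautifulSoup

# Replacement image path, as defined in the question
wikiPath = r"www.somewebsite.com/myimage.png"

data = """your HTML"""

soup = BeautifulSoup(data, "html.parser")

# Select every img that has a src and sits inside an <a> whose href contains "wikipedia"
for img in soup.select("a[href*=wikipedia] img[src]"):
    img["src"] = wikiPath

print(soup.prettify())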
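
If a parser is not an option, the regex part of the question can still be handled with the standard library. The following is only a rough sketch, not a robust HTML solution: it assumes the img tag to change is the first one after each occurrence of the keyword, and that src is written with double quotes. The names sub_from, sub_nth, replace_wiki_images and IMG_SRC are made up for this illustration. The trick for "re.sub from an index" is to substitute only in the tail of the string and glue the untouched prefix back on.

```python
import re

# Assumption: src attributes use double quotes, as in the sample HTML.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")[^"]*(")')


def sub_from(pattern, repl, text, start, count=0):
    """Substitute only in text[start:], but return the whole string."""
    return text[:start] + pattern.sub(repl, text[start:], count=count)


def sub_nth(pattern, repl, text, n):
    """Replace only the n-th match (1-based) with the literal string repl."""
    for i, match in enumerate(pattern.finditer(text), start=1):
        if i == n:
            return text[:match.start()] + repl + text[match.end():]
    return text  # fewer than n matches: nothing to replace


def replace_wiki_images(html, new_src, keyword="wikipedia"):
    """Point the first <img> after each keyword occurrence at new_src."""
    pos = 0
    while True:
        idx = html.find(keyword, pos)  # find() returns -1 instead of raising
        if idx == -1:
            return html
        # A function as repl avoids backslash handling in new_src.
        html = sub_from(
            IMG_SRC,
            lambda m: m.group(1) + new_src + m.group(2),
            html,
            idx,
            count=1,
        )
        pos = idx + len(keyword)
```

Calling replace_wiki_images(myString, wikiPath) would then return the full string with only the images that follow a Wikipedia link changed. Note that the keyword is matched anywhere in the text, not just inside href, so the parser-based approach above remains the safer choice.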

Solution 1
4 2016-02-19 04:30:01
