
Replacing the nth substring within a string in Python using regex

I'm writing a function to edit many strings in an HTML file at once. The requirements are a bit peculiar, however. Here's an example.

My string:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="/uploads/3/3/9/3/3393839/____________________________________________________________________________________________________________________________________________________614162727.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

The actual string is much longer! I'm trying to replace every image that sits inside a Wikipedia link with one image, and every image that sits inside a link to another site with a different image.

Here's what I have so far:

wikiPath = r"www.somewebsite.com/myimage.png"

def dePolute(myString):

    newString =""

    # Last index found
    lastIndex = 0


    while True:
        wikiIndex = myString.index('wikipedia',lastIndex)
        picStartIndex = myString.index('<img ', wikiIndex)
        picEndIndex = myString.index('/>', wikiIndex)

        newString = re.sub(r'<img.*?/>','src="' + wikiPath ,myString,1)

    return newString 

So this obviously doesn't work, but the idea I had was to first find the index of the 'wiki' keyword that exists in all of those links, and then substitute the img tag that follows that index. Unfortunately, I don't know how to make re.sub start at a particular index. I can't do newString = re.sub(specification, newEntry, originalString[wikiIndex:]) because that would return a substring and not the entire string.


This is what I would like my string to look like after the program finishes running:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="www.somewebsite.com/myimage.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

I would do that with an HTML parser, like BeautifulSoup.

The idea is to use a CSS selector to locate img elements nested inside a elements whose href contains wikipedia. For every such img element, replace the src attribute value:

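from bs4 import BeautifulSoup

# Replacement image path, as defined in the question
wikiPath = r"www.somewebsite.com/myimage.png"

data = """your HTML"""

soup = BeautifulSoup(data, "html.parser")

# Select every img that has a src attribute and sits inside a link whose href contains "wikipedia"
for img in soup.select('a[href*="wikipedia"] img[src]'):
    img["src"] = wikiPath

print(soup.prettify())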
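
If a parser is not an option, the index-based idea from the question can still be made to work with the standard library alone. The following is only a rough sketch under several assumptions: the function name replace_wiki_images, the constant WIKI_PATH, and the sample markup are made up for illustration, src values are assumed to be double-quoted, and the word "wikipedia" is assumed to appear only inside the relevant href. It relies on the fact that a compiled pattern's search() accepts start and end positions, so the original string never has to be sliced and the whole string can be rebuilt piece by piece:

```python
import re

# Hypothetical replacement path, copied from the question's example.
WIKI_PATH = "www.somewebsite.com/myimage.png"

# Group 1 captures everything up to the opening quote of src,
# group 2 captures the closing quote. Only double-quoted src is handled.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")[^"]*(")')


def replace_wiki_images(html, new_src=WIKI_PATH):
    """Swap the src of the first <img> found between each 'wikipedia'
    occurrence and the next closing </a> tag."""
    pieces = []
    copied = 0       # everything before this index is already in pieces
    search_from = 0  # where to look for the next 'wikipedia'
    while True:
        wiki_index = html.find("wikipedia", search_from)
        if wiki_index == -1:
            break
        # Limit the image search to the current link, if it is closed.
        close = html.find("</a>", wiki_index)
        end = close if close != -1 else len(html)
        match = IMG_SRC.search(html, wiki_index, end)
        if match is None:
            # No image in this link; move past this occurrence.
            search_from = wiki_index + len("wikipedia")
            continue
        pieces.append(html[copied:match.start()])
        pieces.append(match.group(1) + new_src + match.group(2))
        copied = search_from = match.end()
    pieces.append(html[copied:])
    return "".join(pieces)


# Made-up sample markup, shortened from the question's string.
sample = (
    "<a href='http://en.wikipedia.org/wiki/Velocity'>\n"
    '<img src="/uploads/old1.png" alt="Picture" />\n'
    "</a>\n"
    "<a href='http://www2.franciscan.edu/page.htm'>\n"
    '<img src="/uploads/old2.png" alt="Picture" />\n'
    "</a>\n"
)

print(replace_wiki_images(sample))
```

This keeps the rest of the markup byte-for-byte intact, which prettify() does not, but it is far more fragile than the parser approach: single-quoted attributes, nested links, or "wikipedia" appearing in ordinary text would all need extra handling.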
