I'm trying to replace all instances of href="../directory" with href="../directory/index.html".
In Python, this
reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')
produces the following output:
href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"
href="../paternalism/index.html"
href="../principle-beneficence/index.htmlindex.htmlindex.html"
href="../decision-capacity/index.htmlindex.htmlindex.html"
Any idea why it works with the second link, but the others don't?
Relevant part of the source:
<p>
<a href="../personal-autonomy/">autonomy: personal</a> |
<a href="../principle-beneficence/">beneficence, principle of</a> |
<a href="../decision-capacity/">decision-making capacity</a> |
<a href="../legal-obligation/">legal obligation and authority</a> |
<a href="../paternalism/">paternalism</a> |
<a href="../identity-personal/">personal identity</a> |
<a href="../identity-ethics/">personal identity: and ethics</a> |
<a href="../respect/">respect</a> |
<a href="../well-being/">well-being</a>
</p>
EDIT: The repeated 'index.html' is actually the result of multiple matches (e.g. href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" appears because ../personal-autonomy is found four times in the original source).
As a general regex question, how would you replace all instances without adding an additional 'index.html' to all matches?
import re
from lxml import html

def replace_link(link):
    # Append index.html only to links of the form ../directory/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

print(html.rewrite_links(your_html_text, replace_link))
<p>
<a href="../personal-autonomy/index.html">autonomy: personal</a> |
<a href="../principle-beneficence/index.html">beneficence, principle of</a> |
<a href="../decision-capacity/index.html">decision-making capacity</a> |
<a href="../legal-obligation/index.html">legal obligation and authority</a> |
<a href="../paternalism/index.html">paternalism</a> |
<a href="../identity-personal/index.html">personal identity</a> |
<a href="../identity-ethics/index.html">personal identity: and ethics</a> |
<a href="../respect/index.html">respect</a> |
<a href="../well-being/index.html">well-being</a>
</p>
I think I found the problem:
reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')
Here 'input_html' is modified inside the for loop, and then the same 'input_html' is searched again for the regex, which is the bug :)
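One way to avoid touching text that has already been rewritten is to do the whole substitution in a single pass with re.sub. This is only a minimal sketch: the sample input_html below is made up for illustration (one link is repeated on purpose), and the pattern assumes every link has the form ../name/.

```python
import re

# Hypothetical sample input; one link appears twice on purpose.
input_html = (
    '<a href="../personal-autonomy/">autonomy: personal</a> |\n'
    '<a href="../paternalism/">paternalism</a> |\n'
    '<a href="../personal-autonomy/">autonomy again</a>'
)

# Group 1 captures everything up to the trailing slash of ../name/;
# the lookahead checks that the closing quote follows without consuming it.
link_re = re.compile(r'(<a href="\.\./[^"/]+/)(?=")')

# re.sub visits each match exactly once, so no link is rewritten twice.
output_html = link_re.sub(r'\g<1>index.html', input_html)
print(output_html)
```

Because each match is replaced where it was found, a repeated link gets exactly one index.html per occurrence.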
Have you tried escaping your first two dots?
reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')
But I would try to use lxml instead.
The problem is the content of the a-tag also matches what you try to replace.
It's in no way the ideal way to do it, but I think you will find it works correctly if you replace your regex with:
reg = re.compile(r'<a href="(\.\./.*?)">')
There is an error in your regex: the .. does not match two literal dots. Instead, each . is the metacharacter that matches any character. To mean a literal dot, you need to escape it.
Your regex should be: <a href="\\.\\./(.*?)"
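To see the difference, here is a small sketch comparing the unescaped and escaped patterns. The test strings are invented for illustration; "ab/" stands in where "../" would normally be.

```python
import re

unescaped = re.compile(r'<a href="../(.*?)">')
escaped = re.compile(r'<a href="\.\./(.*?)">')

# Hypothetical test strings, not taken from the original page.
odd_link = '<a href="ab/page/">'
real_link = '<a href="../page/">'

# Unescaped dots accept any two characters, so "ab/" matches too.
print(bool(unescaped.search(odd_link)))
# Escaped dots require a literal "..", so "ab/" is rejected.
print(bool(escaped.search(odd_link)))
# The real link still matches, and group 1 holds the part after "../".
print(escaped.search(real_link).group(1))
```

On the source in the question both patterns find the same links, so escaping makes the regex correct but probably does not explain the repeated index.html by itself.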
Besides, assuming all your hrefs are of the form ../somedirectory/ you can get away with a simpler regex:
for match in re.compile(r'<a href="(.*?)"').findall(html):
    html = html.replace(match, match + "index.html")
Here, the regex matches:
<a href=" # start of the tag and attribute
( # start of a group
.*? # any character, as few times as possible
) # end of group
" # end of the attribute
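The annotated pattern from the last answer can also be written as a real commented regex using re.VERBOSE. This is just a sketch: the sample line is hypothetical, and because whitespace is ignored in verbose mode, the literal space after "<a" is written as [ ].

```python
import re

# The annotated pattern, with the comments living inside the regex itself.
href_re = re.compile(r'''
    <a[ ]href="   # start of the tag and attribute
    (             # start of a group
    .*?           # any character, as few times as possible
    )             # end of group
    "             # end of the attribute
''', re.VERBOSE)

sample = '<a href="../respect/">respect</a>'  # hypothetical sample line
print(href_re.findall(sample))
```

With a single group in the pattern, findall returns just the captured href values.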