Python and re.compile return inconsistent results

Question

I am trying to replace all instances of href="../directory" with href="../directory/index.html".

In Python,

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

produces the following output:

href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"  
href="../paternalism/index.html"  
href="../principle-beneficence/index.htmlindex.htmlindex.html"  
href="../decision-capacity/index.htmlindex.htmlindex.html"

Any idea why it works for the second link but not for the others?

Relevant part of the source:

<p> 

 <a href="../personal-autonomy/">autonomy: personal</a> |
 <a href="../principle-beneficence/">beneficence, principle of</a> |
 <a href="../decision-capacity/">decision-making capacity</a> |
 <a href="../legal-obligation/">legal obligation and authority</a> |
 <a href="../paternalism/">paternalism</a> |
 <a href="../identity-personal/">personal identity</a> |
 <a href="../identity-ethics/">personal identity: and ethics</a> |
 <a href="../respect/">respect</a> |
 <a href="../well-being/">well-being</a> 

</p>

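Edit: the repeated 'index.html' is actually the result of multiple matches. (For example, href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" appears because ../personal-autonomy is found four times in the original source.)

As a general regex question, how can I replace all instances without adding an extra 'index.html' for every match?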

Answer 1

Don't parse HTML with regular expressions:

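import re
from lxml import html

def replace_link(link):
    # Append index.html only to links of the form ../name/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# your_html_text is the page source as a string
print(html.rewrite_links(your_html_text, replace_link))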

Output

<p> 

 <a href="../personal-autonomy/index.html">autonomy: personal</a> |
 <a href="../principle-beneficence/index.html">beneficence, principle of</a> |
 <a href="../decision-capacity/index.html">decision-making capacity</a> |
 <a href="../legal-obligation/index.html">legal obligation and authority</a> |
 <a href="../paternalism/index.html">paternalism</a> |
 <a href="../identity-personal/index.html">personal identity</a> |
 <a href="../identity-ethics/index.html">personal identity: and ethics</a> |
 <a href="../respect/index.html">respect</a> |
 <a href="../well-being/index.html">well-being</a> 

</p>
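
To see why the callback above does not stack up suffixes, it can help to call it directly on a few links. This is only a sketch: the sample links are made up for illustration, and it exercises just the regex check, not lxml itself.

```python
import re

def replace_link(link):
    # Same check as in the answer: only links of the form ../name/ are changed.
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# Hypothetical sample links, not taken from the original page.
samples = [
    "../respect/",             # directory link: gets the suffix
    "../respect/index.html",   # already rewritten: left alone
    "http://example.com/",     # does not start with ../ : left alone
    "../a/b/",                 # two path levels: left alone
]

results = {link: replace_link(link) for link in samples}
for before, after in results.items():
    print(before, "->", after)
```

Because a link that already ends in index.html no longer matches the pattern, running the callback twice over the same page should not append the suffix a second time.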

Answer 2

I think I found the problem:

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

Here, 'input_html' is modified inside the for loop, and the same 'input_html' is then searched again for the regex, which is a bug :)
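
A small reproduction may make the repetition easier to see. This is a sketch with a made-up two-link sample, and it assumes the loop accumulates its changes in one string (which is what the output in the question suggests). findall returns one entry per occurrence, while str.replace rewrites every occurrence each time it is called.

```python
import re

# Hypothetical sample: the same link appears twice in the page.
input_html = (
    '<a href="../respect/">respect</a> | '
    '<a href="../respect/">respect (again)</a>'
)

reg = re.compile(r'<a href="\.\./(.*?)">')
matches = reg.findall(input_html)  # one entry per occurrence, duplicates included

output_html = input_html
for match in matches:
    # str.replace rewrites every occurrence, so a link that occurs
    # N times ends up with the suffix appended N times.
    output_html = output_html.replace(match, match + 'index.html')

print(matches)
print(output_html)
```

With two occurrences of the same link, both hrefs end in index.htmlindex.html, which mirrors the pattern in the question's output.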

Answer 3

Shouldn't you escape your first two . characters?

reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')

But I would try using lxml instead.
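
To illustrate what escaping changes, here is a minimal comparison of the two patterns. The sample link is hypothetical and deliberately has no leading ../ in it.

```python
import re

unescaped = re.compile(r'<a href="../(.*?)">')       # each . matches any character
escaped = re.compile(r'<a[ ]href="[.][.]/(.*?)">')   # [.] matches a literal dot only

# Hypothetical link whose path starts with "ab/" rather than "../"
sample = '<a href="ab/docs/">docs</a>'

loose = unescaped.findall(sample)   # "ab" is accepted in place of the two dots
strict = escaped.findall(sample)    # no literal ".." here, so nothing matches

print(loose)
print(strict)
```

Both patterns still find the link when the href really does start with ../, so escaping only removes the unintended matches.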

Answer 4

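The problem is that the content of the a tag also matches what you are trying to replace.

This is by no means the ideal approach, but I think it should work if you replace your regex with: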

reg = re.compile(r'<a href="(\.\./.*?)">')
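
One possible way to use a pattern like this is a single re.sub pass instead of findall plus str.replace, so each link is rewritten exactly once where it was matched. This is a sketch on a made-up sample, not necessarily what the answerer had in mind.

```python
import re

# Hypothetical sample with one link repeated.
input_html = (
    '<a href="../respect/">respect</a> | '
    '<a href="../respect/">respect</a> | '
    '<a href="../well-being/">well-being</a>'
)

reg = re.compile(r'<a href="(\.\./.*?)">')

# \g<1> is the captured href; each match is rewritten once, in place,
# so repeated links do not pile up extra suffixes.
output_html = reg.sub(r'<a href="\g<1>index.html">', input_html)

print(output_html)
```

Since the replacement includes the surrounding `<a href="` and `">`, link text that happens to contain the same words is not touched.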

Answer 5

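There is a bug in your regex: .. does not match two dots. Instead, . is a metacharacter that matches any character. To match a literal dot, you need to escape it.

Your regex should be: <a href="\\.\\./(.*?)" (in a raw string, write \.\./ with single backslashes)

Also, assuming all of your hrefs have the form ../somedirectory/, you can get away with a simpler regex: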

import re
# set() drops duplicate hrefs so each distinct link is rewritten only once
for match in set(re.compile(r'<a href="(.*?)"').findall(html)):
    html = html.replace(match, match + "index.html")

Here, the regex matches

<a href="    # start of the tag and attribute
(            # start of a group
 .*?         # any character, as few times as possible
)            # end of group
"            # end of the attribute

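The annotated breakdown above can be written almost verbatim as a real pattern with re.VERBOSE, which ignores unescaped whitespace and treats # as the start of a comment. A minimal sketch, with a made-up sample link:

```python
import re

link_re = re.compile(r'''
    <a\ href="   # start of the tag and attribute (the space must be escaped)
    (            # start of a group
     .*?         # any character, as few times as possible
    )            # end of group
    "            # end of the attribute
''', re.VERBOSE)

# Hypothetical sample link
sample = '<a href="../respect/">respect</a>'
found = link_re.findall(sample)

print(found)
```

The only change from the one-line version is the escaped space in `<a\ href`, since VERBOSE mode would otherwise drop it.
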
Answer summary (5 answers)

Answer 1: 5, accepted, 2011-01-27 14:26:48
Answer 2: 1, 2011-01-27 14:09:07
Answer 3: 0, 2011-01-27 12:44:45
Answer 4: 0, 2011-01-27 13:11:16
Answer 5: 0, 2011-01-27 13:15:38
