Python and re.compile return inconsistent results

Question

I'm trying to replace all instances of href="../directory" with href="../directory/index.html".

In Python, this

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

produces the following output:

href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"  
href="../paternalism/index.html"  
href="../principle-beneficence/index.htmlindex.htmlindex.html"  
href="../decision-capacity/index.htmlindex.htmlindex.html"

Any idea why it works for the second link but not for the others?

Relevant part of the source:

<p> 

 <a href="../personal-autonomy/">autonomy: personal</a> |
 <a href="../principle-beneficence/">beneficence, principle of</a> |
 <a href="../decision-capacity/">decision-making capacity</a> |
 <a href="../legal-obligation/">legal obligation and authority</a> |
 <a href="../paternalism/">paternalism</a> |
 <a href="../identity-personal/">personal identity</a> |
 <a href="../identity-ethics/">personal identity: and ethics</a> |
 <a href="../respect/">respect</a> |
 <a href="../well-being/">well-being</a> 

</p>

EDIT: The repeated 'index.html' is actually the result of multiple matches (e.g. href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" appears because ../personal-autonomy is found four times in the original source).

As a general regex question, how would you replace all instances without adding an extra 'index.html' for every match?

Answer 1

Don't parse HTML with regexes:

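import re
from lxml import html

def replace_link(link):
    # Append index.html only to links of the form ../directory/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# your_html_text is the HTML source as a string
print(html.rewrite_links(your_html_text, replace_link))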

Output

<p> 

 <a href="../personal-autonomy/index.html">autonomy: personal</a> |
 <a href="../principle-beneficence/index.html">beneficence, principle of</a> |
 <a href="../decision-capacity/index.html">decision-making capacity</a> |
 <a href="../legal-obligation/index.html">legal obligation and authority</a> |
 <a href="../paternalism/index.html">paternalism</a> |
 <a href="../identity-personal/index.html">personal identity</a> |
 <a href="../identity-ethics/index.html">personal identity: and ethics</a> |
 <a href="../respect/index.html">respect</a> |
 <a href="../well-being/index.html">well-being</a> 

</p>
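
To see which links the callback touches, it can be exercised on its own, without lxml. This is only a small check of the replace_link function from the answer; the sample links are made up for illustration.

```python
import re

def replace_link(link):
    # Same callback as in the answer: only ../directory/ gets a suffix.
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# Hypothetical sample links, not taken from the original page.
samples = [
    "../respect/",             # one directory level: rewritten
    "../respect/index.html",   # already has a file name: unchanged
    "http://example.com/",     # does not start with ../: unchanged
    "../a/b/",                 # two directory levels: unchanged
]
results = [replace_link(link) for link in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

Because the pattern ends with $, a link that already ends in index.html is left alone, so running the rewrite twice does not stack suffixes.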

Answer 2

I think I found the problem:

reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')

Here 'input_html' is modified inside the for loop, and then the same 'input_html' is searched again for the regex, which is the bug :)
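
The compounding effect can be reproduced in a few lines. This is a minimal sketch that assumes the loop accumulates its replacements in one string; the sample input and the decision to start output_html as a copy of input_html are illustrative assumptions, not the asker's exact code.

```python
import re

# Hypothetical input in which the same link appears twice.
input_html = (
    '<a href="../respect/">respect</a> '
    '<a href="../respect/">respect again</a>'
)

reg = re.compile(r'<a href="\.\./(.*?)">')
matches = reg.findall(input_html)  # one entry per occurrence, duplicates included

output_html = input_html
for match in matches:
    # str.replace rewrites every occurrence of the substring, so each
    # duplicate match appends another 'index.html' to all of them.
    output_html = output_html.replace(match, match + 'index.html')

print(matches)
print(output_html)
```

A link that occurs N times is therefore rewritten N times, which is consistent with the four suffixes on personal-autonomy described in the question's edit.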

Answer 3

Have you tried escaping your first two . characters?

reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')

But I would try to use lxml instead.
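
A quick way to see what escaping changes is to try both patterns on an href that does not start with ../ at all. The sample string below is hypothetical.

```python
import re

unescaped = re.compile(r'<a href="../(.*?)">')          # . matches any character
escaped = re.compile(r'<a href="\.\./(.*?)">')          # \. matches a literal dot
bracketed = re.compile(r'<a[ ]href="[.][.]/(.*?)">')    # [.] is an equivalent spelling

# Hypothetical href that begins with "ab/" instead of "../".
sample = '<a href="ab/page/">'

print(bool(unescaped.search(sample)))  # the two unescaped dots accept "ab"
print(bool(escaped.search(sample)))
print(bool(bracketed.search(sample)))
```

The unescaped pattern accepts the sample because each dot stands for any character, while the escaped and bracketed patterns only accept a literal ../ prefix.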

Answer 4

The problem is that the content of the a tag also matches what you are trying to replace.

It's by no means the ideal way to do it, but I think you will find it works correctly if you replace your regex with:

reg = re.compile(r'<a href="(\.\./.*?)">')
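
One way to avoid touching the link text at all is to do the replacement in a single re.sub pass that is anchored on the attribute itself. This is a sketch under the assumption that every link has the exact form <a href="../something/">; the function name add_index and the sample string are hypothetical.

```python
import re

# Group 1: the opening of the tag up to and including the trailing slash.
# Group 2: the closing quote and bracket.
LINK = re.compile(r'(<a href="\.\./[^"]*/)(">)')

def add_index(html_text):
    # Insert index.html between the two groups, once per link occurrence.
    return LINK.sub(r'\g<1>index.html\g<2>', html_text)

# Hypothetical input: a repeated link, plus link text that looks like a path.
src = '<a href="../respect/">respect</a> | <a href="../respect/">respect/</a>'
out = add_index(src)
print(out)
```

Since the pattern requires the slash to sit directly before the closing quote, links that already end in index.html no longer match, so applying add_index a second time leaves the text unchanged.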

Answer 5

There is an error in your regex: the .. does not specifically match two dots. Each . is a metacharacter that matches any character; to mean a literal dot, you need to escape it.

Your regex should be: <a href="\\.\\./(.*?)" (in a raw string such as r'...', write \.\. instead of \\.\\.)

Besides, assuming all your href values are of the form ../somedirectory/, you can get away with a simpler regex:

for match in re.compile(r'<a href="(.*?)"').findall(html):
    html = html.replace(match, match + "index.html")

Here, the regex matches:

<a href="    # start of the tag and attribute
(            # start of a group
 .*?         # any character, as few times as possible (non-greedy)
)            # end of group
"            # end of the attribute

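The annotated breakdown above can also be written directly as a pattern using re.VERBOSE, which lets the comments live inside the regex. This is only an illustrative sketch; the name href_value and the sample string are assumptions.

```python
import re

href_value = re.compile(r'''
    <a\ href="   # start of the tag and attribute (the space must be escaped)
    (            # start of a group
      .*?        # any character, as few times as possible
    )            # end of group
    "            # end of the attribute
''', re.VERBOSE)

# Hypothetical one-link sample.
sample = '<a href="../respect/">respect</a>'
found = href_value.findall(sample)
print(found)
```

In verbose mode unescaped whitespace in the pattern is ignored, which is why the space between a and href is written as a backslash followed by a space.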
5 answers

Answer 1: score 5, accepted, 2011-01-27 14:26:48
Answer 2: score 1, 2011-01-27 14:09:07
Answer 3: score 0, 2011-01-27 12:44:45
Answer 4: score 0, 2011-01-27 13:11:16
Answer 5: score 0, 2011-01-27 13:15:38
