Need help with a Python regular expression

Question

I want to extract the link from this HTML, and I have tried using a regex for it:

<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>

This is what I have so far, but it doesn't match all the way to the end of the link:

/Attraction_Review-\\w+-\\w+-\\w+

It only matches:

/Attraction_Review-g1787072-d2242305-Reviews

How can I make it match up to the .html part?

I want it to capture the whole link.

Also, the link is generated dynamically, so it has no fixed length.

Answer 1

How about an alternative to the regex approach: use an HTML parser to get the onclick attribute value, and a JavaScript parser to extract the last function argument.

Here I'm using the BeautifulSoup and slimit parsers:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>"""

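# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")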

# get onclick value
onclick = soup.find("div", class_="title", onclick=True)["onclick"]

# parse onclick js code
parser = Parser()
tree = parser.parse(onclick)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.FunctionCall):
        print(node.args[-1].value)

Prints:

'/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html'

I understand that using a JavaScript parser for such a simple, straightforward piece of JavaScript might be overkill; feel free to replace that part with a regex. But make sure the HTML itself is parsed with an HTML parser.
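For the regex variant, here is a minimal sketch. It assumes every link starts with /Attraction_Review- and ends with .html, as in the sample. To stay self-contained it uses the standard library's html.parser in place of BeautifulSoup; the names OnclickCollector, LINK_RE and extract_links are made up for this illustration. The original pattern stops at "Reviews" because \w does not match a hyphen, so the character class [\w-] is used instead.

```python
import re
from html.parser import HTMLParser

data = """<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>"""


class OnclickCollector(HTMLParser):
    """Collect the onclick values of <div class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "title" in classes and attrs.get("onclick"):
            self.onclicks.append(attrs["onclick"])


# [\w-]+ matches word characters and hyphens, so the match can run
# through every hyphen-separated segment up to the literal ".html".
LINK_RE = re.compile(r"/Attraction_Review-[\w-]+\.html")


def extract_links(html):
    """Return every matching link found in the collected onclick values."""
    collector = OnclickCollector()
    collector.feed(html)
    collector.close()
    links = []
    for onclick in collector.onclicks:
        links.extend(LINK_RE.findall(onclick))
    return links


print(extract_links(data))
```

Unlike the slimit version, this returns the link without the surrounding quotes, because the pattern matches only the path itself.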

Answer 1: 3, accepted, 2015-11-01 00:29:02
