I am having trouble figuring out how to select part of an HTML link using a regex.
Say the link is:
<a href="race?raceid=1234">Mushroom Cup</a>
I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get '1234>Mushroom Cup'.
I'm new to regular expressions and it is just too much for me to comprehend.
Something like:
re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
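For reference, here is a small self-contained sketch of that pattern run against the link from the question. The variable names (html_text, pattern, matches) are only illustrative, and it assumes every link has exactly this shape: same attribute order, double quotes, and no nested tags in the link text.

```python
import re

# Sample input taken from the question.
html_text = '<a href="race?raceid=1234">Mushroom Cup</a>'

# Group 1 captures the digits after "raceid=".
# Group 2 captures the link text, i.e. everything up to the next "<".
pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')

# With two groups, findall returns a list of (race_id, name) tuples.
matches = pattern.findall(html_text)
print(matches)  # [('1234', 'Mushroom Cup')]

for race_id, name in matches:
    print(race_id, name)
```

The `[^<]+` part is what stops the match at the closing tag, which is how you get 'Mushroom Cup' on its own instead of '1234>Mushroom Cup'.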
Don't ever use regex for parsing HTML. Instead, use an HTML parser like lxml or BeautifulSoup.
Here's an example using BeautifulSoup:
import urlparse
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<html>
<head>
<title>Python regex url grab - Stack Overflow</title>
</head>
<body>
<a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""")
link = soup.find('a')
par = urlparse.parse_qs(urlparse.urlparse(link.attrs['href']).query)
print par['raceid'][0] # prints 1234
print link.text # prints Mushroom Cup
Note that urlparse is used to get the value of the link's query parameter. See more here: Retrieving parameters from a URL.
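The example above is written for Python 2. In Python 3 the urlparse module was moved to urllib.parse, so the query-string step might look roughly like this. It is a minimal sketch using only the standard library, with the href value hard-coded from the question rather than pulled out of a parsed document.

```python
from urllib.parse import urlparse, parse_qs

# href value as it appears in the question's link.
href = "race?raceid=1234"

# urlparse splits the URL into components; .query is the part after "?".
query = urlparse(href).query

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(query)

race_id = params["raceid"][0]
print(race_id)  # 1234
```

parse_qs returns lists because a parameter can appear more than once in a query string, hence the [0].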
Hope that helps.