Python regex url grab

Question

I am having trouble figuring out how to select part of an HTML link using regex.

Say the link is:

<a href="race?raceid=1234">Mushroom Cup</a>

I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get 1234>Mushroom Cup.

I'm new to regular expressions and it is just too much for me to comprehend.

Answer 1

Something like this:

import re
re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
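As a quick check, here is a minimal sketch of that pattern run against the link from the question. The contents of html_text are an assumption for illustration, and the pattern only matches anchors written in exactly this form (same attribute order, same quoting).

```python
import re

# Sample input, assumed for illustration: the link from the question.
html_text = '<a href="race?raceid=1234">Mushroom Cup</a>'

# Group 1 captures the digits of the race id; group 2 captures the link
# text, i.e. everything up to the next '<'.
pattern = r'<a href="race\?raceid=(\d+)">([^<]+)</a>'

# With two groups, findall returns one (id, text) tuple per matching link.
matches = re.findall(pattern, html_text)
print(matches)  # [('1234', 'Mushroom Cup')]

for race_id, name in matches:
    print(race_id, name)  # 1234 Mushroom Cup
```

The second group, ([^<]+), is what isolates the link text: it stops at the first '<', so the closing tag is never included.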

Answer 2

Don't ever use regex for parsing HTML. Instead use HTML parsers like lxml or BeautifulSoup.

Here's an example using BeautifulSoup:

from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 this was the urlparse module
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html>
<head>
    <title>Python regex url grab - Stack Overflow</title>
</head>
<body>
    <a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""", "html.parser")

link = soup.find('a')
# parse_qs maps each query parameter name to a list of its values
par = parse_qs(urlparse(link.attrs['href']).query)
print(par['raceid'][0])   # prints 1234
print(link.text)   # prints Mushroom Cup

Note that urlparse is used to get the link parameter's value. For more, see: Retrieving parameters from a URL.
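To show that step in isolation, here is a small sketch using only the standard library. It assumes Python 3, where urlparse and parse_qs live in urllib.parse; the href string is copied from the question's link.

```python
from urllib.parse import urlparse, parse_qs

# The href value from the question's link.
href = "race?raceid=1234"

# urlparse splits the URL into components; .query is the part after '?'.
query = urlparse(href).query
print(query)  # raceid=1234

# parse_qs returns a dict mapping each parameter name to a list of values,
# because a parameter may appear more than once in a query string.
params = parse_qs(query)
print(params)  # {'raceid': ['1234']}

race_id = params["raceid"][0]
print(race_id)  # 1234
```

The [0] index is needed because each value is a list, even when the parameter appears only once.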

Hope that helps.

2 answers

solution1
1 ACCEPTED 2013-08-19 21:02:59

solution2
1 2013-08-19 21:05:45