
Python regex url grab

I am having trouble figuring out how to select part of an HTML link using a regex.

Say the link is:

<a href="race?raceid=1234">Mushroom Cup</a>

I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get 1234>Mushroom Cup.

I'm new to regular expressions and it is just too much for me to comprehend.

Something like:

re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
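To see what that call returns, here is a small self-contained run. It is only a sketch: the first link comes from the question, while the second one (Flower Cup, id 5678) is invented to show that findall yields one tuple per matching link.

```python
import re

# First link is from the question; the second is a made-up extra for illustration.
html_text = '''
<a href="race?raceid=1234">Mushroom Cup</a>
<a href="race?raceid=5678">Flower Cup</a>
'''

# Group 1 captures the digits of the race id; group 2 captures the link text
# (every character up to the next '<').
pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')

matches = pattern.findall(html_text)
print(matches)  # [('1234', 'Mushroom Cup'), ('5678', 'Flower Cup')]

# If only the names are wanted, keep the second element of each tuple.
names = [name for _race_id, name in matches]
print(names)  # ['Mushroom Cup', 'Flower Cup']
```

The raw-string prefix (r'...') keeps the backslashes in \? and \d from being treated as string escapes. Note that the pattern only matches links written exactly in this form, with the same attribute, quoting, and spacing.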

Don't ever use regex for parsing HTML. Instead, use an HTML parser such as lxml or BeautifulSoup.

Here's an example using BeautifulSoup:

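# Python 3; on Python 2 these functions live in the urlparse module
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html>
<head>
    <title>Python regex url grab - Stack Overflow</title>
</head>
<body>
    <a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""", "html.parser")

# Find the first <a> element in the document
link = soup.find('a')
# Parse the query string of the href into a dict mapping names to lists of values
par = parse_qs(urlparse(link.attrs['href']).query)
print(par['raceid'][0])   # prints 1234
print(link.text)   # prints Mushroom Cup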

Note that urlparse is used to get the value of the link's query parameter. See more here: Retrieving parameters from a URL.
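To make the query-string step more concrete, here is a minimal sketch of what happens to the href on its own. It assumes Python 3, where these functions live in urllib.parse (in Python 2 they are in the urlparse module); the href value is the one from the question.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 location of these helpers

href = "race?raceid=1234"  # the href value from the question

# Split the (relative) URL into its components.
parts = urlparse(href)
print(parts.path)   # race
print(parts.query)  # raceid=1234

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(parts.query)
print(params)  # {'raceid': ['1234']}

race_id = params["raceid"][0]
print(race_id)  # 1234
```

The value comes back as a string, so wrap it in int() if a number is needed.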


Hope that helps.
