
Beautiful Soup: finding an href based on hyperlink text

I'm having trouble getting Beautiful Soup to find an <a> tag with a specific title (its hyperlink text) and extract only the href.

I have the code below, but I can't get it to return only the href value (whatever is between the opening " and closing ") based on the hyperlink text of that link.

import bs4
import requests

res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
temp_tag_href = soup.select_one("a[href*=some text]")
sometexthrefonly = temp_tag_href.attrs['href']

Effectively, I would like it to go through the entire HTML parsed in soup and return only what is between the href's opening " and closing ", chosen because that link's hyperlink text is 'some text'.

So the steps would be:

1: parse the HTML,
2: look at all the <a> tags that have an href,
3: find the one whose hyperlink text is 'some text',
4: output only what is between the href's " " (not including the quotes themselves).

Any help would be greatly appreciated!

ahmed,

So after some quick refreshers on requests and researching the BeautifulSoup library, I think you'll want something like the following:

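import bs4
import requests

res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# keep only <a> tags that have an href and whose visible text is exactly 'some text'
link = list(filter(lambda x: x.get_text(strip=True) == 'some text', soup.find_all('a', href=True)))[0]
print(link['href'])  # since you don't specify where the output should go, I'll use stdout for simplicity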

As it turns out, the Beautiful Soup documentation describes a convenient way to access any attribute of an HTML element using dictionary lookup syntax (for example, link['href']). The library also supports many other kinds of lookups, such as searching by tag name, by attribute, or by text.
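As a minimal sketch of that dictionary-style access, the snippet below reads attributes in three ways and then looks a tag up by its text. The markup and variable names are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = '<p><a href="/docs" title="Docs">some text</a> <a name="top">no href here</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.find_all('a')

href = first['href']           # dictionary-style lookup; raises KeyError if the attribute is missing
title = first.attrs['title']   # .attrs is the underlying dict of attributes
missing = second.get('href')   # .get() returns None instead of raising

# Look a tag up by its text, then read the attribute.
by_text = soup.find('a', string='some text')
print(href, title, missing, by_text['href'])
```

Note that string= only matches a tag whose sole child is that string, so links containing nested markup need a get_text() comparison instead.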

If you are doing web scraping, it may also be useful to try a library that supports XPath (lxml, for example), which allows you to write powerful queries such as (//a[text()="some text"]/@href)[1], which will get you the href of the first link whose text is "some text".
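To give a feel for XPath-style matching on link text without installing anything, here is a rough sketch using the standard library's xml.etree.ElementTree. This is a swapped-in stand-in, not a full XPath engine: it supports only a limited subset (no /@href step, so the attribute is read with .get()), the [.='text'] predicate needs Python 3.7+, and the markup below is hypothetical and deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup. ElementTree is an XML parser and will
# reject a lot of real-world HTML, which is why lxml is usually preferred.
markup = '''<div>
  <a href="/first">other</a>
  <p><a href="/second">some text</a></p>
  <a href="/third">some text</a>
</div>'''

root = ET.fromstring(markup)

# [.='some text'] keeps <a> elements whose complete text content equals 'some text'.
matches = root.findall(".//a[.='some text']")
hrefs = [a.get('href') for a in matches]

print(hrefs[0])  # href of the first matching link in document order
```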

This should do the job:

from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<div><a href="another_url">later</a></div>
<h3><a href="yet_another_url">later</a></h3>'''

soup = BeautifulSoup(html, 'html.parser')

# iterate over every <a> tag that has an href attribute
for a in soup.find_all('a', href=True):
    print("Next HREF: %s" % a['href'])
    # compare the link text (not the href) with the text being searched for
    if a.get_text(strip=True) == 'next':
        print("Found it!")
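If the goal is to get the href values back rather than print them, the same loop can be wrapped in a small helper. This is only a sketch: the function name hrefs_for_link_text is hypothetical, and the second link in the sample has been given nested markup and extra whitespace to show that get_text(strip=True) still matches it.

```python
from bs4 import BeautifulSoup

def hrefs_for_link_text(html, text):
    """Return the href of every <a> whose visible text equals `text`."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        a['href']
        for a in soup.find_all('a', href=True)  # skip anchors without an href
        if a.get_text(strip=True) == text        # compare the link text, not the URL
    ]

html = '''<a href="some_url">next</a>
<div><a href="another_url"> <b>later</b> </a></div>
<h3><a href="yet_another_url">later</a></h3>'''

print(hrefs_for_link_text(html, 'later'))
print(hrefs_for_link_text(html, 'missing'))
```

Returning a list keeps the "no match" and "several matches" cases explicit, so the caller decides whether to take the first element or handle an empty result.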

