Here's my code to get the HTML:
from bs4 import BeautifulSoup
import urllib.request
from fake_useragent import UserAgent
url = "https://blahblah.com"
ua = UserAgent()
ran_header = ua.random
req = urllib.request.Request(url,data=None,headers={'User-Agent': ran_header})
uClient = urllib.request.urlopen(req)
page_html = uClient.read()
uClient.close()
html_source = BeautifulSoup(page_html, "html.parser")
results = html_source.find_all("a", {"onclick": "googleTag('click-listings-item-image');"})
From here, results contains various listings containing different info. If I then print(results[0]):
<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
<div class="results-panel-new col-sm-12">
<div class="row">
<div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center"><img class="img-thumbnail" src="//images/120x90/7831a94157234bc6.jpg" /></span>
</div>
<div class="col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell">
<span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Year: </span>2000</span>
</div>
<div class="col-xs-12 hidden-sm hidden-md col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center">Fake City, USA</span>
</div>
<div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
<span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Price: </span>$900</span>
</div>
</div>
<div class="row">
<div class="hidden-xs col-sm-12 table_details_new"><span>Descriptive details</span></div>
</div>
</div><!-- results-panel-new -->
</a>
I can get the image, Year, Location, and Price by doing a variation of this:
ModelYear = results[0].div.find("div",{"class":"col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell"}).span.text
How do I get the very first href from results[0]?
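You can use find_all('a', href=True) to match only a tags that have an href attribute, e.g.:
results[0].find_all('a', href=True)[0]
Note that find_all searches the descendants of results[0], so this only helps if results[0] contains nested a tags.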
Based on the chat discussion, getting the href link looks simple: results[0]['href'].
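A minimal sketch of that attribute access, assuming results[0] is itself the a tag as the printout in the question suggests. The HTML below is a trimmed, hypothetical stand-in for one listing, and the variable names first, safe_href, and nested_links are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for a single listing from the question.
sample_html = """
<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
  <div class="results-panel-new col-sm-12">
    <span class="eq-table-new text-center">Fake City, USA</span>
  </div>
</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results = soup.find_all("a", {"onclick": "googleTag('click-listings-item-image');"})

first = results[0]

# A Tag behaves like a dict of its attributes.
href = first["href"]           # raises KeyError if the attribute is missing
safe_href = first.get("href")  # returns None instead of raising

# find_all only looks at descendants, so the a tag itself is not matched here.
nested_links = first.find_all("a", href=True)

print(href)
print(nested_links)
```

Using .get('href') may be the safer choice when some matched tags might lack the attribute.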
Your selector returns an a tag element, as you can see in the printout. So yes, you can access the href directly with results[0]['href']. You can also tell this because the entire panel (the card displaying the listing) on the page is a clickable element. If you wanted to make this clearer, you could change your selector for results to #js_thumb_view ~ a. This is also a faster selector.
results = html_source.select('#js_thumb_view ~ a')
You can then collect all the links with, for example:
links = [result['href'] for result in results]
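To illustrate how the sibling selector behaves, here is a small sketch. It assumes the listings are a tags that follow an element with id js_thumb_view under the same parent. That layout is inferred from the selector, not confirmed by the question, and the sample HTML and second URL are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical page layout: listing links are later siblings of #js_thumb_view.
sample_html = """
<div id="container">
  <a href="https://blahblah.com/other">Not a listing</a>
  <div id="js_thumb_view"></div>
  <a href="https://blahblah.com//link//asdfqwersdf"><div>Listing 1</div></a>
  <a href="https://blahblah.com//link//zxcvzxcv"><div>Listing 2</div></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "~" is the general sibling combinator: it matches a tags that come after
# #js_thumb_view and share its parent, so the first link is skipped.
results = soup.select("#js_thumb_view ~ a")

links = [result["href"] for result in results]
print(links)
```

If the real page nests the listings differently, the selector would need adjusting to match.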