How can I get information from an <a href> tag within <div> tags with BeautifulSoup and Python?
Hi all. I have a quick question about BeautifulSoup with Python. I have several bits of HTML that look like this (the only differences are the links and product names) and I'm trying to get the link from the "href" attribute.
<div id="productListing1" xmlns:dew="urn:Microsoft.Search.Response.Document">
<span id="rank" style="display:none;">94.36</span>
<div class="productPhoto">
<img src="/assets/images/ocpimages/87684/00131cl.gif" height="82" width="82" />
</div>
<div class="productName">
<a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
<div class="size">40 CT</div>
I currently have this Python code:
productLinks = soup.findAll('a', attrs={'class' : 'on'})
for link in productLinks:
    print(link['href'])
This works (for every link on the page I get something like /Products/ProductInfoDisplay.aspx?SiteId=1&Product=8768400131); however, I've been trying to figure out if there's a way to get the link in the "href" attribute without searching explicitly for 'class="on"'. I guess my first question should be whether or not this is the best way to find this information (class="on" seems too generic and likely to break in the future, although my CSS and HTML skills aren't that good). I've tried numerous combinations of the find, findAll, findAllNext, etc. methods but I can't quite make it work. This is mostly what I had (I rearranged and changed it numerous times):
productLinks = soup.find('div', attrs={'class' : 'productName'}).find('a', href=True)
If this isn't a good way to do this, how can I get to the <a> tag from the <div class="productName"> tag? Let me know if you need more information.
Thank you.
Well, once you have the <div> element, you can get the <a> subelement by calling find():
productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.find('a')['href'])
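If you want to try the find() approach end to end, here is a minimal sketch. It assumes BeautifulSoup 4 (the bs4 package) on Python 3 and uses the standard library's html.parser; the sample markup is trimmed down from the question and the second product is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the second product is hypothetical.
HTML = """
<div class="productName">
  <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN</a>
</div>
<div class="productName">
  <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=1111111111">OTHER PRODUCT</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

hrefs = []
for div in soup.find_all("div", attrs={"class": "productName"}):
    link = div.find("a")        # first <a> anywhere inside this div
    hrefs.append(link["href"])  # read the href attribute from the tag

for href in hrefs:
    print(href)
```

Note that find_all is the BeautifulSoup 4 spelling of findAll; the older name still works there as an alias.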
However, since the <a> sits directly inside the <div>, you can also reach it through the a attribute of the div:
productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.a['href'])
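One thing to watch with the attribute shorthand: div.a gives you None when the div has no <a> inside, so indexing it with ['href'] would raise a TypeError. Here is a rough sketch of a guarded version, assuming BeautifulSoup 4 on Python 3; the markup and the safe_hrefs name are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one normal product, one div without a link,
# and one anchor that has no href attribute.
HTML = """
<div class="productName"><a href="/item/1">First</a></div>
<div class="productName">No link here</div>
<div class="productName"><a>Anchor without href</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

safe_hrefs = []
for div in soup.find_all("div", class_="productName"):
    anchor = div.a  # same as div.find("a"); None when there is no <a>
    if anchor is not None and anchor.has_attr("href"):
        safe_hrefs.append(anchor["href"])

print(safe_hrefs)
```

Whether you skip such divs or treat them as an error depends on how much you trust the page structure.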
Now, if you want to put all the <a> elements in a list, your code above will not work because find() just returns one element matched by its criteria. You would get the list of divs and get the subelements from them, for example, using list comprehensions:
productLinks = [div.a for div in
        soup.findAll('div', attrs={'class' : 'productName'})]
for link in productLinks:
    print(link['href'])
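As an alternative to the list comprehension, a CSS selector can express "every <a> with an href directly inside a productName div" in one call. This is only a sketch, assuming BeautifulSoup 4 with its select() method available (recent releases rely on the soupsieve package for this); the markup, including the footer link, is made up to show why matching on class="on" alone would be too broad.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two product links plus an unrelated link
# that shares the generic class "on".
HTML = """
<div class="productName"><a class="on" href="/item/1">First</a></div>
<div class="productName"><a class="on" href="/item/2">Second</a></div>
<div class="footer"><a class="on" href="/about">About</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "div.productName > a[href]" matches <a> tags that carry an href and are
# direct children of a <div class="productName">.
links = soup.select("div.productName > a[href]")
selected_hrefs = [a["href"] for a in links]

print(selected_hrefs)
```

The selector ties the search to the productName container rather than to the link's own class, which is the part the question was worried about.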
Here is a solution using BeautifulSoup4:
for data in soup.find_all('div', class_='productName'):
    for a in data.find_all('a'):
        print(a.get('href'))  # the link target
        print(a.text)  # the text inside the link
data = soup.find_all('div', class_='productName')
a_class = data[0].find_all('a')
url_ = a_class[0].get('href')
print(url_)