How do I get href links under a specific class with BeautifulSoup
Webcrawler BeautifulSoup - how do I get titles from links without class tags
The website I am trying to collect data from is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. For now, I want to get all the movie titles on this page, and then go into each link to collect the rest of the data (studio, etc.) plus some additional data. Here is what I have so far:
```python
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'div': 'body'}):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id': 'postingbody'}):
        print item_name.text

trade_spider(1)
```
The problem I am running into is with:

```python
for link in soup.findAll('a', {'div': 'body'}):
    href = 'http://www.boxofficemojo.com' + link.get('href')
```

The issue is that on the website there is no identifying class that all the links share. The links only carry plain `<a href>` tags.
How can I get all the titles of the links on this page?
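Since the title links on the chart carry no class attribute, one workaround is to filter on the `href` pattern instead. Here is a minimal, self-contained sketch against a hypothetical fragment of the chart markup (the movie IDs and nesting are illustrative, not copied from the live page):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Box Office Mojo chart markup,
# where title links have no class attribute:
html = '''
<table>
  <tr><td><b><font><a href="/movies/?id=jurassicpark4.htm">Jurassic World</a></font></b></td></tr>
  <tr><td><b><font><a href="/movies/?id=starwars7.htm">Star Wars: The Force Awakens</a></font></b></td></tr>
  <tr><td><a href="/yearly/chart/?page=2">Next page</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# All title links point at /movies/?id=..., so match on the href
# pattern; the pagination link is skipped because it does not match.
titles = [a.get_text()
          for a in soup.find_all('a', href=re.compile(r'^/movies/\?'))]
print(titles)  # ['Jurassic World', 'Star Wars: The Force Awakens']
```

The same regex filter should work on the real page, as long as the title links there also start with `/movies/?`.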
Sorry for not giving a complete answer, but here is a clue.

I have names for these kinds of problems in scraping. When I use the find() or find_all() methods, I call it "Abstract Identification", because when the tag class/id names are not data-oriented you can end up with random data.

Then there is "Nested Identification". That is when you cannot find the data using the find() or find_all() methods, and instead have to literally crawl through the nest of tags. This requires more proficiency with BeautifulSoup.

Nested Identification is a longer process and is generally messy, but sometimes it is the only solution.

So how is it done? When you have a `<class 'bs4.element.Tag'>` object, you can locate tags that are stored as attributes of that Tag object.
```python
from bs4 import element, BeautifulSoup as BS

html = ('<body>'
        '<h3>'
        '<p>Some text to scrape</p>'
        '<p>Some text NOT to scrape</p>'
        '</h3>'
        '\n\n'
        '<strong>'
        '<p>Some more text to scrape</p>'
        '\n\n'
        '<a href="www.example.com/some-url/you/find/important/">Some Important Link</a>'
        '</strong>'
        '</body>')

soup = BS(html, 'html.parser')

# Starting point to extract a link
h3_tag = soup.find('h3')  # finds the first h3 tag in the soup object

child_of_h3__p = h3_tag.p  # locates the first p tag inside the h3 tag

# climbing in the nest
child_of_h3__forbidden_p = h3_tag.p.next_sibling
# or
# child_of_h3__forbidden_p = child_of_h3__p.next_sibling

# sometimes `.next_sibling` will yield '' or '\n'; think of this element as a
# tag separator, in which case you need to keep using `.next_sibling`
# to get past the separator and onto the next tag.

# Grab the tag below the h3 tag, which is the strong tag.
# We need to go up 1 tag, and down 2 from our current object
# (down 2 so we skip the tag separator).
tag_below_h3 = child_of_h3__p.parent.next_sibling.next_sibling

# Here are 3 different ways to get to the link tag using Nested Identification

# 1.) getting a list of children from our object
children_tags = tag_below_h3.contents
p_tag = children_tags[0]
tag_separator = children_tags[1]
a_tag = children_tags[2]  # or children_tags[-1] to get the last tag

print(a_tag)
print('1.) We found the link: %s' % a_tag['href'])

# 2.) There is only 1 <a> tag, so we can just grab it directly
a_href = tag_below_h3.a['href']
print('\n2.) We found the link: %s' % a_href)

# 3.) using next_sibling to crawl
tag_separator = tag_below_h3.p.next_sibling
a_tag = tag_below_h3.p.next_sibling.next_sibling  # or tag_separator.next_sibling
print('\n3.) We found the link: %s' % a_tag['href'])

print('\nWe also found a tag separator: %s' % repr(tag_separator))

# our tag separator is a NavigableString
if type(tag_separator) == element.NavigableString:
    print("\nNavigableString's are usually plain text that reside inside a tag.")
    print('In this case however it is a tag separator.\n')
```
Now, if I remember correctly, accessing a certain tag or a tag separator can leave you with a NavigableString instead of a Tag, in which case you need to pass it through BeautifulSoup before you can use methods such as find(). To check for this, you can do:
```python
from bs4 import element, BeautifulSoup

# ... do some BeautifulSoup data mining
# ... reach a NavigableString object

if type(formerly_a_tag_obj) == element.NavigableString:
    formerly_a_tag_obj = BeautifulSoup(formerly_a_tag_obj)  # is now a soup
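To illustrate that check with a minimal, self-contained sketch (the markup here is arbitrary): note that `isinstance()` is generally preferable to a direct `type()` comparison, because it also matches NavigableString subclasses such as Comment.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<strong><p>text</p>\n<a href="/x">link</a></strong>',
                     'html.parser')

sep = soup.strong.p.next_sibling  # the '\n' between <p> and <a>
print(isinstance(sep, NavigableString))  # True: a separator, not a Tag
print(isinstance(soup.strong.a, Tag))    # True: a real tag with find() etc.
```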
One possible way is to use the .select() method, which accepts a CSS selector argument:

```python
for link in soup.select('td > b > font > a[href^=/movies/?]'):
    ...
```
A brief explanation of the CSS selectors used:

- `td > b`: finds all `td` elements, then from each `td` finds the direct child `b` elements
- `> font`: from the filtered `b` elements, finds the direct child `font` elements
- `> a[href^=/movies/?]`: from the filtered `font` elements, returns the direct child `a` elements whose `href` attribute value starts with `"/movies/?"`
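As a self-contained sketch of that selector against a hypothetical fragment of the chart markup (note that recent BeautifulSoup/soupsieve versions require the attribute value in `[href^=...]` to be quoted):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the td > b > font > a nesting on the chart page
html = ('<table><tr><td><b><font>'
        '<a href="/movies/?id=minions.htm">Minions</a>'
        '</font></b></td></tr>'
        '<tr><td><a href="/yearly/chart/?page=2">Next page</a></td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

# Quoting "/movies/?" keeps the selector valid in current soupsieve versions;
# the pagination link is excluded because it is not nested in b > font.
matches = soup.select('td > b > font > a[href^="/movies/?"]')
for a in matches:
    print(a.get_text(), a['href'])  # Minions /movies/?id=minions.htm
```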