
python scrapy extract data from website

I want to scrape data from this page. Here is my current code:

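import cStringIO
import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

buf = cStringIO.StringIO()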
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need the title, the video link and the description as separate variables. How can I achieve this?

The title can be extracted with //title/text(), and the video source link with //video/source/@src:

import re

selector = Selector(response=response)

title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

Prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
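
The regular-expression step above can be tried on its own. The following is a minimal sketch in Python 3 syntax; the `code_url` value is a hypothetical URL shaped like what the `EdImage` meta tag is assumed to contain, and `extract_code` is an illustrative helper name, not part of the answer. It returns `None` instead of raising `AttributeError` when the URL does not match.

```python
import re

# Hypothetical URL: only its "-play-small.jpg" suffix mirrors the pattern used above.
code_url = "http://example.com/images/L49VXZwfup8-play-small.jpg"


def extract_code(image_url):
    """Return the identifier preceding '-play-small.jpg', or None if absent."""
    match = re.search(r'(\w+)-play-small\.jpg$', image_url)
    return match.group(1) if match else None


print(extract_code(code_url))
print(extract_code("http://example.com/other.png"))
```

Checking for `None` before using the result avoids a crash on pages whose image URL follows a different naming scheme.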

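There is no need for scrapy to fetch a single URL. Just get that single page's HTML with a simpler tool (even the simplest urllib.urlopen(theurl).read() !) and then analyze the HTML with BeautifulSoup. From a simple "view source", it looks like you are looking for: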

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>

(the title), one of these three:

<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>

(the video links, plural; I can't pick one because you haven't told us which format you prefer!-), and

<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />

(the description). BeautifulSoup makes getting each of these simple, e.g. after the needed imports

import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title').text

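and so on (but you will have to pick one video link, and I see in their source that they are referred to as "pre-roll", so the link to the actual, non-advertising video may not be on the page at all and may only be accessible after logging in).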
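
As a rough illustration of the "simpler tools" idea, here is a sketch in Python 3 that uses only the standard library (`html.parser` instead of BeautifulSoup, which is a swapped-in technique, not what the answer uses). The names `PageFields`, `parse_fields`, `pick_source` and `fetch_html` are hypothetical helpers. The sample HTML reuses the tags quoted above, with an abbreviated description and an assumed surrounding structure, and the UTF-8 decoding in `fetch_html` is also an assumption about the page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PageFields(HTMLParser):
    """Collect the <title> text, <source> tags and the meta description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.sources = []          # list of (src, mime_type) tuples
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "source" and "src" in attrs:
            self.sources.append((attrs["src"], attrs.get("type")))
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_fields(html):
    """Feed an HTML string to the parser and return it with fields filled in."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return parser


def pick_source(sources, preferred_type="video/mp4"):
    """Return the first src of the preferred type, else the first src, else None."""
    for src, mime_type in sources:
        if mime_type == preferred_type:
            return src
    return sources[0][0] if sources else None


def fetch_html(url, timeout=15):
    """Download a page (assumes UTF-8); not called here since it needs network access."""
    return urlopen(url, timeout=timeout).read().decode("utf-8", errors="replace")


# Sample built from the tags quoted above; the wrapper elements are assumed.
sample = """
<html><head>
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<meta name="description" content="Ciao a tutti amici di guardalo" />
</head><body><video>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
</video></body></html>
"""

page = parse_fields(sample)
title = page.title
description = page.description
video_link = pick_source(page.sources)

print(title)
print(description)
print(video_link)
```

For a live page, `parse_fields(fetch_html(url))` would replace the sample string. Passing a different `preferred_type` to `pick_source` selects another format when the page offers it.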

