简体   繁体   中英

python scrapy extract data from website

I want to scrape data from this page . Here is my current code:

buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need title, video link and description as separate variables. How can I achieve this?

Title can be extracted using //title/text() , video source link via //video/source/@src :

selector = Selector(response=response)

title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

Prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8

No need for scrapy for a single-URL fetch -- just get that single page's HTML with a simpler tool (even simplest urllib.urlopen(theurl).read() !) then analyze the HTML eg with BeautifulSoup. From a simple "view source" it looks like you're looking for:

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>

(the title), one of the three:

<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>

(the video linkS, plural, and I can't pick one because you don't tell us which format[s] you prefer!-), and

<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />

(the description). BeautifulSoup makes it pretty trivial to get each one, eg after the needed imports

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text

etc etc (but you'll have to pick one video link -- and I see in their sources they're mentioned as "pre-rolls" so it may be that the links to actual non-ads videos are in fact not on the page but only accessible after a log-in or whatever).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM