
python scrapy extract data from website

I want to scrape data from this page. Here is my current code:

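import cStringIO

import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Fetch the page body into an in-memory buffer with pycurl
buf = cStringIO.StringIO()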
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need the title, video link and description as separate variables. How can I achieve this?

The title can be extracted using //title/text(), and the video source link via //video/source/@src:

import re

selector = Selector(response=response)

# <title> text, all <edindex> text nodes, and the first <source> URL in <video>
title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

# The code is whatever precedes "-play-small.jpg" in the EdImage meta URL
code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

Prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
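The code-extraction step can be tried on its own, without fetching the page. This is a minimal sketch in Python 3; the sample URL and the helper name extract_code are made up for illustration (the real EdImage value is not shown here), and the helper returns None instead of raising when the pattern is missing.

```python
import re

# Hypothetical thumbnail URL, shaped like the EdImage value the regex expects.
SAMPLE_URL = "http://static.example.org/video_image/L49VXZwfup8-play-small.jpg"

# The dot is escaped so that only a literal ".jpg" suffix matches.
CODE_RE = re.compile(r"(\w+)-play-small\.jpg$")


def extract_code(image_url):
    """Return the code that precedes '-play-small.jpg', or None if absent."""
    match = CODE_RE.search(image_url)
    return match.group(1) if match else None


print(extract_code(SAMPLE_URL))
print(extract_code("http://static.example.org/other.png"))
```

Note that \w does not match a hyphen, so a code that itself contains "-" would be cut short by this pattern.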

No need for scrapy for a single-URL fetch: just get that single page's HTML with a simpler tool (even the simplest, urllib.urlopen(theurl).read()!) and then analyze the HTML, e.g. with BeautifulSoup. From a simple "view source" it looks like you're looking for:

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>

(the title), one of the three:

<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>

(the video links, plural; I can't pick one because you don't tell us which format[s] you prefer), and

<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />

(the description). BeautifulSoup makes it pretty trivial to get each one, e.g. after the needed imports:

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text

etc. (but you'll have to pick one video link. I also see that in the page source they're named "pre-roll", so the links to the actual non-ad videos may not be on the page at all, and may only be accessible after a log-in or whatever).
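To make the "pick one video link" step concrete, here is a rough sketch in Python 3 that pulls all three fields out of the markup quoted above and then chooses a source by preferred MIME type. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The names PageFields and pick_source, the wrapping page structure, and the mp4-first preference order are all assumptions made for the example, not anything the site or the question specifies.

```python
from html.parser import HTMLParser

# Markup copied from the snippets quoted above, wrapped in a minimal page.
SAMPLE_HTML = """<html><head>
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />
</head><body><video>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>
</video></body></html>"""


class PageFields(HTMLParser):
    """Collect the <title> text, the description meta tag and <source> URLs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.sources = {}  # MIME type -> URL
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "source" and "src" in attrs:
            self.sources[attrs.get("type")] = attrs["src"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def pick_source(sources, preferred=("video/mp4", "video/webm", "video/ogg")):
    """Return the URL of the first preferred MIME type present, else None."""
    for mime in preferred:
        if mime in sources:
            return sources[mime]
    return None


page = PageFields()
page.feed(SAMPLE_HTML)
title = page.title.strip()
description = page.description
video_url = pick_source(page.sources)
print(title)
print(description)
print(video_url)
```

Passing a different preferred tuple, for example ("video/webm",), selects another format. As noted above, these particular URLs are pre-rolls, so the chosen link may still not be the actual video.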
