
Using scrapy to get links in Python?

Sorry if this is a dumb question, but I have absolutely no idea how to use Scrapy. I don't want to create a Scrapy crawler (or whatever); I want to incorporate it into my existing code. I've looked at the docs, but I found them a bit confusing.

What I need to do is get links from a list on the site. I just need an example to better understand it. Also, is it possible to have a for loop do something with each list item? They are ordered like

<ul>
  <li>example</li>
</ul>

Thanks!

Maybe you don't need Scrapy if it's that simple.

cat local.html

<html><body>
<ul>  
<li>example</li>  
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>

then...

import urllib2
from lxml import html

# Python 2: fetch the local file and parse it with lxml
page = urllib2.urlopen("file:///root/local.html")
root = html.parse(page).getroot()

# print the text of every <li>
for x in root.cssselect("li"):
    print(x.text_content())

# print the target of every <a href="...">
for x in root.xpath('//a/@href'):
    print(x)
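
If you're on Python 3, urllib2 was merged into urllib.request; a minimal equivalent sketch (assuming lxml and its cssselect dependency are installed, with the same local file as input):

# Python 3 sketch: urllib2 became urllib.request; behavior is
# otherwise the same. Assumes lxml and cssselect are installed.
from urllib.request import urlopen
from lxml import html

page = urlopen("file:///root/local.html")
root = html.parse(page).getroot()

for li in root.cssselect("li"):        # text of every list item
    print(li.text_content())

for href in root.xpath('//a/@href'):   # every link target
    print(href)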

You might want to consider BeautifulSoup, which is great for parsing HTML/XML; its documentation is quite helpful as well. Getting the links would be something like:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

# Python 2 / BeautifulSoup 3: fetch the page with httplib2
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# only parse <a> tags, then print each link target
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']

SoupStrainer removes the need to parse the entire thing when all you're after are the links. 当您只需要链接时,SoupStrainer无需解析整个内容。
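
For what it's worth, in the newer beautifulsoup4 package the keyword is parse_only rather than parseOnlyThese, and Tag.has_key() is gone; a rough equivalent sketch, assuming the bs4 and requests packages are installed:

# beautifulsoup4 sketch: parse_only replaces parseOnlyThese, and
# find_all(..., href=True) replaces the removed has_key() check.
import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.nytimes.com')
soup = BeautifulSoup(response.text, 'html.parser',
                     parse_only=SoupStrainer('a'))  # only parse <a> tags

for link in soup.find_all('a', href=True):          # each anchor with an href
    print(link['href'])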

EDIT: Just saw that you need to use Scrapy. I'm afraid I've not used it, but try looking at the official documentation; it looks like it has what you might be after.
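
For reference, Scrapy's selectors can also be used standalone, without writing a spider, which sounds like what the question is after; a minimal sketch assuming a recent Scrapy install, with the HTML passed in as a plain string:

# Standalone use of Scrapy's Selector: no crawler/spider needed.
# Assumes a recent Scrapy version (for the .getall() shorthand).
from scrapy.selector import Selector

html = """
<ul>
  <li>example</li>
  <li>example2</li>
</ul>
<a href="test">test</a>
"""

sel = Selector(text=html)

for item in sel.css('li::text').getall():     # text of each <li>
    print(item)

for href in sel.xpath('//a/@href').getall():  # every href value
    print(href)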
