
Scrapy start_urls in text file

I'm trying to crawl through a list of URLs and retrieve the h1 of each page. The URLs are stored in a text file. The code is:

from scrapy.contrib.spiders import CrawlSpider  # scrapy.spiders in newer Scrapy
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]
    f = open("locationlist.txt", 'r')
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        sel = Selector(response)

        title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        print title

The code crawls through the site but doesn't print anything. Any help would be appreciated.

Try to place this:

name = "sitemaplocation"
allowed_domains = ["xyz.nl"]
f = open("locationlist.txt",'r')
start_urls = [url.strip() for url in f.readlines()]
f.close()

into the __init__ method of the MySpider class.
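A minimal sketch of that suggestion, assuming the same file name as in the question (note the super() call, which keeps CrawlSpider's own initialization intact):

from scrapy.contrib.spiders import CrawlSpider  # scrapy.spiders in newer Scrapy

class MySpider(CrawlSpider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Read the URL list when the spider is instantiated; the
        # with-block closes the file automatically.
        with open("locationlist.txt") as f:
            self.start_urls = [url.strip() for url in f]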

Also, where do you call the parse function?

Try inheriting your spider from Spider instead of CrawlSpider:

When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.
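A minimal sketch of the Spider-based version, assuming the same file, domain, and XPath as in the question (written for Python 3 and current Scrapy, where Spider is importable from the top-level scrapy package):

from scrapy import Spider
from scrapy.selector import Selector

class MySpider(Spider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]

    with open("locationlist.txt") as f:
        start_urls = [url.strip() for url in f]

    # A plain Spider uses parse() as the default callback for every
    # start URL, so overriding it here is safe.
    def parse(self, response):
        title = Selector(response).xpath("//h1[@class='no-bd']/text()").extract()
        print(title)

Alternatively, if you stay with CrawlSpider, define rules whose callback has any name other than parse.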
