
Scrapy start_urls in text file

I'm trying to crawl through a list of URLs and retrieve the h1 of each page. The URLs are stored in a text file. The code is:

from scrapy.contrib.spiders import CrawlSpider  # scrapy.spiders in newer Scrapy
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]
    f = open("locationlist.txt", 'r')
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        sel = Selector(response)

        title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        print title

The code crawls through the site but doesn't print anything. Any help would be appreciated.

Try to place this:

name = "sitemaplocation"
allowed_domains = ["xyz.nl"]
f = open("locationlist.txt",'r')
start_urls = [url.strip() for url in f.readlines()]
f.close()

into the __init__ method of the MySpider class.
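A minimal sketch of that suggestion, assuming the same file name as in the question (note the super() call, which keeps CrawlSpider's own initialization intact):

from scrapy.contrib.spiders import CrawlSpider  # scrapy.spiders in newer Scrapy

class MySpider(CrawlSpider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Read the URL list when the spider is instantiated; the
        # with-block closes the file automatically.
        with open("locationlist.txt") as f:
            self.start_urls = [url.strip() for url in f]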

Also, where do you call the parse function?

Try inheriting your spider from Spider instead of CrawlSpider:

When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.
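A minimal sketch of the Spider-based version, assuming the same file, domain, and XPath as in the question (written for Python 3 and current Scrapy, where Spider is importable from the top-level scrapy package):

from scrapy import Spider
from scrapy.selector import Selector

class MySpider(Spider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]

    with open("locationlist.txt") as f:
        start_urls = [url.strip() for url in f]

    # A plain Spider uses parse() as the default callback for every
    # start URL, so overriding it here is safe.
    def parse(self, response):
        title = Selector(response).xpath("//h1[@class='no-bd']/text()").extract()
        print(title)

Alternatively, if you stay with CrawlSpider, define rules whose callback has any name other than parse.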
