
Scrapy: Store/scrape current start_url?

Background (can be skipped):

I am currently running two distinct Scrapy crawlers.

The first retrieves information about a product X, and the second retrieves additional information about product X that is found on a URL scraped by the first bot.

My pipeline concatenates each product's information into multiple text files: each product's information takes up one line of data and is broken up into multiple categories, each stored in a distinct text file.

Each bot obviously maintains information integrity, since all information is parsed one link at a time (hence each text file's information is aligned line-by-line with the other text files). However, I understand Scrapy uses a dynamic crawling mechanism that crawls websites based on their load time rather than their order in the start_urls list. Thus, my second crawler's information does not line up with the text files from the first crawler.

One easy work-around for this is to scrape a "primary key" (MySQL fanboys) piece of information that is found by both bots; product information can then be aligned in a table by sorting the primary keys alphabetically, i.e. aligning the data manually.

My current project leaves me in a difficult spot in terms of finding a primary key, however. The second crawler crawls websites with limited unique information, so my only shot at linking its findings back to the first crawler is to use the URL identified by the first crawler and link it to the identical start_url in the second crawler.
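As an illustration of the alignment idea, here is a minimal post-processing sketch that joins the two crawlers' output on a shared URL key, assuming each crawler writes tab-separated lines whose first field is the URL (the file names and layout are hypothetical, not part of my actual pipeline):

    # Hypothetical post-processing step: join the two crawlers' output by URL.
    def load_rows(path):
        rows = {}
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                rows[fields[0]] = fields[1:]  # key on the URL in the first column
        return rows

    crawler1 = load_rows("crawler1_output.txt")
    crawler2 = load_rows("crawler2_output.txt")

    # Sort on the URL "primary key" so both data sets line up row by row.
    for url in sorted(crawler1):
        print("\t".join([url] + crawler1[url] + crawler2.get(url, [])))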


Problem:

Is there a way to assign the start_url being crawled in each iteration of the HtmlXPathSelector to a variable that can then be pushed into the pipeline along with the item/field data crawled from that particular URL (for cases where the URL cannot be found in the page source)?

Here is my code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from Fleche_Noire.items import FlecheNoireItem
    import codecs

    class siteSpider(BaseSpider):
        name = "bbs"
        # allowed_domains should contain domain names only, without the http:// scheme
        allowed_domains = ["samplewebsite.abc"]
        start_urls = [
            'http://www.samplewebsite.abc/prod1',
            'http://www.samplewebsite.abc/prod2',
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            items = []
            item = FlecheNoireItem()
            item["brand"] = []
            item["age"] = []
            # Fall back to a blank entry so the output lines stay aligned
            item["prodcode"] = hxs.select('//h1/text()').extract() or [' ']
            item["description1"] = []
            item["description2"] = []
            item["product"] = []
            item["availability"] = []
            item["price"] = []
            item["URL"] = []
            item["imgurl"] = []
            items.append(item)
            return items

I'd like to be able to store the start_url as an item, just like the h1 text found on the page.

Thank you!

You can get it from response.url, or, in case of redirections, even from response.request.url, meaning:

item["start_url"] = response.request.url
