
List elements retrieved by XPath in Scrapy do not output correctly item by item (for, yield)

I output the URL of the first page of each exhibitor's order-results page, extracted from a specific EC site, to a CSV file, read that file in start_requests, and loop over it with a for statement.

Each order-results page contains information on 30 products.

https://www.buyma.com/buyer/2597809/sales_1.html

item page

I specified the links for the 30 items on each order-results page, tried to retrieve them one by one, and stored them in the item as shown in the code below, but it does not work.

import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

from researchtool.items import ResearchtoolItem  # assumed import path for the project's item class


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 300):
                    url = row[2][:-5] + '/sales_' + str(n) + '.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True,
                    )

    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)

        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()

        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])

            yield loader.load_item()


The output is as follows, where each item contains multiple pieces of information at once.

Current status: {"product_name": ["product1", "product2"], "Conversion_date": ["Conversion_date1", "Conversion_date2"], "product_URL": ["product_URL1", "product_URL2"]}

Ideal: [{"product_name": "product1", "Conversion_date": "Conversion_date1", "product_URL": "product_URL1"}, {"product_name": "product2", "Conversion_date": "Conversion_date2", "product_URL": "product_URL2"}]

This is probably due to my lack of understanding of basic for statements and yield.

You need to create a new loader on each iteration:

for i in range(30):
    # A fresh loader per product, so each yielded item holds one product's fields
    loader = ItemLoader(item=ResearchtoolItem(), response=response)
    loader.add_value("Conversion_date", Conversion_date[i])
    loader.add_value("product_name", product_name[i])
    loader.add_value("product_URL", product_URL[i])

    yield loader.load_item()
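For reference, the one-item-per-product grouping you want can also be seen with a plain `zip` over the three extracted lists, independent of Scrapy. This is only a sketch; the placeholder lists below stand in for the `.getall()` results and are not real site data:

```python
# Placeholder data standing in for the three .getall() result lists
conversion_dates = ["Conversion_date1", "Conversion_date2"]
product_names = ["product1", "product2"]
product_urls = ["product_URL1", "product_URL2"]

# zip walks the parallel lists in lockstep, yielding one tuple per product,
# so each resulting dict holds exactly one product's fields
items = [
    {"product_name": n, "Conversion_date": d, "product_URL": u}
    for d, n, u in zip(conversion_dates, product_names, product_urls)
]
```

Using `zip` also stops the loop at the length of the shortest list, which avoids the `IndexError` that `range(30)` would raise on a page with fewer than 30 products.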
