JSON data order gets mixed up when scraping multiple URLs with Scrapy

Question

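I'm new to Scrapy. I wrote a script to scrape data from a website; it works fine, and the resulting JSON file looks perfect. Now, when I use the same script to scrape multiple URLs (on the same site), it still works and I get the data for every URL in the JSON file, but there is a bug. The output structure (as coded in the script) looks like this: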

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:},  #URL1
{attribute:} #URL1
]

When I scrape 2 URLs, I get this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]

That is still fine, but when I add more URLs, the structure gets mixed up and becomes this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]

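If you look closely, you'll notice that the title for the third URL appears directly below the title for the second one, before the rest of the second URL's data. Can anyone help?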

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                        "desc": res,
                    }
                except:
                    continue

        for post in response.css(".lie-one-canshu"):
            try:
                yield {"attribute": post.css(".lie-one-canshu::text")[0].get().strip()}
            except:
                continue

Update: I noticed that the error is not consistent; sometimes I run the script and the result is fine.

Answer 1

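Scrapy is asynchronous, so there is no guarantee about the order in which items are processed or output, at least not out of the box. Responses for different URLs can arrive and be parsed in an interleaved fashion, which is why the result varies from run to run. If you want all of the output from a single URL to appear together, I suggest yielding only one item per call to the parse method.

For example: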

def parse(self, response):
    # Collect every record for this response in one container item.
    results = {
        "items": [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results["items"].append({
                    "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                    "desc": res,
                })
            except:
                continue
            
        
    for post in response.css(".lie-one-canshu"):
        try:
            results["items"].append({
                "attribute": post.css(".lie-one-canshu::text")[0].get().strip()
            })
        except:
            continue
    # Yield a single item per response so its records cannot interleave with other URLs.
    yield results
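
To see why grouping helps, here is a minimal simulation that uses only the standard library's asyncio, not Scrapy itself. The names PAGES, fetch_and_parse and crawl, as well as the latencies, are made up for illustration. Each simulated page returns exactly one item that contains all of its records, so the pages may finish in any order while the records of a given URL stay together:

```python
import asyncio

# Hypothetical stand-ins for three product pages: (url, simulated latency in seconds).
PAGES = [("URL1", 0.03), ("URL2", 0.01), ("URL3", 0.02)]


async def fetch_and_parse(url, delay):
    """Pretend to download a page, then return ONE item holding all of its records."""
    await asyncio.sleep(delay)  # simulated network latency, different for each page
    return {
        "url": url,
        "items": [
            {"title": f"title of {url}"},
            {"descTitle": f"section of {url}"},
            {"attribute": f"attribute of {url}"},
        ],
    }


async def crawl():
    # Start all downloads concurrently, then collect results as each one finishes.
    tasks = [asyncio.create_task(fetch_and_parse(url, delay)) for url, delay in PAGES]
    return [await finished for finished in asyncio.as_completed(tasks)]


output = asyncio.run(crawl())
for item in output:
    # Print the URL followed by the keys of its records, in the order they were added.
    print(item["url"], [next(iter(record)) for record in item["items"]])
```

The order of the three printed lines is not something to rely on, but each line always lists title, descTitle and attribute for one URL only. If the final file must follow the order of start_urls, one option is to store the URL in each item, as above, and sort the exported list afterwards.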

Solution 1
1 accepted, 2022-08-20 01:20:00
