How to modify the dictionary output

Question

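I am scraping news from a page with Scrapy. Each news item is basically a title, a meta text, and a text summary. The code runs fine, but I have a problem with my dictionary output. The output first shows all the titles, then all the meta texts, and finally all the text summaries. What I need instead is one news item after another, each with its own title, meta text, and summary. I guess something is wrong with the for loop or the selectors?

Thanks for your help!

My code: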

import scrapy
class testspider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://oilprice.com/Latest-Energy-News/World-News']    

    def parse(self, response):
        all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')

        for singlenews in all_news:         
            title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
            meta_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
            extract_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

            yield {
                'title_data' : title_item,
                'meta_data' :  meta_item,
                'extract_data' : extract_item        
            }

Output:

{'title_data': ['Global Energy-Related CO2 Emissions Stopped Rising In 2019', 'BHP
 Is Now The World’s Top Copper Miner', 'U.S. Budget Proposal Includes Sale Of 15 
Mln Barrels Strategic Reserve Oil', ... , '**meta_data**': ['Feb 11, 2020 at 12:02
 | Tsvetana Paraskova', 'Feb 11, 2020 at 11:27 | MINING.com ', 'Feb 11, 2020 at 
09:59 | Irina Slav', ... , '**extract_data**': ['The world’s energy-related carbon
 dioxide (CO2) emissions remained flat in 2019, halting two years of emissions 
increases, as lower emissions in advanced economies offset growing emissions
 elsewhere, the International Energy…', 'BHP Group on Monday became the world’s 
largest copper miner based on production after Chile’s copper commission announced 
a slide in output at state-owned Codelco.\r\nHampered by declining grades 
Codelco…', 'The budget proposal President Trump released yesterday calls for the 
sale of 15 million barrels of oil from the Strategic Petroleum Reserve of the 
United States.\r\nThe proceeds from the…', ... , ']}

Answer 1

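From your output, it looks like your code extracts title, meta_data, and extract_data all at once and stores them in a single dictionary. If you want one dictionary per news item on the site you are scraping, you should first collect all the data you need and then assemble it into dictionaries as required. So your code would look something like this: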

def parse(self, response):
    all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')  
    titles = all_news.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
    meta_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
    extract_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

    # At this point titles, meta_items and extract_items should be three parallel lists of the same length, so entries at the same index belong to the same news item.

    news_items = []
    for i in range(len(titles)): 
        news = { 'title': titles[i], 'meta_data': meta_items[i], 'extract_data': extract_items[i] }
        news_items.append(news)
    return news_items

This should return the news posts the way you need them.
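The pairing step can also be written with zip, which avoids manual indexing. This is only a minimal sketch: build_news_items is a hypothetical helper (not part of Scrapy), the sample strings are placeholders rather than data from the real site, and it assumes that entries at the same index belong to the same article.

```python
def build_news_items(titles, meta_items, extract_items):
    """Combine three parallel lists into one dict per news item."""
    # Assumption: entries at the same index belong to the same article.
    # Fail loudly if the lists are misaligned instead of silently dropping items.
    if not (len(titles) == len(meta_items) == len(extract_items)):
        raise ValueError("field lists differ in length; cannot pair them by index")
    return [
        {'title': title, 'meta_data': meta, 'extract_data': extract}
        for title, meta, extract in zip(titles, meta_items, extract_items)
    ]


# Placeholder data, not taken from the real site.
sample_titles = ['Title A', 'Title B']
sample_meta = ['Meta A', 'Meta B']
sample_extracts = ['Summary A', 'Summary B']

news_items = build_news_items(sample_titles, sample_meta, sample_extracts)
for item in news_items:
    print(item)
```

The explicit length check matters because a selector that returns an extra text node for one article would otherwise shift every later item out of alignment.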

Answer 2

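When an XPath expression starts with //, the search runs over the whole document, even when it is called on a sub-selector such as singlenews. So the line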

title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()

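returns a list containing the text of every link inside every div on the page that matches the filter div[@class="categoryArticle__content"], not only the ones inside singlenews.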

What you need is a path that is relative to singlenews. Try something like this:

title_item = singlenews.xpath('./div[@class="categoryArticle__content"]//a//text()').extract()

Reference: https://devhints.io/xpath

Answer 1: 2, accepted, 2020-02-11 20:22:03

Answer 2: 0, 2020-02-11 20:21:25
