scrapy json output all items on one line

I'm trying to get my output to look like the following, in JSON format:

{"loser": "De Schepper K.", "winner": "Herbert P.", "url": "https://www.sofascore.com/tennis/2018-02-07"}

But I'm currently getting a separate line for each loser item and each winner item. I would like both the winner and the loser to be on the same line as the url.

{"loser": "De Schepper K.", "url": "https://www.sofascore.com/tennis/2018-02-07"}
{"winner": "Herbert P.", "url": "https://www.sofascore.com/tennis/2018-02-07"}
{"loser": "Sugita Y.", "url": "https://www.sofascore.com/tennis/2018-02-07"}

I'm not sure whether my selectors are causing this behaviour, but I'd like to know how I can customise the pipeline so that the loser, winner and date are all on the same JSON line.

I've never produced JSON output before, so it's new to me. How do you specify which JSON keys and values appear on each line using a custom pipeline?

I also tried to use the CSV item exporter to do this and got strange behaviour there too. See: "Scrapy output is showing empty rows per column".

Here's my spider.py:

import scrapy
from scrapy_splash import SplashRequest
from scrapejs.items import SofascoreItemLoader
from scrapy import Spider

import json
from scrapy.http import Request, FormRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["https://www.sofascore.com/tennis/2018-02-07"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5})

    def parse(self, response):
        for row in response.css('.event-team'):
            il = SofascoreItemLoader(selector=row)
            il.add_css('winner', '.event-team:nth-child(2)::text')
            il.add_css('loser', '.event-team:nth-child(1)::text')
            il.add_value('url', response.url)

            yield il.load_item()

items.py

import scrapy

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from operator import methodcaller
from scrapy import Spider, Request, Selector

class SofascoreItem(scrapy.Item):
    loser = scrapy.Field()
    winner = scrapy.Field()
    url = scrapy.Field()



class SofascoreItemLoader(ItemLoader):
    default_item_class = SofascoreItem
    default_input_processor = MapCompose(methodcaller('strip'))
    default_output_processor = TakeFirst()

pipeline.py

import json
import codecs
from collections import OrderedDict

class JsonPipeline(object):

    def __init__(self):
        self.file = codecs.open('data_utf8.json', 'w', encoding='utf-8')

    def process_item(self , item , spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False,
                          sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self , spider):
        self.file.close()

I took another look at your question, and I think I've found where the problem is:

for row in response.css('.event-team'):

With the above line, you get many Selectors (a SelectorList). However, from each Selector, or row, you can only get one field: winner or loser. You can't get both.

That's why there are empty rows in your output.

Solution: try the following line instead:

for row in response.css('div[class="cell__section--main s-tennisCell curb-width"]'):
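
On the pipeline side of the question: a pipeline like the one above writes one line per item it receives, so an item that holds only one of the two names produces a line with only that name. The following is a minimal standard-library sketch of that behaviour, not Scrapy's own API. `JsonLinesWriter`, `item_to_json_line` and `FIELD_ORDER` are hypothetical names introduced for illustration, and the in-memory buffer stands in for the output file.

```python
import io
import json

# Hypothetical key order, chosen only for illustration.
FIELD_ORDER = ("loser", "winner", "url")


def item_to_json_line(item, fields=FIELD_ORDER):
    """Serialize one item as a single JSON line with keys in a fixed order."""
    # Keys missing from the item are skipped rather than written as null.
    ordered = {key: item[key] for key in fields if key in item}
    return json.dumps(ordered, ensure_ascii=False) + "\n"


class JsonLinesWriter:
    """Stand-in for the pipeline: each process_item call writes exactly one line."""

    def __init__(self, file):
        self.file = file

    def process_item(self, item, spider=None):
        self.file.write(item_to_json_line(dict(item)))
        return item


URL = "https://www.sofascore.com/tennis/2018-02-07"
buffer = io.StringIO()
writer = JsonLinesWriter(buffer)

# A complete item comes out as one line holding all three keys.
writer.process_item({"url": URL, "winner": "Herbert P.", "loser": "De Schepper K."})
# An item with only one name still produces its own, shorter line.
writer.process_item({"loser": "Sugita Y.", "url": URL})

output = buffer.getvalue()
print(output, end="")
```

The writer cannot merge two half-filled items on its own, which is why the fix belongs in the spider's loop rather than in the pipeline.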

The problem here is that you're looping over .event-team elements.
Each of these elements can only be either the winner or the loser, so you get a separate item for each.

What you should do instead is loop over elements that contain both (.list-event seems like a good candidate) and extract both the winner and the loser from those.

This way, you'd have one loop iteration per event and, as a result, one item per event.
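
A rough sketch of that idea, using only the standard library so it runs without Scrapy or Splash. The markup is a simplified assumption about the page, not its real structure, and "Player X" is a placeholder name. In the spider, the same shape would be an outer `response.css(...)` loop over the container with both `add_css` calls applied to each container.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real page structure is an assumption.
SAMPLE = """
<div class="events">
  <div class="list-event">
    <div class="event-team">De Schepper K.</div>
    <div class="event-team">Herbert P.</div>
  </div>
  <div class="list-event">
    <div class="event-team">Sugita Y.</div>
    <div class="event-team">Player X</div>
  </div>
</div>
"""

URL = "https://www.sofascore.com/tennis/2018-02-07"


def extract_events(markup, url):
    """Return one dict per event container, holding both players and the URL."""
    root = ET.fromstring(markup)
    items = []
    # Loop over the element that wraps BOTH teams, not over each team.
    for event in root.findall(".//div[@class='list-event']"):
        teams = [t.text.strip() for t in event.findall("div[@class='event-team']")]
        if len(teams) != 2:
            continue  # skip containers that do not hold exactly two teams
        # Same order as nth-child(1) / nth-child(2) in the question's spider.
        loser, winner = teams
        items.append({"loser": loser, "winner": winner, "url": url})
    return items


events = extract_events(SAMPLE, URL)
for item in events:
    print(json.dumps(item, ensure_ascii=False))
```

Each container yields exactly one dict, so each output line carries the loser, the winner and the url together.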
