Scrapy output items - multiple parse methods, one row per item

I am continuing a Scrapy project from an earlier question: "scrapy output item as 1 list element per row". My Scrapy code returns data from UFC events in one parse method, and then returns totals and round-by-round data for each match of the event in an additional parse method (reached through a separate link).

The scraped data returned in the resulting csv file is correct, but the formatting is a problem:

event_name  event_date  event_loc   attendance  wclass  method  mthdtl  finround    fintime winner  loser   bout    fighters    method_txt  mthdtl_txt  m_finround  m_fintime   timefrmt    ref w_kd    l_kd    w_sigstr    l_sigstr    w_sigstr_perc   l_sigstr_perc   w_tot_str   l_tot_str   w_td    l_td    w_td_perc   l_td_perc   w_sub_att   l_sub_att   w_pass  l_pass  w_rev   l_rev   r1_w_kd r1_w_tot_str    r1_w_td r1_w_td_perc    r1_w_sub_att    r1_w_pass   r1_w_rev    r1_l_kd r1_l_tot_str    r1_l_td r1_l_td_perc    r1_l_sub_att    r1_l_pass   r1_l_rev    r1_w_sigstr r1_l_sigstr r1_w_sigstr_perc    r1_w_sigstr_perc    r1_w_sigstr_head    r1_l_sigstr_head    r1_w_sigstr_body    r1_l_sigstr_body    r1_w_sigstr_leg r1_l_sigstr_leg r1_w_sigstr_dist    r1_l_sigstr_dist    r1_w_sigstr_clinch  r1_l_sigstr_clinch  r1_w_sigstr_ground  r1_l_sigstr_ground  r2_w_kd r2_w_tot_str    r2_w_td r2_w_td_perc    r2_w_sub_att    r2_w_pass   r2_w_rev    r2_l_kd r2_l_tot_str    r2_l_td r2_l_td_perc    r2_l_sub_att    r2_l_pass   r2_l_rev    r2_w_sigstr r2_l_sigstr r2_w_sigstr_perc    r2_w_sigstr_perc    r2_w_sigstr_head    r2_l_sigstr_head    r2_w_sigstr_body    r2_l_sigstr_body    r2_w_sigstr_leg r2_l_sigstr_leg r2_w_sigstr_dist    r2_l_sigstr_dist    r2_w_sigstr_clinch  r2_l_sigstr_clinch  r2_w_sigstr_ground  r2_l_sigstr_ground  r3_w_kd r3_w_tot_str    r3_w_td r3_w_td_perc    r3_w_sub_att    r3_w_pass   r3_w_rev    r3_l_kd r3_l_tot_str    r3_l_td r3_l_td_perc    r3_l_sub_att    r3_l_pass   r3_l_rev    r3_w_sigstr r3_l_sigstr r3_w_sigstr_perc    r3_w_sigstr_perc    r3_w_sigstr_head    r3_l_sigstr_head    r3_w_sigstr_body    r3_l_sigstr_body    r3_w_sigstr_leg r3_l_sigstr_leg r3_w_sigstr_dist    r3_l_sigstr_dist    r3_w_sigstr_clinch  r3_l_sigstr_clinch  r3_w_sigstr_ground  r3_l_sigstr_ground  r4_w_kd r4_w_tot_str    r4_w_td r4_w_td_perc    r4_w_sub_att    r4_w_pass   r4_w_rev    r4_l_kd r4_l_tot_str    r4_l_td r4_l_td_perc    r4_l_sub_att    r4_l_pass   r4_l_rev    r4_w_sigstr r4_l_sigstr 
r4_w_sigstr_perc    r4_w_sigstr_perc    r4_w_sigstr_head    r4_l_sigstr_head    r4_w_sigstr_body    r4_l_sigstr_body    r4_w_sigstr_leg r4_l_sigstr_leg r4_w_sigstr_dist    r4_l_sigstr_dist    r4_w_sigstr_clinch  r4_l_sigstr_clinch  r4_w_sigstr_ground  r4_l_sigstr_ground  r5_w_kd r5_w_tot_str    r5_w_td r5_w_td_perc    r5_w_sub_att    r5_w_pass   r5_w_rev    r5_l_kd r5_l_tot_str    r5_l_td r5_l_td_perc    r5_l_sub_att    r5_l_pass   r5_l_rev    r5_w_sigstr r5_l_sigstr r5_w_sigstr_perc    r5_w_sigstr_perc    r5_w_sigstr_head    r5_l_sigstr_head    r5_w_sigstr_body    r5_l_sigstr_body    r5_w_sigstr_leg r5_l_sigstr_leg
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Heavyweight,,   KO/TKO  Punches 4   04:09   Stipe Miocic    Daniel Cormier                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Welterweight,   U-DEC       3   05:00   Nate Diaz   Anthony Pettis
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Middleweight,,  U-DEC       3   05:00   Paulo Costa Yoel Romero                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                            Welterweight Bout   Anthony Pettis,Nate Diaz    Decision - Unanimous        3   05:00   3 Rnd (5-5-5)   Mike Beltran,Guilherme Bravo,Derek Cleary,Ron McCarthy  0   1   69 of 133   114 of 201  51% 56% 86 of 153   205 of 306  0 of 0  1 of 1  0%  100%    1   0   0   4   2   1   0   23 of 41    0 of 0  0%  1   0   0   0   62 of 88    1 of
                                                                                                                                                                                                            14 of 31    22 of 42    45% 45% 9 of 22 15 of 33    2 of 2  5 of 6  3 of 7  2 of 3  9 of 24 9 of 23 5 of 7  6 of 9  0 of 0  7 of 10 0   40 of 70    0 of 0  0%  0   0   0   0   65 of 114   0 of 0  0%  0   0   0   36 of 66    54 of 100   54% 54% 28 of 55    45 of 87    7 of 9  7 of 11 1 of 2  2 of 2  26 of 54    29 of 63    10 of 12    25 of 37    0 of 0  0 of 0  0   23 of 42    0 of 0  0%  0   0   2   1   78 of 104   0 of 0  0%  0   2   1   19 of 36    38 of 59    52% 52% 17 of 34    34 of 52    1 of 1  4 of 6  1 of 1  0 of 1  11 of 24    13 of 23    5 of 8  12 of 17    3 of 4  13 of 19                                                                                                                                                                                                                        
                                            Middleweight Bout   Yoel Romero,Paulo Costa Decision - Unanimous        3   05:00   3 Rnd (5-5-5)   Jason Herzog,Guilherme Bravo,Ron McCarthy,Michael Bell  1   1   125 of 284  118 of 213  44% 55% 125 of 284  118 of 213  1 of 4  0 of 0  25% 0%  0   0   0   0   0   0   1   32 of 69    0 of 2  0%  0   0   0   1   37 of 69    0 of
                                                                                                                                                                                                            32 of 69    37 of 69    46% 46% 23 of 54    19 of 46    2 of 7  16 of 20    7 of 8  2 of 3  31 of 68    32 of 61    1 of 1  2 of 2  0 of 0  3 of 6  0   40 of 91    1 of 1  100%    0   0   0   0   37 of 71    0 of 0  0%  0   0   0   40 of 91    37 of 71    43% 43% 28 of 77    24 of 53    6 of 7  12 of 17    6 of 7  1 of 1  39 of 90    36 of 70    1 of 1  1 of 1  0 of 0  0 of 0  0   53 of 124   0 of 1  0%  0   0   0   0   44 of 73    0 of 0  0%  0   0   0   53 of 124   44 of 73    42% 42% 45 of 113   24 of 49    3 of 6  18 of 21    5 of 5  2 of 3  48 of 118   42 of 71    5 of 6  2 of 2  0 of 0  0 of 0                                                                                                                                                                                                                      
                                            UFC Heavyweight Title Bout  Daniel Cormier,Stipe Miocic KO/TKO  Punches to Head At Distance 4   04:09   5 Rnd (5-5-5-5-5)   Herb Dean   0   1   181 of 263  123 of 229  68% 53% 230 of 317  135 of 244  1 of 3  1 of 3  33% 33% 0   0   2   0   0   0   0   71 of 83    1 of 2  50% 0   2   0   0   9 of 18 0 of
                                                                                                                                                                                                            37 of 46    7 of 13 80% 80% 25 of 34    3 of 8  7 of 7  0 of 0  5 of 5  4 of 5  13 of 16    6 of 12 3 of 3  0 of 0  21 of 27    1 of 1  0   59 of 85    0 of 0  0%  0   0   0   0   48 of 84    0 of 0  0%  0   0   0   56 of 82    46 of 82    68% 68% 56 of 81    37 of 72    0 of 0  8 of 9  0 of 1  1 of 1  45 of 68    42 of 76    11 of 14    4 of 6  0 of 0  0 of 0  0   69 of 100   0 of 1  0%  0   0   0   0   40 of 73    1 of 3  33% 0   0   0   57 of 86    34 of 67    66% 66% 53 of 82    28 of 61    1 of 1  5 of 5  3 of 3  1 of 1  50 of 76    24 of 50    7 of 10 10 of 17    0 of 0  0 of 0  0   31 of 49    0 of 0  0%  0   0   0   1   38 of 69    0 of 0  0%  0   0   0   31 of 49    36 of 67    63% 63% 28 of 46    18 of 47    1 of 1  14 of 16    2 of 2  4 of 4  31 of 49    30 of 57    0 of 0  5 of 5  0 of 0  1 of 5

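First, the items from the first and second parse methods appear on separate rows. The second set of items is a kind of subset and shows up as a separate block, entirely to the right of and below the items from the first parse method.

Second, within the items from the second parse method (below and to the right of the first block of item rows), the items skip a row to accommodate the round-by-round data from the if-elif-else conditional. That data sits in between those rows. I am using Items and Item Loaders, but I am not currently using any custom item pipeline. I run the spider from the command line and output to csv: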

 scrapy crawl stats -o stats.csv

Abbreviated code:

import scrapy
from scrapy.loader import ItemLoader

# StatsItem and the round_N_items classes are the project's own Item definitions.


class StatsSpider(scrapy.Spider):
    name = 'stats'
    allowed_domains = ['ufcstats.com']
    start_urls = ['http://ufcstats.com/statistics/events/completed?page=all']
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }
    #ITEM_PIPELINES = {'stats.pipelines.StatsPipeline': 300,}
    custom_settings = {  # specifies exported fields and order
        'FEED_EXPORT_FIELDS': [ *extensive feed_export_fields* ]}

    def parse(self, response):
        rev_orderd_events = response.css('tr.b-statistics__table-row')
        # full event_links
        # event_links = rev_orderd_events.css('i>a::attr(href)').extract()
        # for url in event_links:
        #     yield scrapy.Request(url=event_links, callback=self.parse_event)
        event_links = rev_orderd_events.css('i>a::attr(href)')[3].extract()
        # for links in event_links:
        #     yield scrapy.Request(url=links,callback=self.parse_event)
        yield scrapy.Request(url=event_links, callback=self.parse_event, dont_filter=True)

    def parse_event(self, response):
        pg = response.css('div.l-page__container')
        for event in response.css('div.b-fight-details'):
            event_name = pg.css('h2.b-content__title>span::text').extract_first()
            event_date = event.css('ul.b-list__box-list>li:nth-child(1)::text').extract()
            event_loc  = event.css('ul.b-list__box-list>li:nth-child(2)::text').extract()
            attendance = event.css('ul.b-list__box-list>li:nth-child(3)::text').extract()
            # ...child(odd)::text').extract()   (the start of this line is missing in the post)
            for fights in event.css('tr')[1:]:
                # First yield: one item (one csv row) per fight on the event page
                il = ItemLoader(StatsItem(), selector=fights)
                il.add_value('event_name', event_name)
                il.add_value('event_date', event_date)
                il.add_value('event_loc', event_loc)
                il.add_value('attendance', attendance)
                il.add_css('winner', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(odd)>a::text')
                il.add_css('loser', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(even)>a::text')
                il.add_css('wclass','td.b-fight-details__table-col:nth-child(7)>p:nth-child(1)::text')
                il.add_css('method','td.b-fight-details__table-col:nth-child(8)>p:nth-child(odd)::text')
                il.add_css('mthdtl','td.b-fight-details__table-col:nth-child(8)>p:nth-child(even)::text')
                il.add_css('finround','td.b-fight-details__table-col:nth-child(9)>p:nth-child(odd)::text')
                il.add_css('fintime','td.b-fight-details__table-col:nth-child(10)>p:nth-child(odd)::text')
                yield il.load_item()

        match_links = pg.css('tr>td:nth-child(1) a::attr(href)').extract()
        for links in match_links:
            yield scrapy.Request(url=links, callback=self.parse_match)

    def parse_match(self, response):
        section = response.css('section.b-statistics__section_details')
        f_dtl = section.css('div.b-fight-details')
        # m_event = section.css('h2>a::text').extract()
        m_info   = f_dtl.css('div.b-fight-details__fight div i::text').extract()
        m_fin_dtl    = f_dtl.css('div.b-fight-details__content>p::text').extract()
        ref =  f_dtl.css('div.b-fight-details__content i>span::text').extract()
        #table_rows  = f_dtl.css('tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        #timefrmt = f_dtl.css('div.b-fight-details__fight div i::text')[15].extract()
        fighters = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p>a::text').extract()
        m_totals = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        rounds = f_dtl.css('table:nth-child(2) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()

        for info in section:
            # Second yield: match totals, emitted as a separate item (a separate csv row)
            il = ItemLoader(StatsItem(), selector=section)
            il.add_value('bout', m_info)
            il.add_value('method_txt', m_info)
            il.add_value('mthdtl_txt' , m_fin_dtl)
            il.add_value('m_finround' , m_info)
            il.add_value('m_fintime', m_info)
            il.add_value('timefrmt', m_info)
            il.add_value('ref', ref)
            il.add_value('fighters', fighters)

            il.add_value('w_kd',  m_totals)
            il.add_value('w_sigstr',  m_totals)
            il.add_value('w_sigstr_perc',  m_totals)
            il.add_value('w_tot_str',  m_totals)
            il.add_value('w_td',  m_totals)
            il.add_value('w_td_perc',  m_totals)
            il.add_value('w_sub_att',  m_totals)
            il.add_value('w_pass',  m_totals)
            il.add_value('w_rev',  m_totals)
            il.add_value('l_kd',  m_totals)
            il.add_value('l_sigstr',  m_totals)
            il.add_value('l_sigstr_perc',  m_totals)
            il.add_value('l_tot_str',  m_totals)
            il.add_value('l_td',  m_totals)
            il.add_value('l_td_perc',  m_totals)
            il.add_value('l_sub_att',  m_totals)
            il.add_value('l_pass',  m_totals)
            il.add_value('l_rev',  m_totals)

            il.add_value('r1_w_kd',  rounds)
            # il.add_value('r1_w_sigstr',  rounds)
            # il.add_value('r1_w_sigstr_perc',  rounds)
            il.add_value('r1_w_tot_str',  rounds)
            il.add_value('r1_w_td',  rounds)
            il.add_value('r1_w_td_perc',  rounds)
            il.add_value('r1_w_sub_att',  rounds)
            il.add_value('r1_w_pass',  rounds)
            il.add_value('r1_w_rev',  rounds)
            il.add_value('r1_l_kd',  rounds)
            # il.add_value('r1_l_sigstr',  rounds)
            # il.add_value('r1_l_sigstr_perc',  rounds)
            il.add_value('r1_l_tot_str',  rounds)
            il.add_value('r1_l_td',  rounds)
            il.add_value('r1_l_td_perc',  rounds)
            il.add_value('r1_l_sub_att',  rounds)
            il.add_value('r1_l_pass',  rounds)
            il.add_value('r1_l_rev',  rounds)
            yield il.load_item()

            # Third yield: round-by-round data, again emitted as its own item
            if len(rounds) == 42:
                r1 = ItemLoader(round_1_items(), selector=section)
                r1...
                yield r1.load_item()

            elif len(rounds) == 84:
                r2 = ItemLoader(round_2_items(), selector=section)
                r2...
                yield r2.load_item()

            elif len(rounds) == 126:
                r3 = ItemLoader(round_3_items(), selector=section)
                r3...
                yield r3.load_item()

            elif len(rounds) == 168:
                r4 = ItemLoader(round_4_items(), selector=section)
                r4...
                yield r4.load_item()

            elif len(rounds) == 210:
                r5 = ItemLoader(round_5_items(), selector=section)
                r5....
                yield r5.load_item()

            else:
                il = ItemLoader(StatsItem(), selector=section)
                il.add_value('rounders', rounds)
                yield il.load_item()

I would like each item to be output as a single csv row. So if the current csv output looks like this:

1 (block of rows)
  2a
    2b (alternating total/round detail rows)

I would like my csv to be:

1 - 2a - 2b...

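It took me a while to understand your issue/question, so apologies if my answer is off the mark.

Scrapy writes a new row to the output every time you yield an item, so you should only yield once you have a complete StatsItem. If your data has to be parsed from two different pages, you can create your item in parse_event and pass it, partially filled, to the parse_match function using cb_kwargs (introduced in scrapy-1.7) or the Request meta parameter.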

So in parse_event you would have

yield scrapy.Request(..., callback=self.parse_match, 
                     cb_kwargs={'item': il.load_item()})

Then you can modify parse_match to take item as a parameter

def parse_match(self, response, item):
    ...
    # Later on
    il = ItemLoader(item, selector=section)
    # Fill rest of item
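
The effect of yielding several partial items can be reproduced without Scrapy. The sketch below is a stand-in that uses the standard library's csv.DictWriter rather than Scrapy's own CSV exporter, and the four field names and sample values are a hypothetical subset of the real FEED_EXPORT_FIELDS. It only aims to show that every item written becomes its own row, with empty cells for the fields it lacks, while a single merged item fills one row.

```python
import csv
import io

# Hypothetical subset of FEED_EXPORT_FIELDS, for illustration only.
FIELDS = ["event_name", "winner", "bout", "ref"]


def to_csv(items):
    """Write one CSV row per item, leaving missing fields empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buf.getvalue()


# Illustrative values standing in for what each parse method collects.
event_part = {"event_name": "UFC 241", "winner": "Nate Diaz"}
match_part = {"bout": "Welterweight Bout", "ref": "Mike Beltran"}

# Two separate items: the second row starts where the first one's fields end.
staggered = to_csv([event_part, match_part])

# One merged item: everything lands on a single row.
merged = to_csv([{**event_part, **match_part}])

print(staggered)
print(merged)
```

The staggered output has the same shape as the csv in the question: the match fields sit below and to the right of the event fields.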

In summary, try to only do yield il.load_item() once
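
The round-by-round data can be folded into the same item in a similar spirit, instead of being yielded from a separate loader per round count. The following is a minimal sketch, assuming 42 values per round (inferred from the 42/84/126/168/210 checks in the question). The function name flatten_rounds and the generated keys such as r1_v00 are hypothetical placeholders; in the real spider they would be mapped to the actual r1_..., r2_... field names.

```python
from typing import Dict, List, Sequence, Union

# Assumption: each round contributes 42 scraped values, as the
# length checks in the question suggest.
PER_ROUND = 42


def flatten_rounds(
    values: Sequence[str], per_round: int = PER_ROUND
) -> Dict[str, Union[str, List[str]]]:
    """Turn a flat list of per-round values into round-prefixed keys.

    Lists that are empty or not a whole number of rounds are returned
    untouched under 'rounders', mirroring the question's else branch.
    """
    if not values or len(values) % per_round:
        return {"rounders": list(values)}

    flat: Dict[str, Union[str, List[str]]] = {}
    for start in range(0, len(values), per_round):
        round_no = start // per_round + 1
        for pos, value in enumerate(values[start:start + per_round]):
            # Placeholder key; replace with the real field name for this position.
            flat[f"r{round_no}_v{pos:02d}"] = value
    return flat


# Usage: merge the result into the single item before the one final yield.
two_rounds = [str(i) for i in range(84)]
flat = flatten_rounds(two_rounds)
print(len(flat), flat["r1_v00"], flat["r2_v00"])
```

Because the result is a plain dict, its entries can be added to the one loader that also holds the event and totals fields, so the spider still yields only once per match.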

