Appending XPath Value to List Using For Loop in Scrapy

I'm trying to automate my HTML table scrape in Scrapy. This is what I have so far:

import scrapy
import pandas as pd

class XGSpider(scrapy.Spider):

    name = 'expectedGoals'

    start_urls = [
        'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
    ]

    def parse(self, response):

        matches = []

        for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):

            match = {
                'home': row.xpath('td[4]//text()').extract_first(),
                'homeXg': row.xpath('td[5]//text()').extract_first(),
                'score': row.xpath('td[6]//text()').extract_first(),
                'awayXg': row.xpath('td[7]//text()').extract_first(),
                'away': row.xpath('td[8]//text()').extract_first()
            }

            matches.append(match)

        x = pd.DataFrame(
            matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])

        yield x.to_csv("xG.csv", sep=",", index=False)

It works fine, but as you can see I am hardcoding the keys (home, homeXg, etc.) for the match object. I'd like to scrape the keys into a list and then initialize a dict with keys from that list. The problem is that I don't know how to loop through an XPath by index. As an example,

headers = []
for row in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr'):
    yield {
        'first': row.xpath('th[1]/text()').extract_first(),
        'second': row.xpath('th[2]/text()').extract_first()
    }

Is it possible to put th[1], th[2], th[3], etc. into a for loop, using the numbers as indexes, and append the values to a list? E.g.

row.xpath('th[i]/text()').extract_first() ?

Not tested, but this should work:

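# Map each header name to its 1-based column position (XPath positions start at 1).
columns = {}
header_cells = response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr/th')
for column_index, column_node in enumerate(header_cells, start=1):
    column_name = column_node.xpath('./text()').extract_first()
    columns[column_name] = column_index

# Initialize the result list once, before iterating over the rows.
matches = []
for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
    match = {}
    for column_name, column_index in columns.items():
        # Interpolate the position into the XPath to select the matching cell.
        cell_xpath = './td[{index}]//text()'.format(index=column_index)
        match[column_name] = row.xpath(cell_xpath).extract_first()
    matches.append(match)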
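
The core of this approach is that an XPath expression is an ordinary Python string, so the position can be formatted into it inside a loop. A minimal sketch of just that step follows. The helper name `column_xpaths` and the sample header list are hypothetical and used only for illustration. The sketch assumes the header order matches the cell order in each row.

```python
def column_xpaths(header_names, cell_tag="td"):
    """Return {header name: positional XPath} for each column.

    Hypothetical helper for illustration. XPath positions are 1-based,
    so enumerate() starts at 1 rather than 0.
    """
    return {
        name: f"./{cell_tag}[{position}]//text()"
        for position, name in enumerate(header_names, start=1)
    }


# Sample header names (assumed here; in a spider they would be scraped
# from the table's <thead> row).
headers = ["home", "homeXg", "score", "awayXg", "away"]
xpaths = column_xpaths(headers)

for name, xpath in xpaths.items():
    print(name, xpath)
```

Inside a spider, each row could then be turned into a dict with something like `{name: row.xpath(xp).extract_first() for name, xp in xpaths.items()}`, which avoids hardcoding the keys.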
