Extracting data from HTML table using scrapy: response.xpath() yields None

I've been building a web scraper in Python 3 using the Scrapy library, and I'm running into a problem I don't understand. I've successfully scraped other tables by using Inspect Element on the table to get the XPath. With this table, however, I can't figure out how to extract the data. I am new to HTML but not new to programming, so please correct me if I'm way off here.

An example of this web page would be: http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1

Inspecting the page and copying the XPath for the target table yields //*[@id="aspnetForm"]/table/tbody/tr[3]/td[1]/table/tbody/tr[1]/td/table/tbody/tr[3]/td/table

However, using this in a scrapy shell, response.xpath(target).extract() returns []. Trying to target any individual cell gives the same empty result. My intended result would be a dataframe or dictionary along the lines of {'Dwelling Units': 1, 'Year Built': 2010 ... }. Any help identifying where I'm going wrong, or how to get the data formatted this way, would be appreciated. Thanks!
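One likely cause, stated here as an assumption because the page's raw HTML is not shown: browsers insert <tbody> elements into the DOM when rendering a table, so an XPath copied from the inspector can contain tbody steps that do not exist in the HTML Scrapy downloads. The sketch below illustrates the effect on a small, hypothetical snippet. It uses the standard library's xml.etree.ElementTree rather than Scrapy so that it runs without network access; the same reasoning should apply to response.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw server response.
# There is no <tbody> here; a browser would add one when building the DOM.
RAW_HTML = """
<form id="aspnetForm">
  <table>
    <tr><td><strong>Year Built</strong> 2010</td></tr>
  </table>
</form>
"""

root = ET.fromstring(RAW_HTML)  # root is the <form> element

# Path in the style the inspector produces, including tbody: matches nothing.
with_tbody = root.findall("./table/tbody/tr/td")

# Same path with the tbody step removed: matches the cell.
without_tbody = root.findall("./table/tr/td")

print(len(with_tbody), len(without_tbody))
```

If this is what is happening, removing the tbody steps from the copied XPath, or writing a shorter expression against the raw HTML, should return the table.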

import scrapy


class ResidentialRecordsSpider(scrapy.Spider):
    name = "residential_records"

    start_urls = [
        'http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1',
    ]

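    def parse(self, response):
        # Visit every <td> inside the table whose width attribute is "90%".
        for record in response.xpath('//table[@width="90%"]//td'):
            # The label is the text of the <strong> child, if there is one.
            key = record.xpath('./strong/text()').extract_first(default='')
            # The value is the first text node directly inside the cell.
            value = record.xpath('./text()').extract_first(default='')

            yield {key: value}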

With this, you only need to do some data cleaning on the yielded items.
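A minimal sketch of what that cleaning step might look like, assuming the spider yields one single-pair dict per cell and that cells without a label produce an empty key. The function name merge_records and the sample items are hypothetical, and the trailing colons on the labels are an assumption about the page's text.

```python
def merge_records(items):
    """Merge single-pair dicts into one dict, skipping cells with no label."""
    merged = {}
    for item in items:
        for key, value in item.items():
            # Normalize the label: trim whitespace and any trailing colon.
            key = key.strip().rstrip(':').strip()
            if not key:
                continue  # cell had no <strong> label
            merged[key] = value.strip()
    return merged


# Hypothetical items in the shape the spider above yields.
raw_items = [
    {'': ''},
    {'Dwelling Units:': ' 1 '},
    {'Year Built:': '\r\n 2010'},
    {'': 'stray text'},
]

print(merge_records(raw_items))
```

The values stay as strings here; converting them to integers, or passing the merged dict to a dataframe constructor, would be a further step depending on the fields involved.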
