Using scrapy: response.xpath() returns no results when extracting data from an HTML table

Question

I've been building a web scraper in Python 3 using the scrapy library and I'm running into a problem I don't understand. I've successfully scraped other tables by using Inspect Element on the table to get the XPath. However, with this table, I can't figure out how to extract the data. I am new to HTML but not new to programming, so please help me if I'm way off here.

An example of this web page would be: http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1

Inspecting the page and getting the XPath for the target table yields //*[@id="aspnetForm"]/table/tbody/tr[3]/td[1]/table/tbody/tr[1]/td/table/tbody/tr[3]/td/table

However, using this in a scrapy shell, response.xpath(target).extract() returns []. Trying to target any individual cell also gives the same empty result. My intended result would be a dataframe or dictionary correlating something like {'Dwelling Units': 1, 'Year Built': 2010 ... }. Any help identifying where I'm going wrong, or how to get the data formatted as such, would be appreciated. Thanks!

Answer 1

import scrapy


class ResidentialRecordsSpider(scrapy.Spider):
    name = "residential_records"

    start_urls = [
        'http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1',
    ]

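    def parse(self, response):
        # Visit every cell of the table identified by its width attribute.
        for record in response.xpath('//table[@width="90%"]//td'):
            # The label is the bold text; the value is the cell's own text.
            key = record.xpath('./strong/text()').extract_first(default='')
            value = record.xpath('./text()').extract_first(default='')

            # One item per cell; cells without a label yield an empty key.
            yield { key: value }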
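
A likely reason the XPath copied from the browser returns [] is that browsers insert <tbody> elements into the DOM they display, while the raw HTML that Scrapy downloads often has none, so every tbody step matches nothing. This has not been verified against the page above. The sketch below illustrates the effect with the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without extra dependencies; the markup in RAW_HTML is hypothetical and only mimics a table written without <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the raw page source. There is no
# <tbody>, even though a browser's inspector would display one.
RAW_HTML = """
<html><body><form id="aspnetForm">
  <table>
    <tr><td><strong>Dwelling Units</strong> 1</td></tr>
    <tr><td><strong>Year Built</strong> 2010</td></tr>
  </table>
</form></body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path in the style copied from the browser, including the tbody step.
with_tbody = root.findall(".//form[@id='aspnetForm']/table/tbody/tr/td")

# The same path with the tbody step removed.
without_tbody = root.findall(".//form[@id='aspnetForm']/table/tr/td")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
```

In a scrapy shell the equivalent check would be to drop the /tbody steps from the copied expression, or to use a shorter attribute-based expression as the spider above does.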

From here, you only need to perform some data cleaning.
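
As a rough idea of what that cleaning could look like, here is a minimal sketch that merges the one-pair items into a single dictionary of the shape asked for in the question. The helper name clean_records and the sample raw_items are hypothetical; the real cell text on the page may need different handling.

```python
def clean_records(items):
    """Merge single-pair dicts into one dict, dropping empty labels."""
    cleaned = {}
    for item in items:
        for key, value in item.items():
            # Trim whitespace and a trailing colon from the label.
            key = key.strip().rstrip(":").strip()
            # Collapse runs of whitespace (including newlines) in the value.
            value = " ".join(value.split())
            if not key:
                continue  # cell had no <strong> label
            # Convert purely numeric values to int, keep the rest as text.
            cleaned[key] = int(value) if value.isdecimal() else value
    return cleaned


# Hypothetical raw items, shaped like what the spider might yield.
raw_items = [
    {"": ""},
    {"Dwelling Units": "\r\n 1 "},
    {"Year Built:": " 2010"},
]

print(clean_records(raw_items))
# {'Dwelling Units': 1, 'Year Built': 2010}
```

The resulting dictionary can be passed to pandas.DataFrame([record]) if a dataframe is preferred.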

Answer 1
1 Accepted 2018-06-07 05:04:11