Scrapy selector cannot return desired characters possibly due to Javascript
I'm trying to scrape data from this Chinese webpage: http://bxt.harbin.gov.cn/hrb_bzbxt/disshow.php?id=551950
In the Scrapy shell, I cannot get any text from any td element. For example,
response.xpath("/html/body/center[2]/table/tbody/tr[2]/td[3]/text()").extract()
returns an empty list. Other similar commands return an empty list too. When I inspect the HTML more closely, I find this in the head element: <script language="javascript">. I'm not sure if this is the cause of the problem. Could anybody help me figure it out? I searched Stack Overflow for related topics, but they were too complex for me to grasp. Thank you for your help!
The problem here is that you are using a full path to get to the information you want. This isn't necessary, so there is no need to follow html -> body -> center, etc.
You could just go directly to the td information you need, with something like:
response.xpath('//td/text()')
which will return a list of selectors (one for every text node inside a td tag) that you can iterate over to get the information you need.