Scraping an Element using lxml and XPath
The issue I'm having is scraping out the element itself. I'm able to scrape the first two values (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to scrape that element in a way that stays dynamic: I don't want to parse for the literal string "1300 Dunn Ave", I want to parse for the element that holds it. Here is the source code:
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
And here is my code:
from lxml import html
import requests
page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)
callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')
print 'Call Signal: ', callSignal
print "Dispatch Time: ", dispatchTime
print "Location: ", location
And this is my output:
Call Signal: ['150318182198']
Dispatch Time: ['3-18 10:25']
Location: []
Any idea on how I can scrape out the address?
This is the element you are looking for:
<a id="lstCallsForService_ctrl0_lnkAddress"
href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
As you can see, it is not a span element. Your current XPath expression:
//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
is looking for a span element with this ID, when it should actually be selecting an a element. Use
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
instead. Then, the result should be:
Location: ['1300 DUNN AVE']
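To check the difference without hitting the live site, here is a minimal sketch that parses an inline copy of the markup from the question. The `snippet` string and the variable names `wrong` and `right` are illustrative, and it uses the Python 3 `print()` function rather than the Python 2 statement used in the question.

```python
from lxml import html

# Inline copy of the markup from the question, so no network request is needed.
snippet = '''
<table><tr>
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank">1300 DUNN AVE</a></td>
</tr></table>
'''

tree = html.fromstring(snippet)

# The ID exists, but on an <a> element, so the span query matches nothing.
wrong = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

# Matching the correct element name returns the address text.
right = tree.xpath('//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

print(wrong)  # []
print(right)  # ['1300 DUNN AVE']
```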
Please also read alecxe's answer, which has more practical advice than mine.
First of all, it is an a element, not a span. And you need a double slash before the text():
//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()
Why a double slash? Because in reality this a element has no direct text node children:
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
You could also reach the text through the u tag:
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
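The three ways of reaching the text can be compared side by side. This is a small sketch on a compact inline copy of the markup above (written on one line here, which is an assumption about whitespace; if the real page has line breaks around the u tag, `/text()` may return whitespace-only strings rather than an empty list). The names `base`, `direct`, `nested`, and `via_u` are illustrative.

```python
from lxml import html

# Compact copy of the <a> element with its nested <u> child.
snippet = (
    '<div><a id="lstCallsForService_ctrl0_lnkAddress" '
    'href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" '
    'target="_blank"><u>5100 CLEVELAND RD</u></a></div>'
)

tree = html.fromstring(snippet)
base = '//a[@id="lstCallsForService_ctrl0_lnkAddress"]'

direct = tree.xpath(base + '/text()')    # only text nodes that are direct children of <a>
nested = tree.xpath(base + '//text()')   # text nodes at any depth below <a>
via_u = tree.xpath(base + '/u/text()')   # text nodes of the <u> child specifically

print(direct)  # []
print(nested)  # ['5100 CLEVELAND RD']
print(via_u)   # ['5100 CLEVELAND RD']
```

The `//text()` form keeps working whether or not the u wrapper is present, which makes it the more forgiving choice if the page markup varies.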
Besides, to scale the solution to multiple results: iterate over the table rows, find the cell values in each row with a partial id attribute match using contains(), and use the text_content() method to get the text. Implementation:
for item in tree.xpath('//tr[@class="closedCall"]'):
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()
    print 'Call Signal: ', callSignal
    print "Dispatch Time: ", dispatchTime
    print "Location: ", location
    print "------"
Prints:
Call Signal: 150318182333
Dispatch Time: 3-18 11:22
Location: 9600 APPLECROSS RD
------
Call Signal: 150318182263
Dispatch Time: 3-18 11:12
Location: 1100 E 1ST ST
------
...
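For reference, here is a hedged Python 3 variant of the same row-by-row approach that collects the values into a list of dictionaries instead of printing them. The inline `page` markup is a stand-in built from the values shown above (the `ctrl1` ids and the `href="#"` placeholders are assumptions), and `first_text` is a hypothetical helper that returns None rather than raising IndexError when a cell is missing.

```python
from lxml import html

# Stand-in for the real page: two rows shaped like the ones in the question.
page = '''
<table>
<tr class="closedCall">
  <td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182333</span></td>
  <td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 11:22</span></nobr></td>
  <td><a id="lstCallsForService_ctrl0_lnkAddress" href="#"><u>9600 APPLECROSS RD</u></a></td>
</tr>
<tr class="closedCall">
  <td><span id="lstCallsForService_ctrl1_lblIncidentNbr">150318182263</span></td>
  <td><nobr><span id="lstCallsForService_ctrl1_lblDispatchTime">3-18 11:12</span></nobr></td>
  <td><a id="lstCallsForService_ctrl1_lnkAddress" href="#"><u>1100 E 1ST ST</u></a></td>
</tr>
</table>
'''


def first_text(row, query):
    """Return the stripped text of the first match, or None if nothing matches."""
    matches = row.xpath(query)
    return matches[0].text_content().strip() if matches else None


tree = html.fromstring(page)
calls = []
for row in tree.xpath('//tr[@class="closedCall"]'):
    calls.append({
        'call_signal': first_text(row, './/span[contains(@id, "lblIncidentNbr")]'),
        'dispatch_time': first_text(row, './/span[contains(@id, "lblDispatchTime")]'),
        'location': first_text(row, './/a[contains(@id, "lnkAddress")]'),
    })

for call in calls:
    print(call)
```

Returning structured records makes it easier to filter or store the calls later, at the cost of a few more lines than the print-only loop.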