Scraping elements using lxml and XPath

Question

The issue I'm having is scraping out the element itself. I'm able to scrape the first two (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to scrape that element dynamically, so that I'm parsing for the element rather than for the literal text "1300 Dunn Ave". Here is the source code:

<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
    <a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>

And here is my code:

from lxml import html
import requests

page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)

callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')



print('Call Signal: ', callSignal)
print("Dispatch Time: ", dispatchTime)
print("Location: ", location)

And this is my output:

Call Signal:  ['150318182198']
Dispatch Time:  ['3-18 10:25']
Location:  []

Any idea on how I can scrape out the address?

Answer 1

This is the element you are looking for:

<a id="lstCallsForService_ctrl0_lnkAddress"
   href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
   target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>

As you can see, it is not a span element. Your current XPath expression:

//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()

is looking for a span element with this ID, when it should actually be selecting an a element. Use

//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()

instead. Then, the result should be

Location:  ['1300 DUNN AVE']
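
To check the difference without hitting the live page, here is a minimal sketch that runs both expressions against an inline copy of the markup from the question. The SNIPPET constant and the surrounding table wrapper are my own scaffolding, not part of the real page.

```python
from lxml import html

# Inline copy of the markup from the question (the <table>/<tr> wrapper is
# an assumption added so the fragment parses as a single element).
SNIPPET = """
<table><tr>
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td>
    <a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
</tr></table>
"""

tree = html.fromstring(SNIPPET)

# Wrong element name: no span carries this id, so nothing matches.
as_span = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

# Correct element name: the id belongs to an a element.
as_anchor = tree.xpath('//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

print("span:", as_span)
print("a:", as_anchor)
```

The element name in an XPath step is part of the match, so a correct id on the wrong tag name silently yields an empty list rather than an error.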

Please also read alecxe's answer, which has more practical advice than mine.

Answer 2

First of all, it is an a element, not a span. And you need a double slash before the text():

//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()

Why a double slash? Because in reality the address text is not a direct child of this a element; it sits inside a nested u element, so a single-slash text() only sees the surrounding whitespace:

<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
    <u>5100 CLEVELAND RD</u>
</a>

You could also reach the text through the u tag:

//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
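
As a rough illustration of how the three expressions differ, the sketch below runs them against an inline copy of the nested markup. SNIPPET, the div wrapper, and the *_clean names are assumptions made for the example; whitespace-only strings are filtered out because the indentation inside the a element also shows up as text nodes.

```python
from lxml import html

# Inline copy of the nested markup shown above (the <div> wrapper is an
# assumption added so the fragment has a single root element).
SNIPPET = """
<div>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
    <u>5100 CLEVELAND RD</u>
</a>
</div>
"""

tree = html.fromstring(SNIPPET)
base = '//a[@id="lstCallsForService_ctrl0_lnkAddress"]'

direct = tree.xpath(base + '/text()')    # text nodes that are children of <a>
nested = tree.xpath(base + '//text()')   # text nodes anywhere below <a>
via_u = tree.xpath(base + '/u/text()')   # text nodes that are children of <u>

# Keep only strings that contain something other than whitespace.
direct_clean = [t.strip() for t in direct if t.strip()]
nested_clean = [t.strip() for t in nested if t.strip()]

print("single slash:", direct_clean)
print("double slash:", nested_clean)
print("through u:   ", via_u)
```

The double-slash form is the more forgiving choice if the page sometimes wraps the address in u and sometimes does not, while the /u/text() form only works when the wrapper is present.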

Besides, to scale the solution to multiple results:

iterate over the table rows
for every row, find the cell values using a partial id attribute match with contains()
use the text_content() method to get the text

Implementation:

for item in tree.xpath('//tr[@class="closedCall"]'):
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()

    print('Call Signal: ', callSignal)
    print("Dispatch Time: ", dispatchTime)
    print("Location: ", location)
    print("------")

Prints:

Call Signal:  150318182333
Dispatch Time:  3-18 11:22
Location:  9600 APPLECROSS RD
------
Call Signal:  150318182263
Dispatch Time:  3-18 11:12
Location:  1100 E 1ST ST
------
...
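
If you want to try the row-by-row approach offline, here is a self-contained sketch along the same lines. The SAMPLE markup is hypothetical: it is modeled on the rows above, and the ctrl1/ctrl2 ids and the incomplete third row are my assumptions about how the page numbers its rows. The first_text and parse_calls helpers are illustrative names, not part of lxml. Unlike the loop above, it skips a row with a missing cell instead of raising IndexError on [0].

```python
from lxml import html

# Hypothetical sample markup modeled on the page; the third row is
# deliberately incomplete to exercise the guard below.
SAMPLE = """
<table>
  <tr class="closedCall">
    <td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182333</span></td>
    <td><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 11:22</span></td>
    <td><a id="lstCallsForService_ctrl0_lnkAddress"><u>9600 APPLECROSS RD</u></a></td>
  </tr>
  <tr class="closedCall">
    <td><span id="lstCallsForService_ctrl1_lblIncidentNbr">150318182263</span></td>
    <td><span id="lstCallsForService_ctrl1_lblDispatchTime">3-18 11:12</span></td>
    <td><a id="lstCallsForService_ctrl1_lnkAddress"><u>1100 E 1ST ST</u></a></td>
  </tr>
  <tr class="closedCall">
    <td><span id="lstCallsForService_ctrl2_lblIncidentNbr">150318182198</span></td>
  </tr>
</table>
"""


def first_text(row, xpath_expr):
    """Return the stripped text of the first match, or None if nothing matches."""
    matches = row.xpath(xpath_expr)
    return matches[0].text_content().strip() if matches else None


def parse_calls(tree):
    """Collect one dict per complete row with class "closedCall"."""
    calls = []
    for row in tree.xpath('//tr[@class="closedCall"]'):
        call = {
            "call_signal": first_text(row, './/span[contains(@id, "lblIncidentNbr")]'),
            "dispatch_time": first_text(row, './/span[contains(@id, "lblDispatchTime")]'),
            "location": first_text(row, './/a[contains(@id, "lnkAddress")]'),
        }
        # Skip rows where any of the three cells is missing.
        if None in call.values():
            continue
        calls.append(call)
    return calls


calls = parse_calls(html.fromstring(SAMPLE))
for call in calls:
    print(call)
```

Note the leading dot in './/span[...]': it keeps each lookup relative to the current row, whereas a bare '//span[...]' would search the whole document on every iteration.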
