HTML scraping using lxml and requests

Question

I was following this tutorial, http://docs.python-guide.org/en/latest/scenarios/scrape/ , to scrape an HTML table, and it doesn't work.

My code:

import requests
from lxml import html

page = requests.get('http://www.dti.ufv.br/horario/horario.asp?ano=2015&semestre=1&depto=MAT')
tree = html.fromstring(page.text)

vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')
print(vaga)

I think the problem is with the XPath... I got it using Google Chrome, as the tutorial said, but the result is not like in the tutorial. Can anyone help me get the right XPath? Thanks, guys!

Answer 1

The HTML content of the page contains no tbody tag.

The code, however, relies on tbody steps in the XPath to reach the target tag:

vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')

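This will always return an empty list because the tbody tag is not present in the HTML content. Browsers such as Chrome insert tbody elements into the DOM they build, so an XPath copied from the developer tools can contain steps that do not exist in the raw HTML that requests downloads.

HTML content: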

 <table width="760" border="0" cellspacing="0" cellpadding="0">
    <tr>
      <td><img src="img/topo.jpg" width="760" height="101"></td>
    </tr>
    <tr>
      <td background="img/conteudo.jpg"><p align="right"><img src="img/setas_voltar.jpg" width="8" height="7"> <font size="1"><strong><a href="javascript:history.back();">voltar</a>&nbsp;</strong></font></p>
        <TABLE WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=1>
        <TR>
          <TD align=center> <br>
              <font color="Black" size=2><b> Hor&aacute;rio de Aulas 2015/1</b></font><br>          </TD>
        </TR>
      </TABLE>
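
To illustrate the point, here is a minimal sketch that runs offline. The SAMPLE markup is a hypothetical, simplified stand-in modeled on the excerpt above, not the real page; it only assumes that lxml, like the server, leaves the tables without tbody.

```python
from lxml import html

# Hypothetical, simplified markup modeled on the excerpt above (not the real page).
SAMPLE = """
<html><body><center>
<table>
  <tr><td>header</td></tr>
  <tr><td>
    <table><tr><td>Horario de Aulas 2015/1</td></tr></table>
  </td></tr>
</table>
</center></body></html>
"""

tree = html.fromstring(SAMPLE)

# XPath as copied from a browser: includes tbody steps that are absent from the raw HTML.
with_tbody = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table/tbody/tr[1]/td[1]')

# Same path with the tbody steps removed.
without_tbody = tree.xpath('/html/body/center/table/tr[2]/td/table/tr[1]/td[1]')

print(with_tbody)  # no element matches the tbody steps
print([td.text_content().strip() for td in without_tbody])
```

For the page in the question, the same change would mean dropping both tbody steps, i.e. trying '/html/body/center/table/tr[2]/td/table[2]/tr[108]/td[9]'. Whether that row and column hold the wanted value cannot be confirmed from the excerpt shown here, so the indices may still need adjusting.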
