
HTML Scraping with lxml and Requests

I was following this tutorial, http://docs.python-guide.org/en/latest/scenarios/scrape/, to scrape an HTML table, but it doesn't work well.

My code:

import requests
from lxml import html

page = requests.get('http://www.dti.ufv.br/horario/horario.asp?ano=2015&semestre=1&depto=MAT')
tree = html.fromstring(page.text)

vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')
print(vaga)

I think the problem is with the XPath. I copied it from Google Chrome as the tutorial describes, but the result is not what the tutorial shows. Can anyone help me get the right XPath? Thanks, guys!

The HTML content of the page contains no tbody tag.

The XPath in the code, however, uses tbody steps to locate the target cell:

vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')

This will always return an empty list, because the tbody tag is not present in the HTML content.
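A likely reason for the mismatch is that browsers such as Chrome add implied tbody elements to the DOM shown in the developer tools, so an XPath copied from there can contain steps that do not exist in the raw HTML that lxml parses. The sketch below illustrates the effect on a small made-up document: SAMPLE is a hypothetical stand-in for the real page, and the row and column indices are chosen only for the illustration.

```python
from lxml import html

# Hypothetical stand-in for the real page: nested tables and no <tbody> tags.
SAMPLE = """
<html><body><center>
<table>
  <tr><td>header</td></tr>
  <tr><td>
    <table><tr><td>title</td></tr></table>
    <table>
      <tr><td>MAT 101</td><td>30</td></tr>
      <tr><td>MAT 102</td><td>12</td></tr>
    </table>
  </td></tr>
</table>
</center></body></html>
"""

tree = html.fromstring(SAMPLE)

# XPath in the style copied from Chrome: includes tbody steps.
with_tbody = tree.xpath(
    '/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[2]/td[2]')

# Same path with the tbody steps removed, matching the raw HTML.
without_tbody = tree.xpath(
    '/html/body/center/table/tr[2]/td/table[2]/tr[2]/td[2]')

print(with_tbody)                                   # nothing matched
print([td.text_content() for td in without_tbody])  # text of the matched cell
```

For the page in the question, the analogous change would be to drop both /tbody steps from the path. Whether tr[108]/td[9] then lands on the intended cell has not been verified here.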

HTML content:

 <table width="760" border="0" cellspacing="0" cellpadding="0">
    <tr>
      <td><img src="img/topo.jpg" width="760" height="101"></td>
    </tr>
    <tr>
      <td background="img/conteudo.jpg"><p align="right"><img src="img/setas_voltar.jpg" width="8" height="7"> <font size="1"><strong><a href="javascript:history.back();">voltar</a>&nbsp;</strong></font></p>
        <TABLE WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=1>
        <TR>
          <TD align=center> <br>
              <font color="Black" size=2><b> Hor&aacute;rio de Aulas 2015/1</b></font><br>          </TD>
        </TR>
      </TABLE>
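One way to check this before writing an XPath is to parse the HTML with lxml and count the tbody elements it actually sees. The sketch below does that on a shortened copy of the excerpt above; EXCERPT is an assumption made for the example (the closing tags of the outer table were added so the snippet stands alone, and some markup was trimmed). It also uses a relative `//` path, which is an alternative to the absolute path in the question and not something the answer itself proposes.

```python
from lxml import html

# Shortened copy of the excerpt; outer closing tags added for the example.
EXCERPT = """
<table width="760" border="0" cellspacing="0" cellpadding="0">
  <tr>
    <td><img src="img/topo.jpg" width="760" height="101"></td>
  </tr>
  <tr>
    <td background="img/conteudo.jpg">
      <TABLE WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=1>
        <TR>
          <TD align=center><font color="Black" size=2><b> Hor&aacute;rio de Aulas 2015/1</b></font></TD>
        </TR>
      </TABLE>
    </td>
  </tr>
</table>
"""

root = html.fromstring(EXCERPT)

# lxml does not add implied <tbody> elements, so none are found.
tbody_count = len(root.xpath('//tbody'))

# A relative path avoids depending on every ancestor element.
titles = [b.text_content().strip() for b in root.xpath('//table//table//b')]

print(tbody_count)
print(titles)
```

Relative paths like this tend to survive small layout changes better than a long absolute path, at the cost of being less specific about which table is meant.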
