HTML Scraping with lxml and Requests
I was following this tutorial, http://docs.python-guide.org/en/latest/scenarios/scrape/ , to scrape an HTML table, and it doesn't work well.
My code:
import requests
from lxml import html
page = requests.get('http://www.dti.ufv.br/horario/horario.asp?ano=2015&semestre=1&depto=MAT')
tree = html.fromstring(page.text)
vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')
print(vaga)
I think the problem is with the XPath... I got it from Google Chrome as the tutorial said, but the result is not like in the tutorial. Can anyone help me get the right XPath? Thanks guys!
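The HTML content returned by the server has no tbody tag. (Browsers such as Chrome insert tbody into the DOM they build, which is why an XPath copied from the developer tools contains it.)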
The XPath in your code, however, goes through tbody tags to reach the target tag:
vaga = tree.xpath('/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[108]/td[9]')
This will always return an empty list, because the tbody tag is not present in the HTML content.
HTML content:
<table width="760" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><img src="img/topo.jpg" width="760" height="101"></td>
</tr>
<tr>
<td background="img/conteudo.jpg"><p align="right"><img src="img/setas_voltar.jpg" width="8" height="7"> <font size="1"><strong><a href="javascript:history.back();">voltar</a> </strong></font></p>
<TABLE WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=1>
<TR>
<TD align=center> <br>
<font color="Black" size=2><b> Horário de Aulas 2015/1</b></font><br> </TD>
</TR>
</TABLE>
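To illustrate the point, here is a minimal sketch that runs both forms of the XPath against a small inline document. The SAMPLE markup is hypothetical and much simpler than the real page. It only mimics the nesting (a table inside the second row of an outer table) and, like the server's response, contains no tbody. The row and column indexes are chosen for the sample, not taken from the real timetable.

```python
from lxml import html

# Hypothetical, simplified markup standing in for the real page.
# Like the server's response, it contains no <tbody> elements.
SAMPLE = """
<html><body><center>
<table>
  <tr><td>header</td></tr>
  <tr><td>
    <table><tr><td>first</td></tr></table>
    <table>
      <tr><td>a</td><td>b</td></tr>
      <tr><td>c</td><td>d</td></tr>
    </table>
  </td></tr>
</table>
</center></body></html>
"""

tree = html.fromstring(SAMPLE)

# XPath in the style copied from Chrome: the tbody steps match nothing,
# because lxml does not add tbody elements while parsing.
with_tbody = tree.xpath(
    '/html/body/center/table/tbody/tr[2]/td/table[2]/tbody/tr[2]/td[2]')

# Same path with the tbody steps removed.
without_tbody = tree.xpath(
    '/html/body/center/table/tr[2]/td/table[2]/tr[2]/td[2]')

print(with_tbody)
print([td.text_content().strip() for td in without_tbody])
```

For the real page, the equivalent change would presumably be to drop both tbody steps from the original expression, giving '/html/body/center/table/tr[2]/td/table[2]/tr[108]/td[9]'. This has not been checked against the live site, so the indexes may still need adjusting.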