
LXML webpage scraping, malformed HTML

I am trying to scrape the article text from this web site, http://sana.sy/eng/21/2013/01/07/pr-460536.htm, but its HTML is malformed. Can anyone show me how to get it right?

This is the code:
import urllib2
from lxml import etree
import StringIO

speachesurls = ["http://sana.sy/eng/21/2013/01/07/pr-460536.htm", "http://sana.sy/eng/21/2012/06/04/pr-423234.htm", "http://sana.sy/eng/21/2012/01/12/pr-393338.htm"]


# scrape the speaches

for url in speachesurls:
    result = urllib2.urlopen(url)
    html = result.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath = "//html/body/table[3]/tbody/tr[3]/td[4]/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/div/table/tbody/tr[2]/td/div/p"
    a = tree.find(xpath)
    print a.text_content() 

It's not a problem with lxml or with the malformed HTML; lxml's HTML parser can deal with that.

Your code works fine up to the lookup; it's just that your XPath expression doesn't match anything, so a will be None, and calling a.text_content() on it then raises an AttributeError.
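To illustrate the point, here is a minimal sketch of how an over-specific XPath can fail to match while a looser one succeeds. The SAMPLE markup is a hypothetical stand-in, not the real page. The sketch assumes, without having confirmed it for this site, that the long path was copied from a browser inspector, which typically shows tbody elements that are absent from the raw source lxml parses.

```python
from lxml import html

# Hypothetical stand-in for the page: a table with no <tbody> in the raw source.
SAMPLE = """
<html><body>
<table><tr><td>
  <div><p>First paragraph.</p><p>Second paragraph.</p></div>
</td></tr></table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The source contains no <tbody>, and lxml's HTML parser does not add one,
# so a path that requires it matches nothing.
strict = tree.xpath("/html/body/table/tbody/tr/td/div/p")

# find() reports a miss as None, which is what happened to `a` in the question.
missing = tree.find(".//tbody")

# A looser expression that does not depend on the exact nesting.
loose = tree.xpath("//td/div/p")

# Check for an empty result before using it, instead of assuming a match.
paragraphs = [p.text_content().strip() for p in loose]
print(strict, missing, paragraphs)
```

Testing the expression against the parsed tree like this, rather than against what the browser displays, is one way to find out which step of the path stops matching.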
