Python 3 scraper doesn't parse the XPath till the end

Question

I'm using the lxml.html module:

from lxml import html   

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')

# print(page.content)

unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

print(len(unis))

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')

The page at http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z lists a large number of universities.

The problem is that it only parses up to the letter 'H' (244 universities). I can't understand why, because as far as I can tell it reads all the HTML right to the end.

I also checked that 244 is not a limit on the size of a list, or on anything else, in Python 3.

Answer 1

That HTML page simply isn't valid HTML; it's totally broken. The following will do what you want, though. It uses the BeautifulSoup parser.

from lxml.html.soupparser import parse
import urllib.request  # a bare "import urllib" does not reliably load the request submodule

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
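To check whether a scrape really reached the end of the alphabet, rather than stopping around 'H' as in the question, something like the following could help. It is only a rough sketch: the helper name `missing_initials` and the sample names are made up for illustration, and some letters may legitimately have no entries on the real page.

```python
import string


def missing_initials(names):
    """Return the uppercase ASCII letters that no name in `names` starts with."""
    seen = {name.strip()[0].upper() for name in names if name.strip()}
    return [letter for letter in string.ascii_uppercase if letter not in seen]


# Hypothetical sample that stops early, like the truncated result in the question.
truncated = ["Alpha College", "Gamma University", "Hotel Institute"]

print(missing_initials(truncated))
```

If the list printed for the real `unis` result contains every letter after 'H', the parser most likely gave up partway through the document.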

Answer 2

For web scraping I recommend using BeautifulSoup 4. With bs4 this is easily done:

from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')

soup = BeautifulSoup(result.read(),'html.parser')

table = soup.find_all(lambda tag: tag.name=='table')
for t in table:
    rows = t.find_all(lambda tag: tag.name=='tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name=='h3' and tag.text.isspace()==False and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
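The same filtering idea (take every h3, skip the empty ones and the single-letter A-Z headers) can also be sketched with only the standard library's `html.parser`, which is the lenient parser that the 'html.parser' argument selects above. This is a swapped-in illustration, not the code from the answer: the class `H3Collector`, the function `extract_universities`, and the broken `SAMPLE` markup are all hypothetical.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text of every <h3> element, tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._chunks = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self._in_h3 = False
            text = "".join(self._chunks).strip()
            # Skip empty headers and the short A-Z letter headers.
            if len(text) > 2:
                self.names.append(text)

    def handle_data(self, data):
        if self._in_h3:
            self._chunks.append(data)


def extract_universities(markup):
    """Return the non-trivial <h3> texts found in `markup`."""
    parser = H3Collector()
    parser.feed(markup)
    parser.close()
    return parser.names


# Hypothetical, deliberately broken sample: an unclosed <td>/<tr> and a stray </div>.
SAMPLE = """
<table>
<tr><td><h3>A</h3></td></tr>
<tr><td valign="top"><h3>Alpha University</h3>
</div>
<tr><td><h3> </h3></td></tr>
<tr><td valign="top"><h3>Zeta University</h3></td></tr>
</table>
"""

print(extract_universities(SAMPLE))
```

Because this parser only reports tags and text as it encounters them, unbalanced markup does not cut the result short; the trade-off is that there is no tree to run XPath queries against.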
