Python3 scraper doesn't parse the XPath till the end
I'm using the lxml.html module:
from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))
with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website ( http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z ) lists a large number of universities.
The problem is that the scraper only gets as far as the letter 'H' (244 universities). I can't understand why, because as far as I can see it parses all the HTML to the end.
I also read up on this and confirmed that 244 is not a limit on list size or anything else in Python 3.
That HTML page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.
from lxml.html.soupparser import parse
# Import the submodule explicitly; a bare "import urllib" does not load urllib.request.
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
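One rough way to check whether a parser is silently dropping part of a broken page is to count the opening h3 tags in the raw markup and compare that number with the length of the XPath result. The sketch below is only an illustration: the function name `count_h3_open_tags` and the inline `sample` string are made up for the example, and the raw count is an upper bound because it also includes any h3 elements that are not university names.

```python
import re


def count_h3_open_tags(raw_html: str) -> int:
    """Count opening <h3> tags in raw markup, ignoring case and attributes."""
    return len(re.findall(r"<h3\b", raw_html, flags=re.IGNORECASE))


# Hypothetical, deliberately malformed snippet (unclosed <tr>, truncated </table).
sample = (
    "<table><tr><td><h3>Alpha University</h3></td></tr>"
    "<tr><td><H3 class='x'>Beta College</h3></table"
)

print(count_h3_open_tags(sample))
```

If the raw count is much larger than `len(unis)`, the parser probably gave up on part of the document rather than the XPath being wrong.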
For web scraping I recommend BeautifulSoup 4. With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(), 'html.parser')
table = soup.find_all(lambda tag: tag.name == 'table')
for t in table:
    rows = t.find_all(lambda tag: tag.name == 'tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name == 'h3' and not tag.text.isspace() and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
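The same filtering idea can be tried offline with only the standard library, since 'html.parser' is Python's built-in parser. This is a minimal sketch, not the answer's method: the class name `H3Collector` and the inline `sample` markup are hypothetical, and unlike the loop above it stores the stripped text rather than the raw `h.text`.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text of every <h3> whose stripped text is longer than 2 characters."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False   # True while we are between <h3> and </h3>
        self._parts = []      # text fragments of the current <h3>
        self.names = []       # accepted heading texts

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._parts = []

    def handle_data(self, data):
        if self._in_h3:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            text = "".join(self._parts).strip()
            # Skips single-letter A-Z headers and empty/whitespace-only headers.
            if len(text) > 2:
                self.names.append(text)
            self._in_h3 = False


# Hypothetical malformed snippet: a letter header, an empty header, unclosed rows.
sample = (
    "<table><tr><td><h3>A</h3></td></tr>"
    "<tr><td><h3>   </h3></td></tr>"
    "<tr><td><h3>Alpha University</h3></td>"
    "<tr><td><h3>Beta College</h3></table>"
)

collector = H3Collector()
collector.feed(sample)
collector.close()
print(collector.names)
```

Because the parser is event-based, the unclosed rows in the sample do not stop it from reaching the later headings.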