
Python3 scraper. Doesn't parse the xpath till the end

I'm using the lxml.html module:

from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')

# print(html.tostring(page))  # dump the parsed tree for debugging

unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

print(len(unis))

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')

The website right here ( http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z ) is full of universities.

The problem is that it only parses up to the letter 'H' (244 universities). I can't understand why, since as far as I can see it parses all the HTML to the end.

I also checked that 244 is not some list-size limit in Python 3.

That HTML page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.

from lxml.html.soupparser import parse
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
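To see why a lenient parser survives markup like this, here is a minimal, self-contained sketch using only the standard library's event-based `html.parser`. The `BROKEN` snippet is a made-up miniature of the CCNE page's style of markup (unclosed `<tr>`/`<td>` tags), not taken from the live site; an event-based parser never needs the tags to nest correctly, so all the `<h3>` headings are still recovered.

```python
from html.parser import HTMLParser

# Hypothetical broken snippet in the spirit of the CCNE page:
# <tr> and <td> are never closed, which can trip up strict parsers.
BROKEN = "<table><tr><td><h3>Alpha University</h3><tr><td><h3>Beta College</h3></table>"

class H3Collector(HTMLParser):
    """Collect the text inside every <h3>, ignoring the broken nesting."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3 and data.strip():
            self.names.append(data.strip())

collector = H3Collector()
collector.feed(BROKEN)
print(collector.names)  # both headings survive the broken markup
```

A strict, tree-building parser has to decide where the unclosed rows end, and a wrong guess can silently drop everything after the first malformed spot, which is consistent with the scrape stopping at 'H'.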

For web scraping I recommend BeautifulSoup 4. With bs4 this is easily done:

from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')

soup = BeautifulSoup(result.read(), 'html.parser')

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # skip the single-letter A-Z section headers and the empty headers
        headers = row.find_all(lambda tag: tag.name == 'h3'
                               and not tag.text.isspace()
                               and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
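The header filter in that loop can be pulled out and sanity-checked on its own, without touching the network. The sample strings below are hypothetical stand-ins for what the page's `<h3>` cells contain, not data from the live site:

```python
def is_university(text):
    # Mirrors the filter in the answer above: single letters such as "A"
    # are the page's A-Z section anchors, and some <h3> cells hold only
    # whitespace, so anything of 2 characters or fewer after stripping
    # is discarded.
    return len(text.strip()) > 2

# Hypothetical sample values standing in for scraped <h3> text.
samples = ["A", "   ", "Alpha University", "Z", "Beta College"]
print([s for s in samples if is_university(s)])
```

Keeping the predicate as a named function also makes the `find_all` call shorter: `row.find_all(lambda tag: tag.name == 'h3' and is_university(tag.text))`.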

