How do I use Python and lxml to parse a local html file?

I am working with a local HTML file in Python, and I am trying to use lxml to parse it. For some reason I can't get the file to load properly, and I'm not sure whether this is because I don't have an HTTP server set up on my local machine, because of how I'm using etree, or because of something else.

My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: "Requests : No connection adapters were found for, error in Python3"

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback that I'm getting reads:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see that it has something to do with a "connection adapter", but I'm not sure what that means.

If the file is local, you shouldn't be using requests. It expects to be talking to a web server, which is why it reports that no connection adapter was found for a path that is not an http:// or https:// URL. Just open the file and read it in:

from lxml import html
# Read the local file directly; no HTTP request is involved.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
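
To see the read-then-parse approach end to end, here is a minimal sketch, written for Python 3 (the question itself runs Python 2.7). The sample markup, the file name sample.html, and the helper names extract_strong_text and demo are made up for illustration; the XPath is a shortened stand-in for the long one in the question.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page, standing in for the real site_1.html.
SAMPLE = "<html><body><div><p><strong>Hello</strong> world</p></div></body></html>"


def extract_strong_text(path):
    """Read a local HTML file and return the text of each <p>/<strong> node."""
    with open(path, "r", encoding="utf-8") as f:
        page = f.read()
    tree = html.fromstring(page)  # parse the string; no network access needed
    return tree.xpath("//p/strong/text()")


def demo():
    # Write the sample to a throwaway directory, then parse it from disk.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE)
        return extract_strong_text(path)


if __name__ == "__main__":
    print(demo())  # ['Hello']
```

Note that xpath() returns a list, so an empty list (rather than an exception) is what you get when the expression matches nothing.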

A more direct way is to use the parse function instead of fromstring; it takes the file path and reads the file itself:

from lxml import html
# Raw string: keeps the backslashes in the Windows path literal.
tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))
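
One difference worth knowing: html.parse returns an ElementTree wrapping the whole document, while html.fromstring returns an element. The sketch below illustrates this under the same assumptions as before (Python 3, a throwaway sample file; parse_title and demo are hypothetical helper names).

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page used only for this illustration.
SAMPLE = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"


def parse_title(path):
    """Parse a local HTML file by path and return (root tag, title texts)."""
    tree = html.parse(path)   # an ElementTree; lxml opens the file itself
    root = tree.getroot()     # the top-level <html> element
    return root.tag, tree.xpath("//title/text()")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE)
        return parse_title(path)


if __name__ == "__main__":
    print(demo())  # ('html', ['Demo'])
```

XPath queries work on either object, so the rest of the question's code would not need to change.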

You can also try using Beautiful Soup:

from bs4 import BeautifulSoup

# Name the parser explicitly; the context manager closes the file.
with open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
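
As a rough equivalent of the XPath lookup in the question, the following sketch pulls the text of the first strong tag with Beautiful Soup. It assumes Python 3 and the built-in "html.parser" backend; the sample markup and the names first_strong_text and demo are invented for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page used only for this illustration.
SAMPLE = "<html><body><p><strong>Hello</strong> world</p></body></html>"


def first_strong_text(path):
    """Return the text of the first <strong> tag in a local file, or None."""
    with open(path, encoding="utf8") as f:
        soup = BeautifulSoup(f, "html.parser")
    tag = soup.find("strong")  # find() returns None when nothing matches
    return tag.get_text() if tag is not None else None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.html")
        with open(path, "w", encoding="utf8") as f:
            f.write(SAMPLE)
        return first_strong_text(path)


if __name__ == "__main__":
    print(demo())  # Hello
```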
