
How do I use Python and lxml to parse a local html file?

I'm working with a local html file in Python, and I'm trying to use lxml to parse the file. For some reason I can't load the file properly, and I'm not sure whether this has to do with not having an http server set up on my local machine, with my use of etree, or with something else.

My reference for this code was: http://docs.python-guide.org/en/latest/scenarios/scrape/

This might be a related question: Requests : No connection adapters were found for, error in Python3

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback I get is as follows:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see it has something to do with "connection adapters", but I'm not sure what that means.

If the file is local, you shouldn't be using requests at all; just open the file and read it in. requests exists to talk to web servers, so it maps each URL scheme (http://, https://) to a connection adapter, and a bare Windows path has no scheme it recognizes, hence the InvalidSchema error.

from lxml import html

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
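To connect this back to the original goal, here is a minimal, self-contained sketch of fromstring plus xpath. The HTML string and the XPath expression below are stand-ins for illustration, not the asker's actual page or path:

```python
from lxml import html

# In practice `source` would come from open(path).read() as above;
# an inline string keeps this sketch runnable anywhere.
source = "<html><body><p><strong>hello</strong></p></body></html>"

tree = html.fromstring(source)

# xpath() returns a list of matches; here, the text of every matching <strong>.
result = tree.xpath('//p/strong/text()')
print(result)  # -> ['hello']
```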

There is an even better way: use the parse function instead of fromstring. parse reads straight from a filename, so you don't have to open the file yourself (note the raw string, so the backslashes aren't treated as escape sequences):

tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))
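A runnable sketch of the parse approach, using a temporary file as a stand-in for the real path; note that parse returns an ElementTree, not an Element, so you call getroot() to get at the document:

```python
import os
import tempfile
from lxml import html

# Write a throwaway HTML file so the sketch runs anywhere;
# substitute your own path (e.g. r"C:\Users\...\site_1.html").
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("<html><body><p><strong>hi</strong></p></body></html>")

tree = html.parse(path)   # ElementTree wrapping the whole document
root = tree.getroot()     # the <html> Element
strongs = root.xpath('//strong/text()')
print(strongs)            # -> ['hi']
os.remove(path)
```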

You could also try using Beautiful Soup:

from bs4 import BeautifulSoup

with open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
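A self-contained sketch of the Beautiful Soup route; parsing from an inline string here for portability (with a file you would pass the open file object instead), and the tag used is just an example:

```python
from bs4 import BeautifulSoup

doc = "<html><body><p><strong>hello</strong></p></body></html>"

# Naming an explicit parser ("html.parser" ships with the standard library)
# avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(doc, "html.parser")

text = soup.find("strong").get_text()
print(text)  # -> hello
```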

