lxml.html does not find the body tag

Question

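I use lxml.html to parse various HTML pages. I have now noticed that, at least for some pages, it does not find the body tag even though one is present, whereas Beautiful Soup does find it (even when it uses lxml as its parser).

Example page: https://plus.google.com/ (what remains of it)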

import lxml.html
import bs4

html_string = """
    ... source code of https://plus.google.com/ (manually copied) ...
"""

# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')

# Beautiful soup using lxml parser succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')

Any guesses about what is happening here are welcome :)

Update:

The problem seems to be related to encoding.

# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')
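A side note on what the workaround does to the input, as a minimal sketch using only the standard library. The sample string is hypothetical and is not taken from the page in the question; the sketch does not explain why lxml behaves differently on that page, it only shows what the `unicode-escape` codec produces.

```python
# Hypothetical sample text, not taken from the page in the question.
sample = "<p>café 日本\n</p>"

escaped = sample.encode("unicode-escape")

# The result is plain ASCII bytes: non-ASCII characters and the newline
# are replaced by literal backslash escape sequences.
print(escaped)
print(escaped.isascii())

# Decoding with the same codec restores the original string.
print(escaped.decode("unicode-escape") == sample)
```

Because the parser then sees the literal escape sequences rather than the original characters, any text extracted from the parsed tree would need to be decoded again. Passing bytes in the page's real encoding may be a cleaner option, but that is an assumption and was not tested against the page in the question.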

Answer 1

You can use something like this:

import requests
import lxml.html

html_string = requests.get("https://plus.google.com/").content
body = lxml.html.document_fromstring(html_string).find('body')

The body variable now contains the body HTML element.
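For comparison, here is a small sketch of how `document_fromstring` and `fromstring` differ in what they return, which affects whether `find('body')` can match. The miniature document below is a hypothetical stand-in, and this is not necessarily what went wrong with the page in the question.

```python
import lxml.html

# Hypothetical miniature document standing in for a real page.
page = (
    b"<!DOCTYPE html><html><head><title>t</title></head>"
    b"<body><p>hi</p></body></html>"
)

# document_fromstring() always returns the root <html> element.
root = lxml.html.document_fromstring(page)
print(root.tag)

# find('body') only searches the direct children of the element.
body = root.find("body")
print(body is not None)

# fromstring() on a fragment returns the fragment's own element,
# so the same find() call has no <body> child to match.
fragment = lxml.html.fromstring("<p>hi</p>")
print(fragment.tag, fragment.find("body"))
```

Note that the answer's code passes bytes (`requests.get(...).content`) rather than a decoded string, which matches the update's observation that the input encoding matters.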
