
Python BeautifulSoup output unusual spacing and characters

I'm new to Python.

I'm trying to parse data from a website using BeautifulSoup; I have successfully used BeautifulSoup before. However, for this particular website the returned data has spaces between every character and lots of "&gt;" sequences as well.

The weird thing is that if I copy the page source, add it to my local Apache instance, and make a request to my local copy, the output is perfect. I should mention the differences between my local copy and the website:

  1. my local copy does not use HTTPS
  2. my local copy does not require authentication, whereas the website requires Active Directory auth, for which I am using requests_ntlm

import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth(r'DOMAIN\USER', 'PASS'))  # raw string: '\U' is invalid otherwise
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)

It looks like the local server returns content encoded as UTF-8, while the main website uses UTF-16. This suggests the main website is not configured correctly. However, it is possible to work around the issue in code.
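The symptoms described above (a space between every character, and entities like "&gt;" split apart) are exactly what UTF-16 bytes look like when decoded as an 8-bit encoding. A minimal sketch, assuming the server sends UTF-16-LE and the client falls back to Latin-1: ASCII text encoded as UTF-16-LE carries a NUL byte after every character, and those NULs often render as blank space.

```python
# ASCII markup encoded as UTF-16-LE: one NUL byte follows every character.
raw = "<p>&gt;</p>".encode("utf-16-le")

wrong = raw.decode("latin-1")    # what a mis-configured client sees
right = raw.decode("utf-16-le")  # what the server actually sent

print(repr(wrong))  # '<\x00p\x00>\x00&\x00g\x00t\x00;\x00<\x00/\x00p\x00>\x00'
print(right)        # <p>&gt;</p>
```

The interleaved `\x00` bytes are why the page appears to have "spaces between every character".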

requests chooses the text encoding from the response headers; when the Content-Type header declares no charset for a text response, it falls back to ISO-8859-1. The response object has a property called apparent_encoding, which reads the body and detects the likely encoding using chardet (or charset_normalizer in newer versions of requests). However, apparent_encoding is never applied unless you assign it explicitly.
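A minimal sketch of what such detection has to do, without the chardet dependency: this hypothetical `sniff_encoding` helper only sniffs the UTF-16/UTF-8 byte order marks rather than doing chardet's full statistical analysis, but it is enough to catch the UTF-16 case described here.

```python
import codecs

def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess an encoding from a leading BOM; fall back to `default`."""
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # the utf-16 codec consumes the BOM itself
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

body = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(sniff_encoding(body))   # utf-16
print(sniff_encoding(b"<html>"))  # utf-8 (no BOM, so the default wins)
```

BOM sniffing fails for BOM-less streams, which is why chardet instead scores byte-frequency statistics against candidate encodings.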

Therefore, by setting r.encoding = r.apparent_encoding, the request should decode the text correctly in both environments.

The code should look something like this:

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth(r'DOMAIN\USER', 'PASS'))
r.raise_for_status()              # Check for server errors before consuming the data
r.encoding = r.apparent_encoding  # Override the header-derived encoding with the detected one
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())            # Should now match print(content) (minus indentation)
