Python BeautifulSoup output unusual spacing and characters

Question

I'm new to Python.

I'm trying to parse data from a website using BeautifulSoup, which I have used successfully before. However, for this particular website the data returned has spaces between every character and lots of "&gt" characters as well.

The weird thing is that if I copy the page source, add it to my local Apache instance, and make a request to my local copy, the output is perfect. I should mention the differences between my local copy and the website:

my local copy does not use https
my local copy does not require authentication, whereas the website requires Active Directory auth and I am using requests_ntlm


import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)

Answer 1

It looks like the local server returns content encoded as UTF-8 while the main website uses UTF-16. That suggests the main website is not configured correctly. However, it is possible to work around the issue in code.
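To see why a UTF-16 body read with the wrong codec comes out with a gap after every character, here is a minimal offline sketch. The sample markup and variable names are made up for illustration, and it assumes the server sends little-endian UTF-16 with a byte-order mark, which the question does not confirm.

```python
import codecs

page = "<p>Hello</p>"  # hypothetical sample markup

# Bytes as a UTF-16 server might send them: a byte-order mark,
# then two bytes per character (the second byte is 0x00 for ASCII).
raw = codecs.BOM_UTF16_LE + page.encode("utf-16-le")

# Decoding with the wrong codec. errors="replace" substitutes U+FFFD
# for undecodable bytes instead of raising, so the mistake is silent.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the right codec; "utf-16" reads the BOM and drops it.
right = raw.decode("utf-16")

print(repr(wrong))  # every visible character is followed by a NUL
print(repr(right))
```

Many terminals render the NUL characters as blanks, which would match the "spaces between every character" symptom.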

requests chooses the encoding it uses for r.text from the response headers (I believe), and here that choice evidently does not match the UTF-16 bytes the server actually sends. The response also has a property called apparent_encoding, which inspects the body and guesses the encoding using chardet (or charset_normalizer in newer versions of requests). However, apparent_encoding is not used for decoding unless the headers yield no encoding at all or you assign it explicitly.

Therefore, by setting r.encoding = r.apparent_encoding, the response text should be decoded correctly in both environments.

Code should look something like:

import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
r.raise_for_status()  # Always check for server errors before consuming the data.
r.encoding = r.apparent_encoding  # Override the encoding taken from the headers
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # Should match print(content) (minus indentation)
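If the statistical guess ever turns out to be wrong for a page, checking for a byte-order mark is a more deterministic fallback. The sketch below uses only the standard library; sniff_encoding and decode_body are hypothetical helper names, not part of requests or BeautifulSoup, and the approach only helps when the server actually sends a BOM.

```python
import codecs

# (BOM, codec) pairs. The UTF-32 marks are checked first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, else the fallback."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return fallback


def decode_body(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw bytes with the sniffed codec; the BOM is consumed."""
    return raw.decode(sniff_encoding(raw, fallback), errors="replace")
```

With requests, this could be wired in as r.encoding = sniff_encoding(r.content, r.apparent_encoding) before reading r.text, so the detected encoding is only used when no BOM is present.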

solution1
0 ACCEPTED 2020-06-10 11:27:36
