
Python BeautifulSoup output unusual spacing and characters

I'm new to Python.

I'm trying to parse data from a website using BeautifulSoup, which I have used successfully before. However, for this particular website the returned data has spaces between every character and lots of "&gt" characters as well.

The weird thing is that if I copy the page source, add it to my local Apache instance, and make a request to that local copy, the output is perfect. I should mention the differences between my local copy and the website:

  1. My local copy does not use HTTPS.
  2. My local copy does not require authentication, whereas the website requires Active Directory auth, for which I am using requests_ntlm.

import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

# The backslash is escaped: '\U' in a plain string literal is a syntax error in Python 3.
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)

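It looks like the local server returns content encoded as UTF-8, while the main website uses UTF-16. Decoding UTF-16 bytes with a single-byte encoding leaves a stray null character after each real character, which shows up as the "spaces" in the output. This suggests the main website is not configured correctly (it probably does not declare its charset). However, it is possible to work around the issue in code.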
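To illustrate the symptom without touching the network, here is a minimal sketch using only the standard library. The sample markup is made up, since the real page content is not shown in the question; it simply encodes a string as UTF-16 and decodes it the wrong way and the right way.

```python
# Hypothetical sample markup standing in for the real page.
html = "<p>Hello</p>"

# Bytes roughly as a UTF-16 server would send them (Python adds a byte order mark).
raw = html.encode("utf-16")

# Decoding with a single-byte encoding keeps every null byte as its own character.
wrong = raw.decode("iso-8859-1")

# Decoding with the matching encoding recovers the original text.
right = raw.decode("utf-16")

print(repr(wrong))  # interleaved '\x00' characters, plus two BOM characters up front
print(repr(right))
```

Printed to a terminal, those null characters tend to render as gaps between letters, which is consistent with the output described in the question.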

requests takes the encoding used for r.text from the response headers. When a text response declares no charset, it falls back to a default (ISO-8859-1, as far as I know) instead of inspecting the body. The response also has a property called apparent_encoding, which examines the body and guesses the encoding using chardet (or charset_normalizer in newer versions). However, apparent_encoding is not used when the headers have already produced an encoding, unless you apply it yourself.

Therefore, by setting r.encoding = r.apparent_encoding, the response text should decode correctly in both environments.

Code should look something like:

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
r.raise_for_status()  # Always check for server errors before consuming the data.
r.encoding = r.apparent_encoding  # Override the encoding guessed from the headers
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # Should match print(content) (minus indentation)
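If the detector's guess turns out to be unreliable for this site, one alternative is to look at the raw bytes (r.content) for a byte order mark before decoding. The sketch below is only an illustration: decode_with_bom is a hypothetical helper, not part of requests or BeautifulSoup, and it assumes the server actually sends a BOM and that only UTF-8 and UTF-16 need to be told apart.

```python
import codecs

# Hypothetical helper, not part of requests or BeautifulSoup.
def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes using a leading byte order mark if present, else the fallback."""
    if raw.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' strips the UTF-8 BOM while decoding.
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM to pick the byte order and strips it.
        return raw.decode("utf-16")
    # No BOM found: assume the fallback encoding and replace undecodable bytes.
    return raw.decode(fallback, errors="replace")
```

With a response object in hand, this would be used as content = decode_with_bom(r.content) in place of r.text.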


 