I'm new to Python.
I'm trying to parse data from a website using BeautifulSoup, which I have used successfully before. For this particular website, however, the returned data has a space between every character, along with lots of ">" characters.
The strange thing is that if I copy the page source to my local Apache instance and make the request against my local copy instead, the output is perfect. I should mention that my request to the real website differs from the local one in that it uses NTLM authentication:
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))  # backslash must be escaped
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)
It looks like your local server returns content encoded as UTF-8, while the main website serves UTF-16. That suggests the main website is not configured correctly (its headers don't match its actual encoding), but it's possible to work around the issue in code.
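That would explain the symptom: when UTF-16 bytes are decoded with a one-byte codec, every other byte is a NUL, which typically renders as a space between the characters. A minimal sketch of the effect:

```python
# UTF-16-LE stores each ASCII character as two bytes: the character
# followed by a NUL byte.
raw = "hello".encode("utf-16-le")   # b'h\x00e\x00l\x00l\x00o\x00'

# Decoding with a one-byte codec keeps the NULs, which often show up
# as spaces between the characters.
garbled = raw.decode("latin-1")
print(repr(garbled))                # 'h\x00e\x00l\x00l\x00o\x00'

# Decoding with the correct codec recovers the original text.
print(raw.decode("utf-16-le"))      # hello
```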
requests picks the encoding for r.text from the response headers (for text content with no declared charset it falls back to ISO-8859-1), so a misconfigured server leads to a wrongly decoded body. The response also has a property called apparent_encoding, which reads the body and detects the likely encoding using chardet; however, apparent_encoding is never applied unless you assign it explicitly.
Therefore, by setting r.encoding = r.apparent_encoding, the request should decode the text correctly in both environments.
Code should look something like:
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
r.raise_for_status()              # check for server errors before consuming the data
r.encoding = r.apparent_encoding  # override the header-derived encoding with the detected one
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())            # should now match print(content) (minus indentation)
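You can check the effect of the override without hitting the server (WEBSITE is a placeholder anyway). The sketch below builds a requests.Response by hand to simulate a server whose headers claim UTF-8 while the body is really UTF-16:

```python
import requests

# Simulate a misconfigured server: the body is UTF-16 (with a BOM),
# but the headers led requests to assume UTF-8.
resp = requests.models.Response()
resp._content = "hello world".encode("utf-16")
resp.encoding = "utf-8"

print(repr(resp.text))  # garbled: NULs and replacement characters

# Apply the detected encoding instead, as in the fix above.
resp.encoding = resp.apparent_encoding
print(resp.text)        # hello world
```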