I'm trying to retrieve the charset from a webpage (the page will change all the time). At the moment I'm using BeautifulSoup to parse the page and then extract the charset from the header. This was working fine until I ran into a site that had:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
My code up until now, which was working with other pages, is:
def get_encoding(soup):
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            encod = soup.meta.get('content')
    return encod
Would anyone have a good idea about how to add to this code to retrieve the charset from the above example? Would tokenizing it and trying to retrieve the charset that way be an idea, and how would you go about it without having to change the whole function? Right now the above code returns "text/html; charset=utf-8", which causes a LookupError because that string is not a known encoding.
Thanks
The final code that I ended up using:
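import re
import chardet

def get_encoding(soup):
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            # pull the charset out of e.g. "text/html; charset=utf-8"
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                # no charset in the tag: fall back to guessing with chardet
                # (unicode() is the Python 2 built-in)
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod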
import re

def get_encoding(soup):
    if soup and soup.meta:
        encod = soup.meta.get('charset')
        if encod == None:
            encod = soup.meta.get('content-type')
            if encod == None:
                content = soup.meta.get('content')
                match = re.search('charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    raise ValueError('unable to find encoding')
    else:
        raise ValueError('unable to find encoding')
    return encod
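
A note on the pattern used above: `charset=(.*)` is greedy, so it also captures whatever follows the charset name in the `content` value, such as a further `; ` parameter. The sketch below is one possible tightening and is not part of the original answer; the helper name `charset_from_content` is made up for illustration. It matches only characters that are plausible in an encoding name, then uses `codecs.lookup` to check that Python knows the result, which should avoid the kind of LookupError described in the question.

```python
import codecs
import re

# Hypothetical helper for illustration, not part of the original answer.
# Allows optional whitespace and an optional opening quote after "charset=".
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

def charset_from_content(content):
    """Return a codec name Python recognises, or None if none is found."""
    if not content:
        return None
    match = _CHARSET_RE.search(content)
    if not match:
        return None
    try:
        # codecs.lookup raises LookupError for unknown encodings
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None

print(charset_from_content("text/html; charset=utf-8"))
```

Returning None instead of raising leaves the caller free to fall back to something else, for example a detection library.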
In my case soup.meta only returns the first meta-tag found in the soup. Here is @Fruit's answer extended to find the charset in any meta-tag within the given html.
from bs4 import BeautifulSoup
import re

def get_encoding(soup):
    encoding = None
    if soup:
        for meta_tag in soup.find_all("meta"):
            encoding = meta_tag.get('charset')
            if encoding: break
            else:
                encoding = meta_tag.get('content-type')
                if encoding: break
                else:
                    content = meta_tag.get('content')
                    if content:
                        match = re.search('charset=(.*)', content)
                        if match:
                            encoding = match.group(1)
                            break
    if encoding:
        # cast to str if type(encoding) == bs4.element.ContentMetaAttributeValue
        return str(encoding).lower()

# html is assumed to hold the page markup fetched elsewhere
soup = BeautifulSoup(html, 'html.parser')
print(get_encoding(soup))
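
On the question of tokenizing the `content` value: as an alternative to a regular expression, the standard library's `email.message.Message` can parse a Content-Type style string into its parameters. This is only a sketch of that idea, assuming the value has the usual `type/subtype; charset=...` shape; the function name `charset_from_header_value` is hypothetical.

```python
from email.message import Message

def charset_from_header_value(value):
    """Parse a value such as 'text/html; charset=utf-8' and return the charset."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the value carries no charset parameter.
    return msg.get_content_charset()

print(charset_from_header_value("text/html; charset=utf-8"))
```

The result could be dropped into the `content` branch of either function above in place of the `re.search` call.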