How can I open HTML files as UTF-8 in Python?
I am trying to open files as UTF-8 in Python. I have a list of paths to HTML files, and the code that builds the list works:
import glob
import os

def get_all_htmls(directory_path):
    # Lazily yield the path of every .html file directly inside directory_path
    return glob.iglob(os.path.join(directory_path, '*.html'))

directory_path = r'C:\Users\astar\Project\Articles\Articles'
links = []
for html_path in get_all_htmls(directory_path):
    links.append(html_path)
However, this code:
for link in links:
    f=codecs.open(r'link','r','utf-8')
    document= BeautifulSoup(f)
does not work for all of the HTML files. What can I do?
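If it works for some of your files but not all of them, that means some of the files are correctly encoded as UTF-8 while others probably use a different encoding (for example "ISO-8859-8", which is used for Hebrew). You don't say what goes wrong, which makes it hard to give you an exact answer in code, but if you are getting a UnicodeDecodeError exception on that call, you can write a loop that tries each suitable encoding until one succeeds: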
import codecs
from bs4 import BeautifulSoup

for link in links:
    for encoding in ("utf-8", "iso-8859-8", "latin-1"):
        try:
            # Use the encoding currently being tried, not a hard-coded 'utf-8'
            with codecs.open(link, 'r', encoding) as f:
                document = BeautifulSoup(f)
        except UnicodeDecodeError:
            print(f"{encoding} failed for {link}, trying next encoding")
        else:
            print(f"Successfully read {link} as an {encoding} file")
            break
    else:  # for-level else: entered only if no "break" statement was executed,
        # and therefore if no codec worked (although latin-1, in particular, will always succeed)
        print(f"Could not correctly read {link} with any of the available encodings. Skipping file")
        continue
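
The same fallback idea can be sketched without any third-party dependency by reading the raw bytes once and decoding them in memory. This is only a minimal illustration under stated assumptions: the helper name read_text_with_fallback, its default encoding list, and the temporary sample files are hypothetical and not part of the original question. Note also that the question's snippet passes the literal string r'link' to codecs.open, which names a file called "link" rather than using the loop variable; the sketch takes the path as an argument instead.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the whole file.

    Hypothetical helper for illustration; the encoding list is an assumption.
    """
    with open(path, "rb") as fh:
        raw = fh.read()  # read the bytes once, then try each decoding in memory
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Demonstration with two temporary files holding the same markup in different encodings.
hebrew_markup = "<p>\u05e9\u05dc\u05d5\u05dd</p>"  # Hebrew word "shalom" inside a paragraph
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, enc in {"utf8.html": "utf-8", "hebrew.html": "iso-8859-8"}.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(hebrew_markup.encode(enc))
        results[name] = read_text_with_fallback(path)
```

The returned text is an ordinary str, so it can be handed to BeautifulSoup in place of the file object. Because latin-1 accepts every byte sequence, keeping it last means the helper will practically never raise, but a file decoded that way may contain wrong characters, so it is worth logging which encoding was actually used.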