How do I open HTML files as UTF-8 in Python?

Question

I am trying to open files as UTF-8 in Python. I have a list of paths to HTML files, and the code that builds the list works:

import glob
import os

def get_all_htmls(directory_path):
    # Lazily yield the path of every .html file directly inside directory_path
    return glob.iglob(os.path.join(directory_path, '*.html'))

directory_path = r'C:\Users\astar\Project\Articles\Articles'
links = []
for html_path in get_all_htmls(directory_path):
    links.append(html_path)

However, this code:

for link in links:
    f=codecs.open(r'link','r','utf-8')
    document= BeautifulSoup(f)

does not work for all of the HTML files. What can I do?

Answer 1

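If it works for some of your files but not all of them, that means some of the files are correctly encoded in UTF-8 while others are probably in a different encoding (for example "ISO-8859-8", used for Hebrew). You don't say what goes wrong, which makes it hard to give you an exact answer in code, but if you are getting a UnicodeDecodeError exception in that call, you can write a loop that tries each suitable encoding until one succeeds: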

import codecs
from bs4 import BeautifulSoup

for link in links:
    for encoding in ("utf-8", "iso-8859-8", "latin-1"):
        try:
            # Pass the variable link (not the literal string r'link') and the
            # encoding currently being tried, not a hard-coded 'utf-8'.
            f = codecs.open(link, 'r', encoding)
            document = BeautifulSoup(f)
        except UnicodeDecodeError:
            print(f"{encoding} failed for {link}, trying next encoding")
        else:
            print(f"Successfully read {link} as an {encoding} file")
            break
    else:  # for-level else, entered only if no "break" statement was executed,
           # i.e. no codec worked (although latin-1, in particular, will always succeed)
        print(f"could not correctly read {link} with any of the available encodings. skipping file")
        continue
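
The same fallback idea can also be packaged as a small function that reads the raw bytes once and decodes them in memory. This is only a minimal sketch: read_html_text and FALLBACK_ENCODINGS are hypothetical names chosen for illustration, and the list of encodings is an assumption you should adjust to whatever your files are likely to contain.

```python
from pathlib import Path

# Candidate encodings, tried in order (an assumed list, adjust to your data).
# latin-1 goes last because it accepts any byte sequence and so never fails.
FALLBACK_ENCODINGS = ("utf-8", "iso-8859-8", "latin-1")


def read_html_text(path, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes the file.

    Hypothetical helper for illustration; not part of any library.
    """
    raw = Path(path).read_bytes()  # read once, decode as many times as needed
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Inside the loop over links you could then call text, encoding = read_html_text(link) and hand text to BeautifulSoup. Keep in mind that a successful decode does not prove the guess was right: latin-1 in particular will "succeed" on any file, possibly producing the wrong characters.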
