How do I read the source HTML code from a locally saved HTML file with Python?

Question

I am new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks strange; here is part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code looks like this:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some ideas?

Thanks!

Answer 1

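Try print(soup.prettify()). The prettify() method is helpful here: it displays the HTML content in a formatted form.

According to the documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

Source: Beautiful Soup documentation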
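To make the quoted behaviour concrete, here is a minimal sketch of what prettify() does to a tiny document. The inline HTML string and the explicit "html.parser" argument are assumptions chosen for illustration; they are not taken from the question.

```python
from bs4 import BeautifulSoup

# A tiny inline document (illustrative only, not the asker's file)
snippet = "<p>Hello</p>"

# Name the parser explicitly so the result does not depend on
# which optional parsers happen to be installed
soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a string with each tag and each text node on its own line
pretty = soup.prettify()
print(pretty)
```

Note that prettify() only changes the layout of the parsed tree. It does not change how the file was decoded when it was read.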

Answer 2

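First, let's discuss why you are not getting the output you want. It comes from how the data is parsed in BeautifulSoup: your code may contain some stray whitespace, symbols, and so on. The solutions for this situation are described below:

Required solution: use soup.prettify()
Appropriate solution: use an HTML parser together with soup.prettify()

To learn more about the HTML parser and soup.prettify(): click here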

Method 1 (using `soup.prettify()` in your current code):

# Import the required library
from bs4 import BeautifulSoup

# File path of the HTML file
file_path = 'demo.html'

# Parse the HTML code using BeautifulSoup
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print the HTML code in prettified form
print(soup.prettify())

# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Method 2 (using an HTML parser together with `soup.prettify()`):

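# Import the required libraries
from bs4 import BeautifulSoup
import html5lib  # not called directly; confirms the html5lib parser is installed

# Open the HTML file and parse it with the html5lib parser;
# the with block closes the file once parsing is done
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")

# Print the parsed HTML code
print(soup.prettify())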

# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

I hope this solution helps.
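One detail in the question's output is worth a closer look. The leading `ÿþ` is what the bytes FF FE look like when decoded as Latin-1, and FF FE is the UTF-16 little-endian byte order mark. The gaps between the letters are consistent with the zero bytes of UTF-16 text. The file was therefore likely saved as UTF-16 and opened with a different encoding. The sketch below checks for a byte order mark before decoding. It uses only the standard library, and the helper name `read_html_text` and the file name `demo_utf16.html` are hypothetical, introduced only for this example.

```python
import codecs
import tempfile
from pathlib import Path


def read_html_text(path, fallback="utf-8"):
    """Decode a file to text, using its byte order mark when one is present.

    Hypothetical helper for illustration; UTF-32 is not handled.
    """
    raw = Path(path).read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode(fallback)


html = "<html><head><title>Demo</title></head></html>"

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_utf16.html"
    # Write a UTF-16 LE file with a BOM (assumed to resemble the asker's file)
    demo.write_bytes(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

    # Decoding with the wrong encoding reproduces the leading "ÿþ"
    wrong = demo.read_bytes().decode("latin-1")

    # Decoding with BOM detection recovers the original markup
    text = read_html_text(demo)

print(repr(wrong[:8]))
print(text)
```

The decoded text can then be passed to BeautifulSoup, for example `BeautifulSoup(text, "html.parser")`. If the encoding is already known, `open(file_path, encoding="utf-16")` in the original code should have the same effect.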


Solution 1
1 2021-04-30 14:05:55

Solution 2
1 Accepted 2021-04-30 14:30:53
