
How to read the source HTML code from a locally saved HTML file using Python?

I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks weird; here is part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code is something like this:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some thoughts?

Thank you!

Try print(soup.prettify()). The prettify() method is helpful and displays the formatted HTML content.

According to the documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.

Source: Beautiful Soup Documentation
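To make the quoted behaviour concrete, here is a minimal sketch comparing str(soup) with soup.prettify(). The markup string is a made-up example rather than the asker's file, and the built-in "html.parser" is chosen here only so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)       # the whole tree on a single line
pretty = soup.prettify()  # one line per tag and per string, indented by depth

print(compact)
print(pretty)
```

Note that prettify() only changes the layout of the printed tree. It does not change how the file was decoded before parsing.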

First of all, let's discuss why you are not getting the desired output. When BeautifulSoup parses the data, there might be some stray spaces, symbols, and similar characters in your code. The appropriate solutions for this scenario are listed below:-

  • Needed solution:- Use soup.prettify()
  • Appropriate solution:- Use an HTML parser and soup.prettify() together

To learn more about the HTML parser and soup.prettify():- Click Here


Approach 1 (using soup.prettify() in your current code):-

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Path to the HTML file
file_path = 'demo.html'

# Parse the HTML file with BeautifulSoup
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print the HTML code in prettified form
print(soup.prettify())
# Output:
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Approach 2 (using an HTML parser and soup.prettify()):-

# Import BeautifulSoup (the html5lib package must also be installed)
from bs4 import BeautifulSoup

# Open the HTML file and parse it with the html5lib parser;
# the with block closes the file once parsing is done
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")

# Print the parsed HTML code
print(soup.prettify())
# Output:
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>
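
If the output still looks like the sample in the question, the file encoding may be worth checking. A leading `ÿþ` followed by a gap after every character is what a UTF-16 file with a byte order mark (BOM) looks like when it is decoded with a single-byte codec such as cp1252, and the `charset=unicode` tag written by Word points the same way. The sketch below illustrates this with the standard library only, under that assumption. The temporary `demo.html` and its content are hypothetical stand-ins, not the asker's file.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the Word export.
html = "<html><head><title>Demo</title></head></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "demo.html")

    # Write a little-endian UTF-16 file that starts with a byte order mark.
    with open(file_path, "wb") as fp:
        fp.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

    # A single-byte codec turns the BOM into "ÿþ" and leaves a NUL after each character.
    with open(file_path, encoding="cp1252") as fp:
        garbled = fp.read()

    # The "utf-16" codec consumes the BOM and restores the original markup.
    with open(file_path, encoding="utf-16") as fp:
        text = fp.read()

print(repr(garbled[:6]))
print(text)
```

If the file really is UTF-16, passing encoding="utf-16" to open() in either approach above should hand BeautifulSoup correctly decoded text.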

Hope this solution helps you.
