How to read the source HTML code from a locally saved HTML file using Python?

Question

I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks weird and here is a part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code is something like:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some thoughts?

Thank you!

Answer 1

Try print(soup.prettify()). The prettify() method is helpful here: it displays the HTML content in a formatted layout.

According to the documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

Source: Beautiful Soup Documentation
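
Note that prettify() only changes the layout of the parse tree; it does not change how the bytes of the file were decoded. The ÿþ at the start of the output in the question, together with the gap after every letter, is what UTF-16 text typically looks like when it is read with a single-byte encoding. The sketch below reproduces that symptom in memory. It assumes the file is UTF-16 little-endian with a byte order mark (the question does not confirm this), and the sample string is made up for illustration.

```python
import codecs

# Assumption: the saved file is UTF-16 little-endian with a byte order mark (BOM).
# This in-memory sample stands in for the first bytes of such a file.
raw = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")

# Decoding with a single-byte codec keeps the BOM as two visible characters
# and leaves a NUL character after every ASCII letter.
wrong = raw.decode("latin-1")
print(repr(wrong))

# Decoding as UTF-16 consumes the BOM and returns the real markup.
right = raw.decode("utf-16")
print(right)
```

If this is the cause, opening the file with an explicit encoding, for example open(file_path, encoding="utf-16"), is one thing to try before parsing.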

Answer 2

First of all, let's discuss why you are not getting the desired output. When Beautiful Soup parses the data, there may be extra spaces, symbols, and similar characters present in your code. The solutions for this scenario are stated below:

Needed solution: use soup.prettify()
Appropriate solution: use an HTML parser and soup.prettify() together


Approach 1 (using `soup.prettify()` in your current code):

# Import the parser class (missing from the original snippet)
from bs4 import BeautifulSoup

# Path to the HTML file
file_path = 'demo.html'

# Parse the HTML file with Beautiful Soup
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print the parsed HTML in prettified form
print(soup.prettify())

# Output of the above cell:
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Approach 2 (using an HTML parser and `soup.prettify()`):

# Import the required library
# (html5lib must be installed, but bs4 loads it by name, so no explicit import is needed)
from bs4 import BeautifulSoup

# Open the HTML file and parse it with the html5lib parser;
# the with block closes the file afterwards
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")

# Print the parsed HTML in prettified form
print(soup.prettify())

# Output of the above cell:
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>
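
If the printed result still starts with stray characters such as ÿþ, as in the question, the file may be stored in a different encoding than the one open() assumes. Below is a minimal sketch of reading the file as bytes and choosing the decoder from the byte order mark. The helper name read_html_text and its fallback parameter are hypothetical, the demo file is created only for illustration, and UTF-32 files are not handled.

```python
import codecs
import tempfile
from pathlib import Path


def read_html_text(path, fallback="utf-8"):
    """Hypothetical helper: decode an HTML file, honouring a UTF-16 or UTF-8 BOM."""
    data = Path(path).read_bytes()
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The utf-16 codec reads the BOM to pick the byte order, then drops it.
        return data.decode("utf-16")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    # Assumption: a file without a BOM is in the fallback encoding.
    return data.decode(fallback)


# Illustration only: write a small UTF-16 file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.html"
    demo.write_text("<html><head></head></html>", encoding="utf-16")
    text = read_html_text(demo)

print(text)
```

The returned string can be passed to BeautifulSoup in place of the file object, for example BeautifulSoup(text, "html5lib").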

Hope this solution helps you.

2 answers

solution1
1 2021-04-30 14:05:55

solution2
1 ACCEPTED 2021-04-30 14:30:53
