I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:
with open(file_path) as fp:
    soup = BeautifulSoup(fp)
print(soup)
The output looks weird and here is a part of it:
<html><body><p>ÿþh t m l >
h e a d >
m e t a h t t p - e q u i v = C o n t e n t - T y p e c o n t e n t = " t e x t / h t m l ; c h a r s e t = u n i c o d e " >
m e t a n a m e = G e n e r a t o r c o n t e n t = " M i c r o s o f t W o r d 1 5 ( f i l t e r e d ) " >
s t y l e >
! - -
/ * F o n t D e f i n i t i o n s * /
The original HTML code is something like:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
Can anyone help me or share some thoughts?
Thank you!
Try print(soup.prettify()). The prettify method is helpful and displays the formatted HTML content.
According to the documentation:
"The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string."
Source: Beautiful Soup Documentation
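One more thing worth checking, since prettify() only changes formatting: the `ÿþ` at the start of the output in the question, plus the gap between every character, is what a UTF-16 file looks like when it is decoded with a single-byte encoding. Word's "filtered" HTML export with charset=unicode is often UTF-16, but that is an assumption here, not something confirmed in the question. A minimal sketch of the symptom and the fix, using only the standard library (the sample markup and the temp file are made up for illustration):

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the Word-exported file: a small page saved as
# UTF-16 little-endian with a byte-order mark (BOM) in front.
html = '<html><head><meta name=Generator content="Microsoft Word 15 (filtered)"></head></html>'
file_path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(file_path, "wb") as fp:
    fp.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

# Decoding with a single-byte encoding reproduces the symptom: the BOM shows
# up as two stray characters and every letter is followed by a NUL character.
with open(file_path, encoding="latin-1") as fp:
    wrong = fp.read()

# Decoding as UTF-16 consumes the BOM and yields the original markup.
with open(file_path, encoding="utf-16") as fp:
    right = fp.read()

print(repr(wrong[:12]))
print(right)
```

If that is the cause, opening the real file with open(file_path, encoding="utf-16") before handing it to BeautifulSoup should give readable output.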
First of all, let's discuss why you are not getting the desired output. When BeautifulSoup parses the data, there might be some stray spaces, symbols, etc. present in your code. The appropriate solutions for this scenario are stated below:
Using soup.prettify()
Using an HTML parser and soup.prettify() together
Solution 1 (using soup.prettify() in your current code):
# Import BeautifulSoup
from bs4 import BeautifulSoup
# File path of the HTML file
file_path = 'demo.html'
# Parse the HTML file with BeautifulSoup, naming the parser explicitly
with open(file_path) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# Print the parsed HTML in prettified form
print(soup.prettify())
# Output of the above cell:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Microsoft Word 15 (filtered)" name="Generator"/>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
</style>
</head>
</html>
Solution 2 (using an HTML parser and soup.prettify() together):
# Import the required libraries
from bs4 import BeautifulSoup
import html5lib  # not used directly; confirms the html5lib parser is installed
# Open the HTML file and parse it with the html5lib parser
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")
# Print the parsed HTML code
print(soup.prettify())
# Output of the above cell:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Microsoft Word 15 (filtered)" name="Generator"/>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
</style>
</head>
</html>
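If either solution still prints `ÿþ` followed by spaced-out letters, the file is probably not in the encoding Python assumed when opening it. A rough sketch of checking for a byte-order mark before decoding; `sniff_encoding` and `read_html_text` are hypothetical helper names, not part of Beautiful Soup, the demo file is made up, and UTF-32 is deliberately ignored:

```python
import codecs
import os
import tempfile

# Byte-order marks to look for, paired with a Python codec that strips them.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on the leading BOM, or `default` if none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default

def read_html_text(path: str) -> str:
    """Read a file as bytes and decode it with the sniffed encoding."""
    with open(path, "rb") as fp:
        raw = fp.read()
    return raw.decode(sniff_encoding(raw))

# Demo with a made-up UTF-16 file, standing in for the Word export.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(demo_path, "wb") as fp:
    fp.write(codecs.BOM_UTF16_LE + "<html><head></head></html>".encode("utf-16-le"))

text = read_html_text(demo_path)
print(text)
```

The returned string can then be parsed as before, for example BeautifulSoup(read_html_text('demo.html'), 'html.parser').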
Hope this solution helps you.