Parse an unbalanced HTML file with Beautiful Soup 4

I am parsing partial HTML files whose tags are not balanced.

Say the opening tag on the first line is missing from this partial HTML file. Can Beautiful Soup still parse the rest of the file, so that I can extract the information inside the different tags?

Thanks so much for the help.

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>    

Use a more lenient parser such as lxml or html5lib (html5lib is more robust, but slower). Each one repairs the missing tag in its own way, so the resulting trees differ:

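from bs4 import BeautifulSoup

# lxml: wraps the stray leading text in <html><body><p>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib: builds the tree the way a browser would, adding an empty <head>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>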
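
Whichever parser is used, the tags that did survive can be queried as usual. The following is a minimal sketch rather than a definitive recipe: the `fragment` string is a shortened, hypothetical stand-in for foo.html, and it assumes Python's built-in "html.parser" so that it runs without extra installs ("lxml" or "html5lib" can be swapped in if they are available).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for foo.html: the opening <title> is missing.
fragment = """Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
"""

# Assumption: "html.parser" is used here only because it ships with Python.
soup = BeautifulSoup(fragment, "html.parser")

# The orphaned title text is still the first non-empty string in the tree.
title_text = next(soup.stripped_strings)

# "name" is also an argument of find(), so the attribute goes through attrs=.
viewport = soup.find("meta", attrs={"name": "viewport"})["content"]

# The raw CSS is the single string child of the <style> tag.
css = soup.find("style").string

print(title_text)
print(viewport)
print(css.strip())
```

Because the parsers disagree on where the orphaned text ends up, looking tags up by name or attribute, as above, tends to be more stable than relying on their position in the tree.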
