Parse an unbalanced HTML file with Beautiful Soup 4

Question

I am parsing partial HTML files whose tags are not balanced.

Say the opening tag on the first line is missing, as in the partial HTML file below. Can Beautiful Soup still parse the rest of the file, so that I can extract the information inside the other tags?

Thanks so much for the help.

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>

Answer 1

Yes. Pass a lenient parser such as lxml or html5lib to BeautifulSoup (html5lib is more robust, but slower). Each parser repairs the missing tag differently, so the resulting trees differ:

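from bs4 import BeautifulSoup

# lxml wraps the stray text in <p> and discards the orphaned </title>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib follows the HTML5 parsing rules and adds an empty <head>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>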
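
Once the soup is built, the well-formed tags can be queried as usual. Here is a minimal sketch, assuming a shortened, made-up fragment modelled on the one in the question. It uses Python's built-in "html.parser" only so that it runs without extra packages; "lxml" or "html5lib" should be drop-in replacements for these lookups, although the tree around the stray text will differ as shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question:
# the opening <title> tag is missing from the first line.
fragment = '''Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
'''

# "html.parser" ships with Python; swap in "lxml" or "html5lib" if installed.
soup = BeautifulSoup(fragment, "html.parser")

# Attributes of the intact tags are still available.
charset = soup.find("meta", attrs={"charset": True})["charset"]
viewport = soup.find("meta", attrs={"name": "viewport"})["content"]

# The stylesheet text is the single string child of <style>.
css = soup.find("style").string

# The orphaned </title> has no matching opening tag,
# so there is no <title> element to find.
title = soup.find("title")

print(charset)
print(viewport)
print(title)
```

The text that belonged to the missing tag ("Example Domain") stays in the tree as plain text, so it can only be recovered by position, not by tag name.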

solution1
0 2017-01-23 18:56:11
