Parsing unbalanced HTML files with Beautiful Soup 4

Question

I am parsing partial HTML files whose tags are not balanced.

Say the first line is missing from such a partial file. Can Beautiful Soup still parse the rest of the file, so that I can extract the information inside the different tags?

Thanks so much for the help.

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>

Answer 1

Use a more lenient parser such as lxml or html5lib (html5lib is more robust, but slower). The two parsers repair the missing tag differently, so the results will differ:

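from bs4 import BeautifulSoup

# lxml: the stray leading text is wrapped in <p> inside <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib: an empty <head> is added and the text goes straight into <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>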
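To address the second half of the question, here is a minimal sketch of pulling values out of the intact tags once the fragment has been parsed. It assumes a trimmed copy of the sample from the question held in a string (`partial_html` and the other variable names are illustrative), and it uses Python's built-in `html.parser` backend only so that no extra parser needs to be installed; `lxml` or `html5lib` can be substituted, though they place the orphaned title text in a different spot in the tree.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment from the question: the opening <title> is missing.
partial_html = """Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
"""

# html.parser is an assumption made for portability; it skips the stray </title>.
soup = BeautifulSoup(partial_html, "html.parser")

# Attributes of well-formed tags can be read as usual.
charset = soup.find("meta", attrs={"charset": True})["charset"]
viewport = soup.find("meta", attrs={"name": "viewport"})["content"]

# The CSS is the single string child of <style>.
css = soup.find("style").string

# No <title> element exists, so the orphaned text has to be read as a plain string.
title_tag = soup.find("title")
title_text = next(soup.stripped_strings)

print(charset)
print(viewport)
print(title_tag, repr(title_text))
```

The tags that are complete behave normally; only the text that belonged to the broken tag needs special handling, since there is no `<title>` element left to search for.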

Answer 1 was posted on 2017-01-23 18:56:11.
