
Parse an unbalanced HTML file with Beautiful Soup 4

I am parsing partial HTML files whose tags are not balanced.

Say the first line of the partial file is missing its opening tag, as in the example below. Can Beautiful Soup still parse the rest of the file, so that I can extract the information inside the other tags?

Thanks so much for the help.

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>    

Yes. Use a more advanced parser such as lxml or html5lib (html5lib is more robust, but slower). The resulting trees differ between parsers:

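from bs4 import BeautifulSoup

# lxml wraps the stray leading text in a <p> inside <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib adds an empty <head> and leaves the text directly inside <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>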
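
To show that the remaining tags stay reachable after parsing, here is a minimal sketch. It makes two assumptions that are not in the original answer: the `fragment` string is a shortened, hypothetical stand-in for `foo.html`, and it uses Python's built-in `html.parser` backend so that no extra parser needs to be installed. With `lxml` or `html5lib` the same lookups should work, although the surrounding tree will be shaped differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for foo.html: the opening <title> tag is missing.
fragment = """Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
"""

# Assumption: the built-in 'html.parser' backend, chosen only to avoid extra dependencies.
soup = BeautifulSoup(fragment, "html.parser")

# The unmatched </title> is ignored, so no <title> element exists,
# but its text survives as the first string in the tree.
title_text = soup.find(string=True).strip()

# Attributes of the well-formed tags are available as usual.
charset = soup.find("meta", attrs={"charset": True})["charset"]
viewport = soup.find("meta", attrs={"name": "viewport"})["content"]

# The stylesheet body is the single string child of <style>.
css = soup.find("style").string

print(title_text, charset, viewport, sep="\n")
```

Because the orphaned text is not wrapped in a `<title>` element, it has to be looked up as a plain string rather than through `soup.title`.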

