
Are there any benefits of using Beautiful Soup to parse XML over using lxml alone?

Original post: 2015-07-10 23:34:45. Tags: python, xml, beautifulsoup, lxml

I often use Beautiful Soup to parse HTML files, so when I recently needed to parse an XML file, I chose it again. However, because the file is extremely large, parsing failed. While researching why it failed, I was led to this question: Loading huge XML files and dealing with MemoryError.

This leads me to my question: if lxml can handle large files and Beautiful Soup cannot, are there any benefits to using Beautiful Soup instead of simply using lxml directly?

2 Answers

If you look at this link about BeautifulSoup Parser:

"BeautifulSoup" is a Python package that parses broken HTML, while "lxml" is faster but expects high-quality, well-formed HTML/XML. So if you are dealing with broken markup, you are better off with BS. The advantage of having lxml, however, is that you also get soupparser (the lxml.html.soupparser module).

The link I provided at the top shows how you can use the capabilities of BS together with lxml.
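As a rough illustration of that combination, here is a minimal sketch that assumes both the lxml and beautifulsoup4 packages are installed. The broken markup string is a made-up sample, not something from the linked page. Beautiful Soup does the lenient parsing, and the result comes back as an ordinary lxml element tree.

```python
# Requires the third-party packages lxml and beautifulsoup4.
from lxml import etree
from lxml.html.soupparser import fromstring

# Deliberately broken markup (made-up sample): <p> and <b> are never closed.
broken_html = "<p>Unclosed <b>bold text"

# Beautiful Soup parses the tag soup; lxml receives a regular element tree.
root = fromstring(broken_html)

# From here on the usual lxml API applies, for example XPath queries.
bold_texts = root.xpath("//b/text()")

print(etree.tostring(root, encoding="unicode"))
print(bold_texts)
```

This route is only worth taking for markup that lxml's own parsers cannot cope with, since it gives up most of lxml's speed advantage.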

So in the end, you are better off with lxml.

lxml is very fast and relatively memory efficient. BeautifulSoup by itself scores less well on efficiency, but it is built to be compatible with non-standard or broken HTML and XML, which ultimately makes it more versatile.
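For very large files, memory use can be kept down further by streaming the document instead of building the whole tree at once. The following is a minimal sketch of that idea using iterparse. The `record` tag, the `count_records` helper, and the sample data are hypothetical. The import falls back to the standard library's ElementTree, which offers the same iterparse call for this simple case, if lxml is not installed.

```python
import io

try:
    from lxml import etree  # third-party, usually faster
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback


def count_records(source, tag="record"):
    """Count <tag> elements, discarding each one's content after it is seen."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text, attributes, and children to free memory.
            elem.clear()
    return count


# Tiny in-memory stand-in for a huge file (hypothetical data).
sample = io.BytesIO(
    b"<root><record id='1'>a</record><record id='2'>b</record>"
    b"<record id='3'>c</record></root>"
)
total = count_records(sample)
print(total)
```

In real use, `source` would be a file path or an open binary file rather than an in-memory buffer.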

Which one you choose really depends on your use case. Web scraping? Probably BS. Parsing machine-written structured metadata? lxml is a great choice.

There is also the learning curve to consider when making the switch. The two libraries implement search and navigation in slightly different ways, different enough that learning one after starting with the other is a non-trivial task.
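To give a feel for that difference, here is a small sketch that runs the same lookup in both libraries. It assumes lxml and beautifulsoup4 are installed, and the XML snippet is a made-up sample. Beautiful Soup searches with methods such as `find_all`, while lxml leans on XPath and ElementTree-style iteration.

```python
# Requires the third-party packages lxml and beautifulsoup4.
from bs4 import BeautifulSoup
from lxml import etree

# Made-up sample document.
xml = (
    b"<library><book><title>Dune</title></book>"
    b"<book><title>Emma</title></book></library>"
)

# Beautiful Soup: search by tag name with find_all.
# The "xml" feature uses lxml as the underlying parser.
soup = BeautifulSoup(xml, "xml")
bs_titles = [tag.get_text() for tag in soup.find_all("title")]

# lxml: the same lookup expressed as an XPath query...
root = etree.fromstring(xml)
lxml_titles = root.xpath("//title/text()")

# ...or as ElementTree-style iteration over matching elements.
iter_titles = [elem.text for elem in root.iter("title")]

print(bs_titles, lxml_titles, iter_titles)
```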