BeautifulSoup behaves inconsistently

Question

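I am completely baffled by the behavior of the following HTML-scraping code when it is run in two different environments, and I need help finding the root cause of the discrepancy.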

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On machine 1, this returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630

On machine 2, the exact same code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

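The number of contigs counted differs, even though the MD5 sums show that both machines downloaded the same HTML string. The same code parses the same HTML table and yet produces different results in two environments that, unfortunately, do not differ significantly from each other. Manual inspection confirms that the result returned on machine 2 is incorrect, but so far I cannot explain why.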

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?

Answer 1

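You are seeing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you specified. Which parser gets picked depends on which modules are available in the current Python environment: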

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
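One way to check which parser each machine would end up with is to list the parser libraries that can be imported there. This is only a minimal sketch, assuming Python 3 (the code in the question is Python 2). `installed_parsers` and `PARSER_MODULES` are hypothetical names made up for illustration, not part of Beautiful Soup, and the ordering simply follows the ranking quoted above.

```python
import importlib.util

# Parser names accepted by BeautifulSoup, paired with the module each one needs.
# Listed in the preference order quoted from the documentation.
PARSER_MODULES = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library, so always importable
]


def installed_parsers():
    """Return the parser names whose backing module can be found here."""
    return [
        name
        for name, module in PARSER_MODULES
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    # Run this on both machines and compare the output.
    print("Parsers available:", installed_parsers())
```

If the two machines print different lists, the first entry on each is the likely candidate for what an ambiguous feature string such as 'html' resolves to.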

To get consistent behavior across platforms, name the parser explicitly, using one of:

soup = BeautifulSoup(html, "html.parser")
soup = BeautifulSoup(html, "html5lib")
soup = BeautifulSoup(html, "lxml")

See also: Installing a parser.
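To see the parser differences directly, the same markup can be fed to each explicitly named parser and the resulting trees compared. The following is a rough sketch, assuming Python 3 with bs4 installed. `parse_with_each` and the sample markup are made up for illustration and are not taken from the page in the question. A parser that is not installed is recorded as `None`, relying on the `FeatureNotFound` exception that BeautifulSoup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a stray closing tag with no matching opener.
MALFORMED = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Parse markup once per named parser and return the serialized trees.

    Parsers that are not installed map to None instead of raising.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None  # this parser is missing in this environment
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each(MALFORMED).items():
        print("%-12s %s" % (name, tree))
```

Because an explicit parser name raises an error when that parser is missing, a script written this way fails loudly on a differently configured machine instead of silently producing a different tree.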

Answer 1: score 4, accepted, 2015-09-18 05:40:20
