Finding a specific class with Beautiful Soup

Question

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I fetch the web page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I try the find_all() function, I get no results:

results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the bit I want, for example:

<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div>s with the class home-summary-row. So my question is: why do I get no results when I search the full page?
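To make the reduced case concrete, here is a minimal sketch that runs the same search against the trimmed markup above. It inlines the HTML so no network access is needed, and it assumes the built-in html.parser rather than the html5lib parser used in the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Reduced markup from the question, inlined so the example runs offline.
html = """
<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>
"""

# Assumption: html.parser is used here because it ships with Python.
soup = BeautifulSoup(html, "html.parser")

# class is a multi-valued attribute, so this matches any <div> whose
# class list contains "home-summary-row", whatever else is in the list.
results = soup.find_all("div", attrs={"class": "home-summary-row"})
print(len(results))
```

Both <div>s match because Beautiful Soup compares the search value against each individual class name, not against the whole attribute string.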

Working example:

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)

Answer 1

Your HTML is not well-formed, and in that case choosing the right parser is crucial. BeautifulSoup currently supports 3 HTML parsers, each of which handles broken HTML differently:

html.parser (built-in, no additional modules needed)
lxml (the fastest, requires lxml to be installed)
html5lib (the most lenient, requires html5lib to be installed)

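The "Differences between parsers" section of the documentation describes the differences in more detail. In your case, to demonstrate the difference: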

>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case both html.parser and lxml do the job, but html5lib does not.
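If you want to repeat this comparison on your own markup, something like the following sketch may help. The helper name count_by_parser and the sample string are made up for illustration; the helper records None for any parser that is not installed, so it still runs with only the built-in html.parser available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(html, class_name,
                    parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser: number of matching <div>s}, or None if not installed."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser library is missing.
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("div", class_=class_name))
    return counts


# Hypothetical, well-formed sample; every installed parser should agree on it.
sample = ('<div class="home-summary-row"></div>'
          '<div class="for-sale-row home-summary-row"></div>')
counts = count_by_parser(sample, "home-summary-row")
print(counts)
```

On well-formed input the parsers should agree; the counts only start to diverge, as they do above, once the markup is broken enough for each parser to repair it differently.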

Answer 2

import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "home-summary-row"})

print(g_data[1].text)

#for item in g_data:
#        print(item("span")[0].text)
#        print('\n')

I got this working too, but it looks like someone beat me to it.

Going to post it anyway.
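Indexing g_data[1] assumes the price is always in the second matching row. A slightly more defensive sketch is shown below; it runs on the reduced markup from the question instead of the live page, and the "starts with $" check is only an assumption about how the price text looks.

```python
from bs4 import BeautifulSoup

# Reduced markup from the question, inlined so the example runs offline.
html = ('<div class=" status-icon-row for-sale-row home-summary-row"></div>'
        '<div class=" home-summary-row">'
        '<span class=""> $1,342,144 </span></div>')

soup = BeautifulSoup(html, "html.parser")

prices = []
for row in soup.find_all("div", class_="home-summary-row"):
    span = row.find("span")
    if span is None:
        continue  # this row has no <span>, so it cannot hold a price
    text = span.get_text(strip=True)
    if text.startswith("$"):
        # Drop the currency sign and thousands separators before converting.
        prices.append(int(text.lstrip("$").replace(",", "")))

print(prices)
```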

Answer 3

According to the W3.org Validator, the HTML has a number of problems, such as stray closing tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

This kind of markup can make it much harder for BeautifulSoup to parse the HTML.

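You may want to try running something to clean up the HTML, such as removing the line breaks and trailing whitespace from the end of each line. BeautifulSoup can also tidy up the HTML tree for you: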

from bs4 import BeautifulSoup
# bad_html holds the raw page source as a string
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
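As a rough illustration of that clean-up step, the sketch below feeds a deliberately truncated fragment through prettify() and then parses the result again. The fragment is invented for the example, and html.parser is an assumed choice; other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Hypothetical broken input: neither the <span> nor the <div> is closed.
bad_html = '<div class="home-summary-row"><span> $1,342,144 '

tree = BeautifulSoup(bad_html, "html.parser")
# prettify() serialises the repaired tree, including the closing tags
# that were missing from the input.
good_html = tree.prettify()
print(good_html)

# The cleaned-up markup can then be parsed and searched as usual.
reparsed = BeautifulSoup(good_html, "html.parser")
rows = reparsed.find_all("div", class_="home-summary-row")
print(len(rows))
```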

3 solutions

Solution 1
5 2017-01-17 02:29:51

Solution 2
4 2017-01-17 01:18:58

Solution 3
3 Accepted 2017-01-17 01:11:07
