Python HTML parsing with Beautiful Soup returns an empty list

Question

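I'm currently working on a project that involves web scraping a real estate website (for educational purposes). I'm pulling data from the home listings, such as address, price, bedrooms, etc.

I built and tested the scraper with the print function, and it worked. Now I'm building a dictionary for each listing's data points and storing those dictionaries in a list, so that I can eventually create a table with Pandas and export it to CSV.

Here is my problem. My list shows empty dictionaries, with no errors. Note that the scraping itself succeeds: I can see the data when I use the print function. But after adding each data point to a dictionary and appending it to the list, nothing shows up. Here is my code: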

import requests
from bs4 import BeautifulSoup

r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content

soup=BeautifulSoup(c,"html.parser")

all=soup.find_all("div", {"class":"infinite-item"})
all[0].find("a",{"class":"listing-price"}).text.replace("\n","").replace(" ","")

l=[]
for item in all:
    d={} 
    try: 
        d["Price"]=item.find("a",{"class":"listing-price"}.text.replace("\n","").replace(" ",""))
        d["Address"]=item.find("div",{"class":"property-address"}).text.replace("\n","").replace(" ","")
        d["City"]=item.find_all("div",{"class":"property-city"})[0].text.replace("\n","").replace(" ","")
        try: 
            d["Beds"]=item.find("div",{"class":"property-beds"}).find("strong").text
        except: 
            d["Beds"]=None
        try: 
            d["Baths"]=item.find("div",{"class":"property-baths"}).find("strong").text
        except: 
            d["Baths"]=None
        try: 
            d["Area"]=item.find("div",{"class":"property-sqft"}).find("strong").text
        except: 
             d["Area"]=None
    except: 
        pass
    l.append(d)

When I inspect l (the list that holds my dictionaries), this is what I get:

[{},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {}]

I'm using Python 3.8.2 and Beautiful Soup 4. Any ideas or help would be greatly appreciated. Thanks!

Answer 1

This does what you want more concisely and is more Pythonic (using a dict comprehension nested inside a list comprehension):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c = r.content

soup = BeautifulSoup(c, "html.parser")

css_classes = [
    "listing-price",
    "property-address",
    "property-city",
    "property-beds",
    "property-baths",
    "property-sqft",
]

pl = [{css_class.split('-')[1]: item.find(class_=css_class).text.strip() # assumes every card has each class; find() returns None otherwise and .text then raises AttributeError
       for css_class in css_classes} # find each class in the class list
       for item in soup.find_all('div', class_='property-card-primary-info')] # find each property card div

print(pl)

Output:

[{'address': '512 Silver Oak Grove',
  'baths': '6 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$1,595,000',
  'sqft': '6,958 sq. ft'},
 {'address': '8910 Edgefield Drive',
  'baths': '5 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$499,900',
  'sqft': '4,557 sq. ft'},
 {'address': '135 Mayhurst Avenue',
  'baths': '3 baths',
  'beds': '3 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$420,000',
  'sqft': '1,889 sq. ft'},
 {'address': '7925 Bard Court',
  'baths': '4 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$405,000',
  'sqft': '3,077 sq. ft'},
 {'address': '7641 N Sioux Circle',
  'baths': '3 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80915',
  'price': '$389,900',
  'sqft': '3,384 sq. ft'},
 ...
]
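
As a side note on why the loop in the question yields empty dictionaries: on the Price line, the closing parenthesis of find() sits at the end of the line, so .text is looked up on the attrs dict {"class":"listing-price"} rather than on the tag that find() returns. That raises AttributeError before find() is even called, and the outer bare except: pass hides it, so d stays empty for every listing. A minimal sketch of that failure mode, using only the standard library (the errors list is added here purely for illustration and is not part of the original code):

```python
# Reproduces the failure mode of the Price line without any network access.
attrs = {"class": "listing-price"}

d = {}
errors = []  # illustrative only: collects what a bare "except: pass" would hide
try:
    # Same shape as the question: .text is looked up on the dict, not on a tag.
    d["Price"] = attrs.text.replace("\n", "").replace(" ", "")
    d["Address"] = "never reached"
except AttributeError as exc:
    errors.append(exc)

print(d)       # the dictionary stays empty
print(errors)  # the swallowed exception is now visible
```

Moving the parenthesis so that the call reads item.find("a",{"class":"listing-price"}).text gives the Price line the same shape as the Address and City lines in the question.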

Answer 2

You should use a function to do the repetitive work. It will make your code clearer. I reworked the code, and this version works:

import requests
from bs4 import BeautifulSoup

def find_div_and_get_value(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).text.replace("\n","").strip()

def find_div_and_get_value2(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).find('strong').text.replace("\n","").strip()


r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup = BeautifulSoup(c,"html.parser")
houses = soup.find_all("div", {"class":"infinite-item"})  # find_all is the current name; findAll is a legacy alias

l=[]
for house in houses:
    try:
        d = {}
        d["Price"] = find_div_and_get_value(house, 'a', {"class": "listing-price"})
        d["Address"] = find_div_and_get_value(house, 'div', {"class": "property-address"})
        d["City"] = find_div_and_get_value(house, 'div', {"class":"property-city"})
        d["Beds"] = find_div_and_get_value2(house, 'div', {"class":"property-beds"})
        d["Baths"] = find_div_and_get_value2(house, 'div', {"class":"property-baths"})
        d["Area"] = find_div_and_get_value2(house, 'div', {"class":"property-sqft"})
        l.append(d)
    except AttributeError:  # find() returned None, i.e. a field is missing from this card
        break

for house in l:
    print(house)
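
One caveat with the loop above: the break stops at the first listing that lacks any field, whereas the question wanted missing beds, baths, or area recorded as None. A possible variation is a helper that returns None when the element is absent. The sketch below is only an illustration under stated assumptions: text_or_none is a hypothetical helper, and the inline HTML is made up to mimic the class names used above rather than copied from the site.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names from the question; the second
# card deliberately has no beds element.
SAMPLE_HTML = """
<div class="infinite-item">
  <a class="listing-price"> $499,900 </a>
  <div class="property-address"> 8910 Edgefield Drive </div>
  <div class="property-beds"><strong>5</strong> beds</div>
</div>
<div class="infinite-item">
  <a class="listing-price"> $420,000 </a>
  <div class="property-address"> 135 Mayhurst Avenue </div>
</div>
"""


def text_or_none(parent, tag_name, css_class, child=None):
    """Return the stripped text of the first match, or None if it is absent."""
    node = parent.find(tag_name, class_=css_class)
    if node is not None and child is not None:
        node = node.find(child)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = []
for house in soup.find_all("div", class_="infinite-item"):
    listings.append({
        "Price": text_or_none(house, "a", "listing-price"),
        "Address": text_or_none(house, "div", "property-address"),
        "Beds": text_or_none(house, "div", "property-beds", child="strong"),
    })

for listing in listings:
    print(listing)
```

With this shape, a card that lacks a field still produces a dictionary, and only the missing value is None, so no try/except is needed around the individual lookups.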


Answer 1
2, accepted, 2020-05-18 16:11:54

Answer 2
1, 2020-05-18 16:37:29
