Python Returning An Empty List Using Beautiful Soup HTML Parsing

I'm currently working on a project that involves scraping a real estate website (for educational purposes). I'm extracting data from home listings, such as address, price, and number of bedrooms.

While building the scraper I tested each step with the print function, and it worked. I'm now building a dictionary for each listing, with one key per data point. I'm storing each dictionary in a list so that I can eventually use Pandas to create a table and write it to a CSV file.

Here is my problem. My list contains only empty dictionaries, and no error is raised. Note that I've already scraped the data successfully and seen it when using the print function. Now nothing is stored after I add each data point to a dictionary and append it to the list. Here is my code:

import requests
from bs4 import BeautifulSoup

r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content

soup=BeautifulSoup(c,"html.parser")

all=soup.find_all("div", {"class":"infinite-item"})
all[0].find("a",{"class":"listing-price"}).text.replace("\n","").replace(" ","")

l=[]
for item in all:
    d={} 
    try: 
        d["Price"]=item.find("a",{"class":"listing-price"}.text.replace("\n","").replace(" ",""))
        d["Address"]=item.find("div",{"class":"property-address"}).text.replace("\n","").replace(" ","")
        d["City"]=item.find_all("div",{"class":"property-city"})[0].text.replace("\n","").replace(" ","")
        try: 
            d["Beds"]=item.find("div",{"class":"property-beds"}).find("strong").text
        except: 
            d["Beds"]=None
        try: 
            d["Baths"]=item.find("div",{"class":"property-baths"}).find("strong").text
        except: 
            d["Baths"]=None
        try: 
            d["Area"]=item.find("div",{"class":"property-sqft"}).find("strong").text
        except: 
             d["Area"]=None
    except: 
        pass
    l.append(d)

When I inspect l (the list that contains my dictionaries), this is what I get:

[{},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {}]

I'm using Python 3.8.2 with Beautiful Soup 4. Any ideas or help with this would be greatly appreciated. Thanks!
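A likely cause, judging only from the code as posted: on the "Price" line the closing parenthesis of find() sits after the replace() calls, so .text is looked up on the dictionary {"class":"listing-price"} rather than on the tag. That raises AttributeError, the outer bare except discards it, and d is appended while still empty. The sketch below reproduces the pattern offline. FakeTag, scrape_silently and scrape_fixed are hypothetical names used for illustration; FakeTag is only a stand-in for a Beautiful Soup tag, not part of any library.

```python
class FakeTag:
    """Hypothetical stand-in for a Beautiful Soup tag (only .text and .find)."""

    def __init__(self, text):
        self.text = text

    def find(self, name, attrs=None):
        # A real tag would search its children; the stand-in returns itself.
        return self


item = FakeTag("\n $420,000 \n")


def scrape_silently(item):
    d = {}
    try:
        # Same parenthesis placement as in the question:
        # .text is applied to the dict, which raises AttributeError.
        d["Price"] = item.find("a", {"class": "listing-price"}.text.replace("\n", ""))
    except:
        pass  # the error is swallowed here, so d stays empty
    return d


def scrape_fixed(item):
    d = {}
    try:
        # Close find(...) first, then read .text from the returned tag.
        d["Price"] = (
            item.find("a", {"class": "listing-price"})
            .text.replace("\n", "")
            .replace(" ", "")
        )
    except AttributeError as exc:
        # Catch only the expected error and report it instead of hiding it.
        print("skipped field:", exc)
    return d


print(scrape_silently(item))  # {}
print(scrape_fixed(item))     # {'Price': '$420,000'}
```

Catching a specific exception and printing it, rather than using a bare except with pass, would have surfaced the misplaced parenthesis on the first run.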

This does what you want much more concisely and is more Pythonic (using a dict comprehension nested inside a list comprehension):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c = r.content

soup = BeautifulSoup(c, "html.parser")

css_classes = [
    "listing-price",
    "property-address",
    "property-city",
    "property-beds",
    "property-baths",
    "property-sqft",
]

pl = [{css_class.split('-')[1]: item.find(class_=css_class).text.strip()  # raises AttributeError if a card lacks this class
       for css_class in css_classes}  # find each class in the class list
      for item in soup.find_all('div', class_='property-card-primary-info')]  # find each property card div

print(pl)

Output:

[{'address': '512 Silver Oak Grove',
  'baths': '6 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$1,595,000',
  'sqft': '6,958 sq. ft'},
 {'address': '8910 Edgefield Drive',
  'baths': '5 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$499,900',
  'sqft': '4,557 sq. ft'},
 {'address': '135 Mayhurst Avenue',
  'baths': '3 baths',
  'beds': '3 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$420,000',
  'sqft': '1,889 sq. ft'},
 {'address': '7925 Bard Court',
  'baths': '4 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$405,000',
  'sqft': '3,077 sq. ft'},
 {'address': '7641 N Sioux Circle',
  'baths': '3 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80915',
  'price': '$389,900',
  'sqft': '3,384 sq. ft'},
 ...
]
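One caveat about the comprehension: find() returns None when nothing matches, so calling .text on the result raises AttributeError for any card that lacks one of the classes. A minimal sketch of a tolerant variant follows. text_or_none and FakeCard are hypothetical names; FakeCard is only an offline stand-in for a property-card tag so the example runs without fetching the page.

```python
from types import SimpleNamespace


def text_or_none(card, css_class):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = card.find(class_=css_class)
    return tag.text.strip() if tag is not None else None


class FakeCard:
    """Hypothetical stand-in for a property-card tag, backed by a dict."""

    def __init__(self, fields):
        self._fields = fields

    def find(self, class_=None):
        # Mimic find(): an object with .text if present, otherwise None.
        text = self._fields.get(class_)
        return SimpleNamespace(text=text) if text is not None else None


css_classes = ["listing-price", "property-beds"]

# This card has a price but no beds element.
card = FakeCard({"listing-price": "\n $420,000 \n"})

row = {css_class.split("-")[1]: text_or_none(card, css_class)
       for css_class in css_classes}
print(row)  # {'price': '$420,000', 'beds': None}
```

With real Beautiful Soup tags the helper would be called the same way, since find(class_=...) is the call already used in the comprehension above.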

You should use a function for the repetitive work. This would make your code clearer. I came up with this code, which works:

import requests
from bs4 import BeautifulSoup

def find_div_and_get_value(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).text.replace("\n","").strip()

def find_div_and_get_value2(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).find('strong').text.replace("\n","").strip()


r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup = BeautifulSoup(c,"html.parser")
houses = soup.find_all("div", {"class":"infinite-item"})

l=[]
for house in houses:
    try:
        d = {}
        d["Price"] = find_div_and_get_value(house, 'a', {"class": "listing-price"})
        d["Address"] = find_div_and_get_value(house, 'div', {"class": "property-address"})
        d["City"] = find_div_and_get_value(house, 'div', {"class":"property-city"})
        d["Beds"] = find_div_and_get_value2(house, 'div', {"class":"property-beds"})
        d["Baths"] = find_div_and_get_value2(house, 'div', {"class":"property-baths"})
        d["Area"] = find_div_and_get_value2(house, 'div', {"class":"property-sqft"})
        l.append(d)
    except AttributeError:
        # find() returned None for a missing element; stop at the first incomplete listing
        break

for house in l:
    print(house)
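Once the list holds one dictionary per listing, the CSV step mentioned in the question is short. The sketch below uses the standard library's csv.DictWriter instead of Pandas so that it runs without extra packages; with Pandas installed, pd.DataFrame(l).to_csv("listings.csv", index=False) would be the usual equivalent. listings_to_csv is a hypothetical helper name. The sample rows reuse two listings from the output shown earlier, and the None value is an assumed example of a missing field.

```python
import csv
import io

# Sample rows shaped like the dictionaries built in the loop above.
listings = [
    {"Price": "$420,000", "Address": "135 Mayhurst Avenue", "Beds": "3 beds"},
    {"Price": "$405,000", "Address": "7925 Bard Court", "Beds": None},
]


def listings_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty fields
    return buffer.getvalue()


csv_text = listings_to_csv(listings, ["Price", "Address", "Beds"])
print(csv_text)
```

Prices contain a comma, so the writer quotes them automatically. To write a file instead of a string, open it with newline="" and pass the file object to csv.DictWriter.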
