
Python Returning An Empty List Using Beautiful Soup HTML Parsing

I'm currently working on a project that involves web scraping a real estate website (for educational purposes). I'm taking data from home listings like address, price, bedrooms, etc.

After building and testing along the way with the print function (which worked successfully), I'm now building a dictionary for each listing, with one key per data point. I'm storing each dictionary in a list so that I can eventually use Pandas to create a table and write it to a CSV.

Here is my problem. My list contains only empty dictionaries, and no error is raised. Please note, I've successfully scraped the data already and have seen it when using the print function. Now it's displaying nothing after adding each data point to a dictionary and putting the dictionary in a list. Here is my code:

import requests
from bs4 import BeautifulSoup

r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content

soup=BeautifulSoup(c,"html.parser")

all=soup.find_all("div", {"class":"infinite-item"})
all[0].find("a",{"class":"listing-price"}).text.replace("\n","").replace(" ","")

l=[]
for item in all:
    d={} 
    try: 
        d["Price"]=item.find("a",{"class":"listing-price"}.text.replace("\n","").replace(" ",""))
        d["Address"]=item.find("div",{"class":"property-address"}).text.replace("\n","").replace(" ","")
        d["City"]=item.find_all("div",{"class":"property-city"})[0].text.replace("\n","").replace(" ","")
        try: 
            d["Beds"]=item.find("div",{"class":"property-beds"}).find("strong").text
        except: 
            d["Beds"]=None
        try: 
            d["Baths"]=item.find("div",{"class":"property-baths"}).find("strong").text
        except: 
            d["Baths"]=None
        try: 
            d["Area"]=item.find("div",{"class":"property-sqft"}).find("strong").text
        except: 
             d["Area"]=None
    except: 
        pass
    l.append(d)

When I display l (the list that contains my dictionaries), this is what I get:

[{},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {}]

I'm using Python 3.8.2 with Beautiful Soup 4. Any ideas or help with this would be greatly appreciated. Thanks!

This does what you want much more concisely and is more Pythonic (a dict comprehension nested inside a list comprehension):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c = r.content

soup = BeautifulSoup(c, "html.parser")

css_classes = [
    "listing-price",
    "property-address",
    "property-city",
    "property-beds",
    "property-baths",
    "property-sqft",
]

pl = [{css_class.split('-')[1]: item.find(class_=css_class).text.strip() # raises AttributeError if a card lacks this class
       for css_class in css_classes} # find each class in the class list
       for item in soup.find_all('div', class_='property-card-primary-info')] # find each property card div

print(pl)

Output:

[{'address': '512 Silver Oak Grove',
  'baths': '6 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$1,595,000',
  'sqft': '6,958 sq. ft'},
 {'address': '8910 Edgefield Drive',
  'baths': '5 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$499,900',
  'sqft': '4,557 sq. ft'},
 {'address': '135 Mayhurst Avenue',
  'baths': '3 baths',
  'beds': '3 beds',
  'city': 'Colorado Springs CO 80906',
  'price': '$420,000',
  'sqft': '1,889 sq. ft'},
 {'address': '7925 Bard Court',
  'baths': '4 baths',
  'beds': '5 beds',
  'city': 'Colorado Springs CO 80920',
  'price': '$405,000',
  'sqft': '3,077 sq. ft'},
 {'address': '7641 N Sioux Circle',
  'baths': '3 baths',
  'beds': '4 beds',
  'city': 'Colorado Springs CO 80915',
  'price': '$389,900',
  'sqft': '3,384 sq. ft'},
 ...
]
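
One caveat: find returns None when nothing matches, so calling .text on the result raises AttributeError for any card that lacks one of the classes. Below is a minimal sketch of a variant that stores None instead. It runs against made-up inline HTML rather than the live page, and the names SAMPLE_HTML, text_or_none and parse_cards are hypothetical helpers introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for two property cards; the second has no baths div.
SAMPLE_HTML = """
<div class="property-card-primary-info">
  <a class="listing-price">$499,900</a>
  <div class="property-address">8910 Edgefield Drive</div>
  <div class="property-beds">5 beds</div>
  <div class="property-baths">5 baths</div>
</div>
<div class="property-card-primary-info">
  <a class="listing-price">$420,000</a>
  <div class="property-address">135 Mayhurst Avenue</div>
  <div class="property-beds">3 beds</div>
</div>
"""

CSS_CLASSES = ["listing-price", "property-address", "property-beds", "property-baths"]


def text_or_none(card, css_class):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = card.find(class_=css_class)  # find() returns None when there is no match
    return tag.get_text().strip() if tag is not None else None


def parse_cards(html):
    """Build one dict per property card, keyed by the part after the first hyphen."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {css_class.split("-")[1]: text_or_none(card, css_class) for css_class in CSS_CLASSES}
        for card in soup.find_all("div", class_="property-card-primary-info")
    ]


cards = parse_cards(SAMPLE_HTML)
print(cards)
```

With this shape, a card that lacks a field still produces a complete dictionary, which keeps the columns aligned if the list is later turned into a table.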

You should use a function for the repetitive work; it makes the code clearer. This version works for me:

import requests
from bs4 import BeautifulSoup

def find_div_and_get_value(soup, html_balise, attributes):
    # Text of the first matching tag, without newlines or surrounding whitespace
    return soup.find(html_balise, attrs=attributes).text.replace("\n","").strip()

def find_div_and_get_value2(soup, html_balise, attributes):
    # Same, but reads the <strong> child of the first matching tag
    return soup.find(html_balise, attrs=attributes).find('strong').text.replace("\n","").strip()


r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup = BeautifulSoup(c,"html.parser")
houses = soup.find_all("div", {"class":"infinite-item"})

l=[]
for house in houses:
    try:
        d = {}
        d["Price"] = find_div_and_get_value(house, 'a', {"class": "listing-price"})
        d["Address"] = find_div_and_get_value(house, 'div', {"class": "property-address"})
        d["City"] = find_div_and_get_value(house, 'div', {"class":"property-city"})
        d["Beds"] = find_div_and_get_value2(house, 'div', {"class":"property-beds"})
        d["Baths"] = find_div_and_get_value2(house, 'div', {"class":"property-baths"})
        d["Area"] = find_div_and_get_value2(house, 'div', {"class":"property-sqft"})
        l.append(d)
    except AttributeError:
        # find() returned None because a field is missing: skip this listing, keep going
        continue

for house in l:
    print(house)
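
As for why the loop in the question produced empty dictionaries: in the d["Price"] line, the closing parenthesis of find sits at the end of the line, so .text is looked up on the attribute dict {"class": "listing-price"} instead of on the tag. That raises AttributeError before any key is stored, and the outer bare except with pass hides it. Here is a minimal sketch that reproduces this without touching the network; FakeTag and FakeItem are hypothetical stand-ins for Beautiful Soup objects, not part of any library.

```python
class FakeTag:
    """Hypothetical stand-in for a Beautiful Soup tag."""
    text = "\n $499,900 \n"


class FakeItem:
    """Hypothetical stand-in for one listing; find() ignores its arguments."""
    def find(self, *args, **kwargs):
        return FakeTag()


item = FakeItem()

# As in the question: .text is evaluated on the dict, before find() is even called.
broken = {}
error_name = None
try:
    broken["Price"] = item.find("a", {"class": "listing-price"}.text.replace("\n", "").replace(" ", ""))
except Exception as exc:  # the question used a bare except with pass, which hid this
    error_name = type(exc).__name__

# Parenthesis closed right after the attribute dict: .text is read from the tag.
fixed = {}
fixed["Price"] = item.find("a", {"class": "listing-price"}).text.replace("\n", "").replace(" ", "")

print(broken, error_name)  # {} AttributeError
print(fixed)               # {'Price': '$499,900'}
```

Catching a specific exception and printing it, rather than using a bare except with pass, would have surfaced the problem immediately.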

Technical posts on this site are licensed under CC BY-SA 4.0. If you reprint one, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 