I have:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,'html.parser')
containers = page_soup.findAll("div",{"class":"listing-results-wrapper"})
listing_price = []
listing_nobed = []
for c in containers:
    listing_price.append(c.findAll("a",{"class":"listing-results-price text-price"}))
    listing_nobed.append(c.findAll("h3",{"class":"listing-results-attr"}))
print(listing_price[0])
print('----------------------------')
print(listing_nobed[0])
results:
[<a class="listing-results-price text-price" href="/for-sale/details/50924268">
£500,000
<span class="price-modifier">Offers over</span>
</a>]
----------------------------
[<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span> <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span> <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
</h3>]
I want:
Price    NoBeds  NoBaths  NoRec
500,000  3       1        2
xxx      x       x        NaN
where xxx is the price, and so on. Some listings are missing one of these tags; in that case, show NaN or 0.
I tried the approach from "Python - Beautiful Soup - Remove Tags" to extract the (3, 1, 2) values, to no avail.
To extract the price, I thought of using regex, but many comments here recommend against it.
I am still trying to understand html tags and data extractions, so any suggestions are greatly appreciated.
You can use .next to step to the next element in the parse tree, and strip() to clean up the text:
from bs4 import BeautifulSoup as soup
import requests
my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'
req = requests.get(my_url)
page_soup = soup(req.content,'html.parser')
containers = page_soup.findAll("div",{"class":"listing-results-wrapper"})
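for c in containers:
    a = c.find("a",{"class":"listing-results-price text-price"})
    b = c.find("h3",{"class":"listing-results-attr"})
    # Only search inside the <h3> if the listing actually has one
    NoBedsx = b.find('span',{'class':'num-icon num-beds'}) if b else None
    NoBathsx = b.find('span',{'class':'num-icon num-baths'}) if b else None
    NoRecx = b.find('span',{'class':'num-icon num-reception'}) if b else None
    # Reset the fields for every listing so that a missing tag prints NaN
    # rather than the value left over from the previous listing
    Price = NoBeds = NoBaths = NoRec = 'NaN'
    if a:
        # The first node after <a> is the price text, padded with whitespace
        Price = a.next.strip()
    if NoBedsx:
        # First .next is the empty "interface" span, second is the number
        NoBeds = NoBedsx.next.next
    if NoBathsx:
        NoBaths = NoBathsx.next.next
    if NoRecx:
        NoRec = NoRecx.next.next
    print('{} {} {} {}'.format(Price,NoBeds,NoBaths,NoRec))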
Output:
Price NoBeds NoBaths NoRec
£500,000 3 1 2
£337,500 4 2 1
£875,000 5 2 2
£695,000 4 1 2
£190,000 1 1 1
£670,000 4 2 1
£610,000 3 2 2
£675,000 4 2 1
£580,000 4 2 1
£850,000 5 2 1
£185,000 1 2 1
£760,000 5 2 1
£675,000 3 2 1
£142,000 1 2 1
£550,000 2 2 1
£817,000 4 2 1
£139,000 1 2 1
£625,000 3 1 2
£145,000 1 1 2
£725,000 4 1 2
£799,995 4 1 2
£575,000 3 1 2
£465,000 3 1 2
£725,000 4 2 2
£465,000 4 2 2
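For trying this out without hitting the site, here is a minimal offline sketch of the same idea. It is not the code from the answer above: it swaps .next for Beautiful Soup's stripped_strings, and it assumes the page markup matches the snippet shown in the question. The second listing in SAMPLE_HTML (including its href) and the helper names first_text and parse_listings are made up for illustration; the listing has no reception-room span, to show the NaN fallback.

```python
from bs4 import BeautifulSoup

# First listing is copied from the question; the second is a hypothetical
# listing with no reception-room span, added to exercise the NaN fallback.
SAMPLE_HTML = """
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/50924268">
    £500,000
    <span class="price-modifier">Offers over</span>
  </a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>
    <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span>
    <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
  </h3>
</div>
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/hypothetical">£190,000</a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="1 bedroom"><span class="interface"></span>1</span>
    <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span>
  </h3>
</div>
"""


def first_text(tag):
    """Return the first non-empty text inside tag, or 'NaN' if there is none."""
    if tag is None:
        return "NaN"
    # stripped_strings yields each text node with surrounding whitespace removed
    return next(tag.stripped_strings, "NaN")


def parse_listings(html):
    """Return one (price, beds, baths, receptions) tuple of strings per listing."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for c in page.find_all("div", class_="listing-results-wrapper"):
        # class_ matches a single CSS class even when the tag carries several
        price = first_text(c.find("a", class_="listing-results-price"))
        rows.append((
            price.lstrip("£"),  # drop the currency symbol, keep the figure
            first_text(c.find("span", class_="num-beds")),
            first_text(c.find("span", class_="num-baths")),
            first_text(c.find("span", class_="num-reception")),
        ))
    return rows


print("Price NoBeds NoBaths NoRec")
for row in parse_listings(SAMPLE_HTML):
    print(" ".join(row))
```

Because every field goes through first_text, a missing tag can never pick up a value from an earlier listing. To run it against the live page, pass req.content from the answer above to parse_listings instead of SAMPLE_HTML, assuming the site still serves the same class names.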