
Extract information from HTML tags using Beautiful Soup in Python

I have:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

listing_price = []
listing_nobed = []

for c in containers:
    listing_price.append(c.findAll("a",{"class":"listing-results-price text-price"}))
    listing_nobed.append(c.findAll("h3",{"class":"listing-results-attr"}))

print(listing_price[0])
print('----------------------------')
print(listing_nobed[0])

results:

[<a class="listing-results-price text-price" href="/for-sale/details/50924268">




        £500,000







                <span class="price-modifier">Offers over</span>
</a>]
----------------------------
[<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span> <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span> <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
</h3>]

I want:

Price   NoBeds NoBaths NoRec
500,000 3      1       2
xxx     x      x       NaN

Here xxx stands for the price, and so on. Some listings are missing a tag; in that case the value should be shown as NaN or 0.

I tried the approach from "Python - Beautiful Soup - Remove Tags" to extract the (3, 1, 2) values, to no avail.

To extract the price I thought of using a regex, but many comments here advise against it.

I am still learning about HTML tags and data extraction, so any suggestions are greatly appreciated.

You can use the .next attribute (a backward-compatible alias of .next_element) to step to the next node in the parse tree, and strip() to clean up the text:
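
To see what that navigation does, here is a minimal sketch run against the markup printed in the question rather than the live page. The whitespace in the snippet is shortened, and it uses next_element, which is the documented name for the same step; it is an illustration, not the only way to reach these values.

```python
from bs4 import BeautifulSoup

# Markup copied from the question's printed results (whitespace shortened).
html = """
<a class="listing-results-price text-price" href="/for-sale/details/50924268">
        £500,000
        <span class="price-modifier">Offers over</span>
</a>
<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")

price_tag = soup.find("a", {"class": "listing-results-price text-price"})
# The first node inside <a> is a text node: whitespace, the price, whitespace.
price = price_tag.next_element.strip()

beds_tag = soup.find("span", {"class": "num-icon num-beds"})
# The first hop lands on the empty <span class="interface">;
# the second hop lands on the text node that follows it.
beds = beds_tag.next_element.next_element.strip()

print(price, beds)
```

The price needs one hop because the text comes first inside the link; the counts need two because an empty span sits in front of each number.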

from bs4 import BeautifulSoup as soup
import requests
my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

req = requests.get(my_url)
page_soup = soup(req.content,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

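print('Price NoBeds NoBaths NoRec')

for c in containers:
    a = c.find("a",{"class":"listing-results-price text-price"})
    b = c.find("h3",{"class":"listing-results-attr"})

    # Reset every field so a missing tag shows NaN instead of the previous row's value
    Price = NoBeds = NoBaths = NoRec = 'NaN'

    if a:
        # The first node inside <a> is the price text, padded with whitespace
        Price = a.next_element.strip()
    if b:
        NoBedsx = b.find('span',{'class':'num-icon num-beds'})
        NoBathsx = b.find('span',{'class':'num-icon num-baths'})
        NoRecx = b.find('span',{'class':'num-icon num-reception'})

        # Each count follows an empty <span class="interface">, hence two hops
        if NoBedsx:
            NoBeds = NoBedsx.next_element.next_element.strip()
        if NoBathsx:
            NoBaths = NoBathsx.next_element.next_element.strip()
        if NoRecx:
            NoRec = NoRecx.next_element.next_element.strip()
    print('{} {} {} {}'.format(Price,NoBeds,NoBaths,NoRec))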

Output:

Price  NoBeds NoBaths NoRec
£500,000 3 1 2
£337,500 4 2 1
£875,000 5 2 2
£695,000 4 1 2
£190,000 1 1 1
£670,000 4 2 1
£610,000 3 2 2
£675,000 4 2 1
£580,000 4 2 1
£850,000 5 2 1
£185,000 1 2 1
£760,000 5 2 1
£675,000 3 2 1
£142,000 1 2 1
£550,000 2 2 1
£817,000 4 2 1
£139,000 1 2 1
£625,000 3 1 2
£145,000 1 1 2
£725,000 4 1 2
£799,995 4 1 2
£575,000 3 1 2
£465,000 3 1 2
£725,000 4 2 2
£465,000 4 2 2
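
The question also asked for plain numbers and for NaN where a tag is missing. The following is a rough sketch of one way to do that, not part of the original answer. It parses an inline sample instead of the live page; the first listing is the markup from the question, while the second listing (including its href) is made up here to show the missing-tag case. The helper names count and price are hypothetical.

```python
import math
from bs4 import BeautifulSoup

# First listing: markup from the question. Second listing: invented for
# illustration, with no bathroom or reception spans.
html = """
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/50924268">
    £500,000
    <span class="price-modifier">Offers over</span>
  </a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>
    <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span>
    <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
  </h3>
</div>
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/0">
    £190,000
  </a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="1 bedroom"><span class="interface"></span>1</span>
  </h3>
</div>
"""

NAN = math.nan


def count(container, css_class):
    """Return the integer inside the span with css_class, or NaN if absent."""
    tag = container.find("span", class_=css_class)
    if tag is None:
        return NAN
    text = tag.get_text(strip=True)
    return int(text) if text.isdigit() else NAN


def price(container):
    """Return the price as an integer number of pounds, or NaN if absent."""
    tag = container.find("a", class_="listing-results-price")
    if tag is None:
        return NAN
    # The price is the first non-empty text node; drop "£" and the commas.
    first = next(tag.stripped_strings, "")
    digits = first.replace("£", "").replace(",", "")
    return int(digits) if digits.isdigit() else NAN


soup = BeautifulSoup(html, "html.parser")
rows = []
for c in soup.find_all("div", class_="listing-results-wrapper"):
    rows.append((price(c), count(c, "num-beds"),
                 count(c, "num-baths"), count(c, "num-reception")))

print("Price NoBeds NoBaths NoRec")
for row in rows:
    print(*row)
```

Plain string replacement is enough for the price here, so no regex is needed. Keeping the lookups in small functions that return NaN means every row has four values even when a listing lacks some of the tags.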
