Extract information from HTML tags using Beautiful Soup in Python

I have:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

listing_price = []
listing_nobed = []

for c in containers:
    listing_price.append(c.findAll("a",{"class":"listing-results-price text-price"}))
    listing_nobed.append(c.findAll("h3",{"class":"listing-results-attr"}))

print(listing_price[0])
print('----------------------------')
print(listing_nobed[0])

Results:

[<a class="listing-results-price text-price" href="/for-sale/details/50924268">




        £500,000







                <span class="price-modifier">Offers over</span>
</a>]
----------------------------
[<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span> <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span> <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
</h3>]

I want:

Price   NoBeds NoBaths NoRec
500,000 3      1       2
xxx     x      x       NaN

Where xxx is the price, etc. Some listings are missing one of the tags; in that case, show NaN or 0 for that value.

I tried the approach from "Python - Beautiful Soup - Remove Tags" to extract the (3, 1, 2) values, to no avail.

To extract the price, I thought of using regex, but many comments here recommend against it.

I am still trying to understand HTML tags and data extraction, so any suggestions are greatly appreciated.

You can use the .next attribute (the next parsed element, also spelled .next_element) to step from a tag to the node that follows it, and strip() to clean up the surrounding whitespace in the text.

from bs4 import BeautifulSoup as soup
import requests

my_url = 'https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

req = requests.get(my_url)
page_soup = soup(req.content, 'html.parser')

containers = page_soup.findAll("div", {"class": "listing-results-wrapper"})

print('Price  NoBeds NoBaths NoRec')
for c in containers:
    a = c.find("a", {"class": "listing-results-price text-price"})
    b = c.find("h3", {"class": "listing-results-attr"})

    # Reset every field per listing, so a missing tag shows NaN
    # instead of silently reusing the previous listing's value.
    Price = NoBeds = NoBaths = NoRec = 'NaN'

    if a:
        # The first node inside <a> is the price text, padded with whitespace.
        Price = a.next.strip()
    if b:
        NoBedsx = b.find('span', {'class': 'num-icon num-beds'})
        NoBathsx = b.find('span', {'class': 'num-icon num-baths'})
        NoRecx = b.find('span', {'class': 'num-icon num-reception'})

        # First .next is the empty <span class="interface">; the second is the number.
        # No .encode('utf-8') here: on Python 3 it would print b'...' literals.
        if NoBedsx:
            NoBeds = NoBedsx.next.next
        if NoBathsx:
            NoBaths = NoBathsx.next.next
        if NoRecx:
            NoRec = NoRecx.next.next
    print('{} {} {} {}'.format(Price, NoBeds, NoBaths, NoRec))

Output:

Price  NoBeds NoBaths NoRec
£500,000 3 1 2
£337,500 4 2 1
£875,000 5 2 2
£695,000 4 1 2
£190,000 1 1 1
£670,000 4 2 1
£610,000 3 2 2
£675,000 4 2 1
£580,000 4 2 1
£850,000 5 2 1
£185,000 1 2 1
£760,000 5 2 1
£675,000 3 2 1
£142,000 1 2 1
£550,000 2 2 1
£817,000 4 2 1
£139,000 1 2 1
£625,000 3 1 2
£145,000 1 1 2
£725,000 4 1 2
£799,995 4 1 2
£575,000 3 1 2
£465,000 3 1 2
£725,000 4 2 2
£465,000 4 2 2
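
The question also asks for NaN when a tag is missing and for the price without the pound sign. Here is a minimal offline sketch of one way to get both, run against the markup quoted in the question rather than the live site. The second listing in SAMPLE_HTML is made up (it has no reception-room tag) purely to exercise the missing-value path, and the helper names count_or_nan and parse_listings are hypothetical. It assumes the price text is always the first node inside the <a> tag, as it is in the quoted markup.

```python
from bs4 import BeautifulSoup

# First listing: markup quoted in the question.
# Second listing: hypothetical, with no reception-room tag, to show the NaN path.
SAMPLE_HTML = """
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/50924268">
        £500,000
        <span class="price-modifier">Offers over</span>
  </a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>
    <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span>
    <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
  </h3>
</div>
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/0">
        £337,500
  </a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="4 bedrooms"><span class="interface"></span>4</span>
    <span class="num-icon num-baths" title="2 bathrooms"><span class="interface"></span>2</span>
  </h3>
</div>
"""


def count_or_nan(wrapper, css_class):
    """Return the text of the span carrying css_class, or 'NaN' if it is absent."""
    tag = wrapper.find("span", class_=css_class)
    return tag.get_text(strip=True) if tag else "NaN"


def parse_listings(html):
    """Return one (price, beds, baths, receptions) tuple of strings per listing."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for wrapper in page.find_all("div", class_="listing-results-wrapper"):
        price_tag = wrapper.find("a", class_="listing-results-price")
        if price_tag:
            # next_element is the first node inside <a>: the padded price text.
            price = price_tag.next_element.strip().lstrip("£")
        else:
            price = "NaN"
        rows.append((
            price,
            count_or_nan(wrapper, "num-beds"),
            count_or_nan(wrapper, "num-baths"),
            count_or_nan(wrapper, "num-reception"),
        ))
    return rows


row_format = "{:<8} {:<6} {:<7} {}"
print(row_format.format("Price", "NoBeds", "NoBaths", "NoRec"))
for row in parse_listings(SAMPLE_HTML):
    print(row_format.format(*row))
```

Passing class_ with a single class name matches a tag that carries that class among others, so "num-beds" matches class="num-icon num-beds" without spelling out the full attribute string. Plain string methods (strip and lstrip) are enough to clean the price here, so no regex is needed.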
