
Scrape pages using Beautiful Soup

I have two slightly different URLs: https://www.booli.se/annons/2278076 and https://www.booli.se/bostad/507292

The difference between the two pages is that the second one has no Utropspris (estimated price).

On the first page I get the estimated price (Utropspris) with the following code:

in[1]= soup.findAll('span', class_='property__base-info__value')[1].text.strip() 
out[1]= u'3 800 000 kr\n\t\t\t64 407 kr/m\xb2'

However, on the second page the same code returns the fee (Avgift) instead:

in[2]= soup.findAll('span', class_='property__base-info__value')[1].text.strip() 
out[2]= u'4 425 kr/m\xe5n'

How can I make my code recognize that, on the second page, the value at this position is no longer the estimated price (Utropspris) but the fee (Avgift), so that it is saved as the fee and the estimated price is recorded as NA? Here is the part of my code that might be helpful:

import requests
from bs4 import BeautifulSoup

url = 'https://www.booli.se/bostad/507292'
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')
# Second matching span on the page: Utropspris on the first URL, Avgift on this one
soup.findAll('span', class_='property__base-info__value')[1].text.strip()

One idea is to also scrape the label associated with each value. From what I can see on the site, every row of information sits inside an li item with class="property__base-info__item".

So on the first page, that row has a span with class="property__base-info__unit" whose text is 'Utropspris', and a span with class="property__base-info__value" holding the value you have already extracted.

You can do something like this:

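elements = soup.findAll('li', class_='property__base-info__item')
pairs = {}
for element in elements:
    tag = element.find('span', class_='property__base-info__unit')
    value = element.find('span', class_='property__base-info__value')
    # Skip rows that lack either span instead of failing on None
    if tag is None or value is None:
        continue
    # Map the label (e.g. 'Utropspris') to its value text
    pairs[tag.text.strip()] = value.text.strip()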

I have not tested this code myself, but the idea is to iterate over the list items and read both the label and its value. You can then store the pairs in a dictionary and handle each case however you like.
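
Once the label/value pairs are in a dictionary, the missing Utropspris case can probably be handled with dict.get and a default. This is only a sketch: it assumes the labels on the page are exactly 'Utropspris' and 'Avgift' (worth checking against the live markup). The helper name summarize and the two sample dictionaries are made up for illustration; their values are copied from the outputs shown in the question, and each dictionary contains only that one field.

```python
def summarize(pairs):
    """Pick price and fee out of a label -> value dict, using 'NA' for absent labels."""
    # Collapse the newlines and tabs that appear in the scraped text.
    cleaned = {label.strip(): ' '.join(value.split())
               for label, value in pairs.items()}
    return {
        'Utropspris': cleaned.get('Utropspris', 'NA'),
        'Avgift': cleaned.get('Avgift', 'NA'),
    }


# Hypothetical dictionaries shaped like the result of the loop above.
first_page = {'Utropspris': u'3 800 000 kr\n\t\t\t64 407 kr/m\xb2'}
second_page = {'Avgift': u'4 425 kr/m\xe5n'}

print(summarize(first_page))
print(summarize(second_page))
```

Looking values up by label rather than by position means the result does not shift when a row is missing from the page.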

You can actually tell the two apart: there is also a span just before your data.

 <span class="property__base-info__unit">Utropspris</span>

As you can see, you can scrape this element as well. If the span's content is Utropspris, the data is the Utropspris; if not, it is the Avgift.
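
A minimal sketch of that check, run against a made-up HTML fragment rather than the live page. It assumes the label span is a preceding sibling of the value span inside the same li, which matches the description above but has not been verified against the site. The fragment SAMPLE_HTML and the function name price_and_fee are illustrative only, and html.parser is used here simply to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a page that has an Avgift row but no Utropspris row.
SAMPLE_HTML = """
<ul>
  <li class="property__base-info__item">
    <span class="property__base-info__unit">Avgift</span>
    <span class="property__base-info__value">4 425 kr/m\xe5n</span>
  </li>
</ul>
"""


def price_and_fee(soup):
    """Read Utropspris and Avgift by checking the label span before each value span."""
    result = {'Utropspris': 'NA', 'Avgift': 'NA'}
    for value in soup.find_all('span', class_='property__base-info__value'):
        # The label is assumed to be the span just before the value.
        label = value.find_previous_sibling(
            'span', class_='property__base-info__unit')
        if label is None:
            continue
        name = label.get_text(strip=True)
        if name in result:
            # Collapse internal newlines and tabs into single spaces.
            result[name] = ' '.join(value.get_text().split())
    return result


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(price_and_fee(soup))
```

Any label that never appears on the page keeps its 'NA' default, so the same function should work for both URLs as long as the markup matches these assumptions.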


 