I'm using requests and BeautifulSoup to scrape data from a real estate website. It has several numbered "pages", each showing dozens of apartments. I wrote a loop that runs across all these pages and collects data from the apartments, but unfortunately the site uses JavaScript, and because of that the code only returns the apartments from the first page. I also tried something with Selenium, but ran into the same problem.
Thanks a lot in advance for any suggestions!
Here's the code:
import time
from random import randint

import requests
from bs4 import BeautifulSoup

# Create empty lists to append data scraped from URL
# Number of lists depends on the number of features you want to extract
lista_preco = []
lista_endereco = []
lista_tamanho = []
lista_quartos = []
lista_banheiros = []
lista_vagas = []
lista_condominio = []
lista_amenidades = []
lista_fotos = []
lista_sites = []
n_pages = 0

for page in range(1, 15):
    n_pages += 1
    url = 'https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/' + '?pagina=' + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    house_containers = soup.find_all('div', {'class': 'js-card-selector'})
    if house_containers != []:
        for container in house_containers:
            # Price
            price = container.find_all('section', class_='property-card__values')[0].text
            try:
                price = int(price[:price.find('C')].replace('R$', '').replace('.', '').strip())
            except ValueError:
                price = 0
            lista_preco.append(price)
            # Zone
            location = container.find_all('span', class_='property-card__address')[0].text
            location = location.strip()
            lista_endereco.append(location)
            # Size
            size = container.find_all('span', class_='property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area')[0].text
            if '-' not in size:
                size = int(size[:size.find('m')].replace(',', '').strip())
            else:
                size = int(size[:size.find('-')].replace(',', '').strip())
            lista_tamanho.append(size)
            # Rooms
            quartos = container.find_all('li', class_='property-card__detail-item property-card__detail-room js-property-detail-rooms')[0].text
            quartos = quartos[:quartos.find('Q')].strip()
            if '-' in quartos:
                quartos = quartos[:quartos.find('-')].strip()
            lista_quartos.append(int(quartos))
            # Bathrooms
            banheiros = container.find_all('li', class_='property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom')[0].text
            banheiros = banheiros[:banheiros.find('B')].strip()
            if '-' in banheiros:
                banheiros = banheiros[:banheiros.find('-')].strip()
            lista_banheiros.append(int(banheiros))
            # Garage
            vagas = container.find_all('li', class_='property-card__detail-item property-card__detail-garage js-property-detail-garages')[0].text
            vagas = vagas[:vagas.find('V')].strip()
            if '--' in vagas:
                vagas = '0'
            lista_vagas.append(int(vagas))
            # Condomínio
            condominio = container.find_all('section', class_='property-card__values')[0].text
            try:
                condominio = int(condominio[condominio.rfind('R$'):].replace('R$', '').replace('.', '').strip())
            except ValueError:
                condominio = 0
            lista_condominio.append(condominio)
            # Amenidades
            try:
                amenidades = container.find_all('ul', class_='property-card__amenities')[0].text
                amenidades = amenidades.split()
            except IndexError:
                amenidades = 'Zero'
            lista_amenidades.append(amenidades)
            # url
            link = 'https://www.vivareal.com.br/' + container.find_all('a')[0].get('href')[1:-1]
            lista_sites.append(link)
            # image
            #p = str(container.find_all('img')[0])
            #2x size thumbnail
            #imgurl = p[p.find('https'):p.rfind('data-src')]
            #imgurl.replace('"', '').strip()
            #lista_fotos.append(imgurl)
    else:
        break
    time.sleep(randint(1, 2))

print('You scraped {} pages containing {} properties.'.format(n_pages, len(lista_preco)))
You DO have a choice. There is no need to use Selenium, as you can access the data through the site's API.
The site restricts pagination to a maximum of 10,000 listings, which is why the code below walks through price bands and collects each band separately. The JSON response contains far more data than what you want here, so it's up to you to look through it and see if there is anything more you want to add:
Code:
import math
import random
import time

import pandas as pd
import requests

url = 'https://glue-api.vivareal.com/v2/listings'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'x-domain': 'www.vivareal.com.br'}
payload = {
    'addressCity': 'Salvador',
    'addressLocationId': 'BR>Bahia>NULL>Salvador',
    'addressNeighborhood': '',
    'addressState': 'Bahia',
    'addressCountry': 'Brasil',
    'addressStreet': '',
    'addressZone': '',
    'addressPointLat': '-12.977738',
    'addressPointLon': '-38.501636',
    'business': 'SALE',
    'facets': 'amenities',
    'unitTypes': 'APARTMENT',
    'unitSubTypes': 'UnitSubType_NONE,DUPLEX,LOFT,STUDIO,TRIPLEX',
    'unitTypesV3': 'APARTMENT',
    'usageTypes': 'RESIDENTIAL',
    'listingType': 'USED',
    'parentId': 'null',
    'categoryPage': 'RESULT',
    'size': '350',
    'from': '0',
    'q': '',
    'developmentsSize': '5',
    '__vt': '',
    'levels': 'CITY,UNIT_TYPE',
    'ref': '/venda/bahia/salvador/apartamento_residencial/',
    'pointRadius': ''}


def get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData):
    """Widen the price band until it would exceed the 10,000-listing cap,
    then return the last band that still fit (previous_jsonData) along with
    the start of the next band."""
    payload.update({'from': '0'})
    # time.sleep(random.uniform(5.1, 7.9))
    if priceMax > 2500000:
        priceMax = 100000000
    payload.update({'priceMin': '%s' % priceMin, 'priceMax': '%s' % priceMax})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    listings_count = jsonData['search']['totalCount']
    if listings_count < 10000:
        if priceMax < 100000000:
            print('Price range %s - %s returns %s listings.' % (priceMin, priceMax, listings_count))
            previous_jsonData = jsonData
            previous_priceMax = priceMax
            priceMax += 25000
            return get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData)
        else:
            # Final band: everything up to the price ceiling still fits
            previous_jsonData = jsonData
            previous_priceMax = 100000000
            priceMin = previous_priceMax + 1
            priceMax = priceMin + 250000 - 1
    else:
        # The band grew past the 10,000-listing cap: keep the last band
        # that fit and start the next band just above it
        priceMin = previous_priceMax + 1
        priceMax = priceMin + 250000 - 1
    return listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData


rows = []
priceMin = 1
priceMax = 250000
finished = False
acquired = []
while not finished:
    listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData = get_num_of_listings(
        priceMin, priceMax, payload, url, None, None, None)
    total_pages = math.ceil(previous_jsonData['search']['totalCount'] / 350)
    for page in range(1, total_pages + 1):
        if page == 1:
            idx = 0
            jsonData = previous_jsonData
        else:
            idx = 350 * (page - 1)
            if idx > 9800:
                continue
            if idx == 9800:
                # Last slice before the 10,000-listing cap
                payload.update({'size': 200})
            else:
                payload.update({'size': 350})
            payload.update({'from': '%s' % idx})
            # time.sleep(random.uniform(5.1, 7.9))
            jsonData = requests.get(url, headers=headers, params=payload).json()
        listings = jsonData['search']['result']['listings']
        for listing in listings:
            listingId = listing['listing']['id']
            if listingId in acquired:
                continue
            zone = listing['listing']['address']['zone']
            size = listing['listing']['usableAreas'][0]
            bedrooms = listing['listing']['bedrooms'][0]
            bathrooms = listing['listing']['bathrooms'][0]
            if listing['listing']['parkingSpaces'] != []:
                parking = listing['listing']['parkingSpaces'][0]
            else:
                parking = None
            price = listing['listing']['pricingInfos'][0]['price']
            try:
                condoFee = listing['listing']['pricingInfos'][0]['monthlyCondoFee']
            except KeyError:
                condoFee = None
            amenities = listing['listing']['amenities']
            listingUrl = 'https://www.vivareal.com.br' + listing['link']['href']
            row = {
                'Id': listingId,
                'Zone': zone,
                'Size': size,
                'Bedrooms': bedrooms,
                'Bathrooms': bathrooms,
                'Garage': parking,
                'Price': price,
                'Condominio': condoFee,
                'Amenidades': amenities,
                'url': listingUrl}
            acquired.append(listingId)
            rows.append(row)
        print('Page %s of %s' % (page, total_pages))
    if priceMax > 100000000:
        print('Done')
        finished = True

df = pd.DataFrame(rows)
Output:
IPdb [3]: print(df)
Id ... url
0 2511396476 ... https://www.vivareal.com.br/imovel/apartamento...
1 2494354474 ... https://www.vivareal.com.br/imovel/apartamento...
2 2504461896 ... https://www.vivareal.com.br/imovel/apartamento...
3 2508574459 ... https://www.vivareal.com.br/imovel/apartamento...
4 2511489082 ... https://www.vivareal.com.br/imovel/apartamento...
... ... ...
26244 94618731 ... https://www.vivareal.com.br/imovel/apartamento...
26245 93437597 ... https://www.vivareal.com.br/imovel/apartamento...
26246 79341843 ... https://www.vivareal.com.br/imovel/apartamento...
26247 2455978575 ... https://www.vivareal.com.br/imovel/apartamento...
26248 2509913182 ... https://www.vivareal.com.br/imovel/apartamento...
[26249 rows x 10 columns]
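Once `rows` is assembled, a quick sanity pass with pandas catches duplicate listings and non-numeric prices before any analysis. A minimal sketch; the three sample rows below are made up for illustration, shaped like the scraper's output dicts:

```python
import pandas as pd

# Hypothetical sample rows, shaped like the dicts the scraper builds
rows = [
    {'Id': '1', 'Zone': 'Norte', 'Size': 70, 'Price': '300000'},
    {'Id': '2', 'Zone': 'Sul', 'Size': 55, 'Price': '250000'},
    {'Id': '1', 'Zone': 'Norte', 'Size': 70, 'Price': '300000'},  # duplicate Id
]

df = pd.DataFrame(rows)

# The API may return prices as strings; converting defensively costs nothing
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Drop any duplicate listings that slipped past the in-loop check
df = df.drop_duplicates(subset='Id').reset_index(drop=True)

print(len(df))             # 2
print(df['Price'].mean())  # 275000.0
```

From here, `df.to_csv('listings.csv', index=False)` persists the run so you don't have to re-scrape while exploring the data.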
Unfortunately, I believe there is NO way around this for you. The reason is that with modern frontend technologies the HTML is rendered asynchronously, and JavaScript needs a "real" environment to run and load the page. For example, with Ajax you will need a real browser (Chrome, Firefox) to make it work. So my suggestion is to keep digging deeper into Selenium: mimic the click event to click each page (clicking on the page numbers 1, 2, 3, ... until the end), wait until the data has loaded, then read the HTML and extract the data you need. Regards.
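If you do go the Selenium route, the key points from the paragraph above are waiting explicitly for the cards to render before reading the HTML. A rough sketch using Selenium 4 syntax, reusing the `js-card-selector` class from the question; the page-count and wait values are assumptions, and the Selenium imports are deferred into the function so the pure URL helper stays usable without a browser installed:

```python
BASE = 'https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/'

def build_page_url(page: int) -> str:
    """Return the URL for a given results page (the site paginates via ?pagina=N)."""
    return BASE if page == 1 else BASE + '?pagina=%d' % page

def scrape_pages(max_pages: int = 14):
    """Load each results page in a real browser and count rendered cards."""
    # Imports kept local: running this function requires Selenium and a
    # Chrome/chromedriver install, but importing this module does not.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    cards_per_page = []
    try:
        for page in range(1, max_pages + 1):
            driver.get(build_page_url(page))
            # Wait until at least one result card has actually rendered,
            # instead of reading the HTML immediately
            WebDriverWait(driver, 15).until(
                EC.presence_of_all_elements_located(
                    (By.CSS_SELECTOR, 'div.js-card-selector')))
            cards = driver.find_elements(By.CSS_SELECTOR, 'div.js-card-selector')
            cards_per_page.append(len(cards))
    finally:
        driver.quit()
    return cards_per_page
```

Navigating by URL sidesteps the fragile "click the page number" step; if you do need to click, locate the button with `driver.find_element` and wrap the click in the same `WebDriverWait` pattern so the next page's data is loaded before you parse it.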