How to scrape a webpage that uses JavaScript?

I'm using requests and BeautifulSoup to scrape data from a real estate website. It has several numbered "pages" that each show dozens of apartments. I wrote a loop that runs across all these pages and collects data from the apartments, but unfortunately the site uses JavaScript, and because of that the code only returns the apartments from the first page. I also tried something with Selenium, but ran into the same problem.

Thanks a lot in advance for any suggestions!

Here's the code:

import time
from random import randint

import requests
from bs4 import BeautifulSoup

# Create empty lists to append the data scraped from the URL.
# The number of lists depends on the number of features you want to extract.

lista_preco = []
lista_endereco = []
lista_tamanho = []
lista_quartos = []
lista_banheiros = []
lista_vagas = []
lista_condominio = []
lista_amenidades = []
lista_fotos = []
lista_sites = []

n_pages = 0

for page in range(1, 15):
    n_pages += 1
    url = "https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/"+'?pagina='+str(page)
    url = requests.get(url)
    soup = BeautifulSoup(url.content, 'html.parser')
    house_containers = soup.find_all('div', {'class' :'js-card-selector'})
    if house_containers:
        for container in house_containers:
            
            # Price
            price = container.find_all('section', class_='property-card__values')[0].text
            try:
                price = int(price[:price.find('C')].replace('R$', '').replace('.','').strip())
            except ValueError:
                price = 0
            lista_preco.append(price)

            # Zone
            location = container.find_all('span', class_='property-card__address')[0].text
            location = location.strip()
            lista_endereco.append(location)

            # Size
            size = container.find_all('span', class_='property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area')[0].text
            if '-' not in size:
                size = int(size[:size.find('m')].replace(',','').strip())
            else:
                size = int(size[:size.find('-')].replace(',','').strip())
            lista_tamanho.append(size)

            # Rooms
            quartos = container.find_all('li', class_='property-card__detail-item property-card__detail-room js-property-detail-rooms')[0].text
            quartos = quartos[:quartos.find('Q')].strip()
            if '-' in quartos:
                quartos = quartos[:quartos.find('-')].strip()
            lista_quartos.append(int(quartos))
            
            # Bathrooms
            banheiros = container.find_all('li', class_='property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom')[0].text
            banheiros = banheiros[:banheiros.find('B')].strip()
            if '-' in banheiros:
                banheiros = banheiros[:banheiros.find('-')].strip()
            lista_banheiros.append(int(banheiros))
            
            # Garage
            vagas = container.find_all('li', class_='property-card__detail-item property-card__detail-garage js-property-detail-garages')[0].text
            vagas = vagas[:vagas.find('V')].strip()
            if '--' in vagas:
                vagas = '0'
            lista_vagas.append(int(vagas))

            # Condomínio
            condominio = container.find_all('section', class_='property-card__values')[0].text
            try:
                condominio = int(condominio[condominio.rfind('R$'):].replace('R$','').replace('.','').strip())
            except ValueError:
                condominio = 0
            lista_condominio.append(condominio)

            # Amenidades
            try:
                amenidades = container.find_all('ul', class_='property-card__amenities')[0].text
                amenidades = amenidades.split()
            except IndexError:
                amenidades = 'Zero'
            lista_amenidades.append(amenidades)

            # url
            link = 'https://www.vivareal.com.br/' + container.find_all('a')[0].get('href')[1:-1]
            lista_sites.append(link)

            # image
            #p = str(container.find_all('img')[0])
            #p

            #2x size thumbnail

            #imgurl = p[p.find('https'):p.rfind('data-src')]
            #imgurl.replace('"', '').strip()
            #lista_fotos.append(imgurl)
    else:
        break
    
    time.sleep(randint(1,2))
    
print('You scraped {} pages containing {} properties.'.format(n_pages, len(lista_preco)))

You DO have a choice. There is no need to use Selenium, as you can access the data through the API.

There is a restriction on the site that only allows you to paginate through at most 10,000 listings. The API returns far more data than you need, so it's up to you to look through that JSON response and see if there's anything more you want to add.
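To see what else is in that response, one option is to request a single listing and pretty-print it. This is a minimal sketch that reuses the endpoint and headers from the full script below, with the payload trimmed to a few fields (the trimming is my own; the complete parameter set is in the script that follows):

import json
import requests

url = 'https://glue-api.vivareal.com/v2/listings'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
           'x-domain': 'www.vivareal.com.br'}
# Trimmed-down payload: just enough to get one Salvador apartment back.
params = {'addressCity': 'Salvador', 'addressState': 'Bahia',
          'business': 'SALE', 'unitTypesV3': 'APARTMENT',
          'usageTypes': 'RESIDENTIAL', 'size': '1', 'from': '0'}

data = requests.get(url, headers=headers, params=params).json()
# Pretty-print the first listing to inspect every available field.
print(json.dumps(data['search']['result']['listings'][0]['listing'],
                 indent=2, ensure_ascii=False))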

Code:

import pandas as pd
import requests
import math
import time
import random

url = 'https://glue-api.vivareal.com/v2/listings'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
           'x-domain': 'www.vivareal.com.br'}
payload = {
'addressCity': 'Salvador',
'addressLocationId': 'BR>Bahia>NULL>Salvador',
'addressNeighborhood': '',
'addressState': 'Bahia',
'addressCountry': 'Brasil',
'addressStreet': '',
'addressZone': '',
'addressPointLat': '-12.977738',
'addressPointLon': '-38.501636',
'business': 'SALE',
'facets': 'amenities',
'unitTypes': 'APARTMENT',
'unitSubTypes': 'UnitSubType_NONE,DUPLEX,LOFT,STUDIO,TRIPLEX',
'unitTypesV3': 'APARTMENT',
'usageTypes': 'RESIDENTIAL',
'listingType': 'USED',
'parentId': 'null',
'categoryPage': 'RESULT',
'size': '350',
'from': '0',
'q': '',
'developmentsSize': '5',
'__vt': '',
'levels': 'CITY,UNIT_TYPE',
'ref': '/venda/bahia/salvador/apartamento_residencial/',
'pointRadius':''}


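# The API caps pagination at 10,000 results per query, so the function below
# walks the market in price windows: it keeps widening priceMax in 25,000
# steps while the window still returns fewer than 10,000 listings, then pages
# through the last window that fit and starts the next window just above it
# (previous_priceMax + 1).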
def get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData):
    randInt = random.uniform(5.1, 7.9)
    payload.update({'from':'0'})
    #time.sleep(randInt)
    if priceMax > 2500000:
        priceMax = 100000000
    payload.update({'priceMin':'%s' %priceMin,'priceMax':'%s' %priceMax})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    listings_count = jsonData['search']['totalCount']
    
    if listings_count < 10000:
        if priceMax < 100000000:
            print ('Price range %s - %s returns %s listings.' %(priceMin, priceMax, listings_count))
            previous_jsonData = jsonData
            previous_priceMax = priceMax
            priceMax += 25000
            listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData = get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData)
        else:
            previous_jsonData = jsonData
            previous_priceMax = 100000000
        
    priceMin = previous_priceMax + 1
    priceMax = priceMin + 250000 - 1
    return listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData
    

rows = []
priceMin = 1
priceMax = 250000
finished = False
acquired = []
while not finished:
    randInt = random.uniform(5.1, 7.9)
    listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData = get_num_of_listings(priceMin, priceMax, payload, url, None, None, None)
    total_pages = math.ceil(previous_jsonData['search']['totalCount'] / 350)
        
    for page in range(1, total_pages+1):
        if page == 1:
            idx=0
            jsonData = previous_jsonData
        else:
            idx = 350 * (page - 1)  # offset of the first listing on this page
            payload.update({'from':'%s' %idx})
            if idx == 9800:
                payload.update({'size':200})
            else:
                payload.update({'size':350})
             
            if idx > 9800:
                continue
            #time.sleep(randInt)
            jsonData = requests.get(url, headers=headers, params=payload).json()
        
        listings = jsonData['search']['result']['listings']
        for listing in listings:
            listingId = listing['listing']['id']
            if listingId in acquired:
                continue
            zone = listing['listing']['address']['zone']
            size = listing['listing']['usableAreas'][0]
            bedrooms = listing['listing']['bedrooms'][0]
            bathrooms = listing['listing']['bathrooms'][0]
            if listing['listing']['parkingSpaces'] != []:
                parking = listing['listing']['parkingSpaces'][0]
            else:
                parking = None
            price = listing['listing']['pricingInfos'][0]['price']
            try:
                condoFee =  listing['listing']['pricingInfos'][0]['monthlyCondoFee']
            except KeyError:
                condoFee =  None
            amenities = listing['listing']['amenities']
            listingUrl = 'https://www.vivareal.com.br' + listing['link']['href']
                
            row = {
            'Id':listingId,
            'Zone' : zone,
            'Size' : size,
            'Bedrooms' : bedrooms,
            'Bathrooms': bathrooms,
            'Garage' : parking,
            'Price': price,
            'Condominio' : condoFee,
            'Amenidades' : amenities,
            'url' : listingUrl}
            
            acquired.append(listingId)

            rows.append(row)
        print('Page %s of %s' %(page, total_pages))
    if priceMax > 100000000:
        print('Done')
        finished = True
    
df = pd.DataFrame(rows)
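# The frame can then be saved in one line, e.g.:
# df.to_csv('vivareal_salvador.csv', index=False)  # filename is illustrative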

Output:

IPdb [3]: print(df)
               Id  ...                                                url
0      2511396476  ...  https://www.vivareal.com.br/imovel/apartamento...
1      2494354474  ...  https://www.vivareal.com.br/imovel/apartamento...
2      2504461896  ...  https://www.vivareal.com.br/imovel/apartamento...
3      2508574459  ...  https://www.vivareal.com.br/imovel/apartamento...
4      2511489082  ...  https://www.vivareal.com.br/imovel/apartamento...
          ...  ...                                                ...
26244    94618731  ...  https://www.vivareal.com.br/imovel/apartamento...
26245    93437597  ...  https://www.vivareal.com.br/imovel/apartamento...
26246    79341843  ...  https://www.vivareal.com.br/imovel/apartamento...
26247  2455978575  ...  https://www.vivareal.com.br/imovel/apartamento...
26248  2509913182  ...  https://www.vivareal.com.br/imovel/apartamento...

[26249 rows x 10 columns]

Unfortunately, I believe there is NO choice for you here. The reason is that with new frontend technologies the HTML is rendered asynchronously, and JavaScript needs a "real" environment to run and load the page. For example, with Ajax you need a real browser (Chrome, Firefox) to make it work. So my suggestion is to keep digging into Selenium and mimic the click event on each page number (1..2..3, until the end), wait until the data has loaded, then read the HTML and extract the data you need. Regards.
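For completeness, here is a minimal sketch of that click-and-wait approach. The card selector is taken from the question's code; the next-page button selector is an assumption and needs to be verified in the browser's inspector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/')
wait = WebDriverWait(driver, 15)

for page in range(1, 15):
    # Wait until the listing cards of the current page are present.
    cards = wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, 'div.js-card-selector')))  # selector from the question's code
    for card in cards:
        print(card.text.splitlines()[0])  # first line of the card text, e.g. the price
    try:
        # This next-page selector is an assumption; verify it in the inspector.
        next_btn = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, 'button[title="Próxima página"]')))
    except TimeoutException:
        break  # no next button means the last page was reached
    driver.execute_script('arguments[0].click();', next_btn)
    # Wait for the old cards to go stale so the next read sees the new page.
    wait.until(EC.staleness_of(cards[0]))

driver.quit()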
