
Beautiful Soup only returns the first 10 listings when using soup.select(). What could be the issue?

import requests
import lxml
from bs4 import BeautifulSoup

LISTINGS_URL = 'https://shorturl.at/ceoAB'
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/95.0.4638.69 Safari/537.36 ",
        "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(LISTINGS_URL, headers=headers)
listings = response.text


class DataScraper:
    def __init__(self):
        # Parse the HTML that was downloaded above.
        self.soup = BeautifulSoup(listings, "html.parser")

    def get_links(self):
        # Print every anchor tag found inside a listing card.
        for a in self.soup.select(".list-card-top a"):
            print(a)
        # listing_text = [link.getText() for link in links]

    def get_address(self):
        pass

    def get_prices(self):
        pass

I have used the correct CSS selectors, and I have also tried finding the elements with attrs in find_all(). What I am trying to achieve is to parse all the anchor tags and then fetch the href links for the specific listings; however, it only returns the first 10.

You can make a GET request to this endpoint and fetch the data you need.

https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState={"pagination":{"currentPage":1},"mapBounds":{"west":-123.33522421253342,"east":-121.44008261097092,"south":37.041584214606814,"north":38.39290664366326},"isMapVisible":false,"filterState":{"price":{"max":872627},"beds":{"min":1},"isForSaleForeclosure":{"value":false},"monthlyPayment":{"max":3000},"isAuction":{"value":false},"isNewConstruction":{"value":false},"isForRent":{"value":true},"isForSaleByOwner":{"value":false},"isComingSoon":{"value":false},"isForSaleByAgent":{"value":false}},"isListVisible":true,"mapZoom":9}&wants={"cat1":["listResults"]}

Change the "currentPage" value inside the searchQueryState parameter of the above URL to fetch data from different pages.

Since the response is JSON, you can easily parse it and extract the information using the json module.
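As a rough sketch of the paging idea above, the query state can be kept as a Python dict and serialized into the URL for each page. The helper names build_search_url and fetch_page are hypothetical, and this assumes the endpoint still accepts these parameters and answers with JSON when sent the headers from the question; the site may change or block it.

```python
import copy
import json
from urllib.parse import urlencode

BASE_URL = "https://www.zillow.com/search/GetSearchPageState.htm"

# The searchQueryState from the URL above, written as a Python dict.
SEARCH_QUERY_STATE = {
    "pagination": {"currentPage": 1},
    "mapBounds": {
        "west": -123.33522421253342,
        "east": -121.44008261097092,
        "south": 37.041584214606814,
        "north": 38.39290664366326,
    },
    "isMapVisible": False,
    "filterState": {
        "price": {"max": 872627},
        "beds": {"min": 1},
        "isForSaleForeclosure": {"value": False},
        "monthlyPayment": {"max": 3000},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    },
    "isListVisible": True,
    "mapZoom": 9,
}


def build_search_url(page):
    """Return the search URL for the given page number (hypothetical helper)."""
    state = copy.deepcopy(SEARCH_QUERY_STATE)  # leave the template untouched
    state["pagination"]["currentPage"] = page
    params = {
        "searchQueryState": json.dumps(state, separators=(",", ":")),
        "wants": json.dumps({"cat1": ["listResults"]}, separators=(",", ":")),
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_page(page, headers):
    """Download one page of results and return the decoded JSON (hypothetical helper)."""
    import requests  # imported here so build_search_url works without requests installed

    response = requests.get(build_search_url(page), headers=headers)
    response.raise_for_status()
    return response.json()
```

Calling fetch_page(2, headers) with the headers dict from the question would then request the second page, assuming the endpoint responds as described.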

The website is probably using lazy loading, so you can either use something like Selenium/Puppeteer or use the website's API (which will be easier). To do this, make a GET request to a URL that starts with https://www.zillow.com/search/GetSearchPageState.htm (you can see it in your browser's dev tools), parse the JSON response, and you will find the href link under cat1.searchResults.listResults[index in array].detailUrl.
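A minimal sketch of reading the links out of the parsed response, assuming it has the cat1.searchResults.listResults shape described above. The function name extract_detail_urls and the sample payload are made up for illustration; the real response will contain many more fields.

```python
def extract_detail_urls(payload):
    """Collect detailUrl values from a parsed search response (hypothetical helper)."""
    results = (
        payload.get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )
    # Skip entries that do not carry a detailUrl instead of raising KeyError.
    return [item["detailUrl"] for item in results if "detailUrl" in item]


# Made-up payload with the assumed shape, used only to show the call.
sample = {
    "cat1": {
        "searchResults": {
            "listResults": [
                {"detailUrl": "https://example.com/listing/1"},
                {"detailUrl": "https://example.com/listing/2"},
                {"address": "entry without a link"},
            ]
        }
    }
}

for url in extract_detail_urls(sample):
    print(url)
```

Using .get with defaults means a response missing any of these keys gives an empty list rather than an exception, which makes it easier to notice when the response shape has changed.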
