
Scrape page links

I want to scrape the agent links from this page: https://kw.com/agent/search/IL/Chicago. But when I inspect the page, I don't find any div classes or a href links for the agents. I don't understand which function I need to call to scrape these 652 agent links.

My code:

import requests
from bs4 import BeautifulSoup

url = 'https://kw.com/agent/search/IL/Chicago'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a'):
    print(link.get('href'))

This code works for other pages, but this site looks complicated to me. How can I collect the links from this site?

import requests
import json

url = "https://api-endpoint.cons-prod-us-central1.kw.com/graphql"
payload = {
    "operationName": "searchAgentsQuery",
    "variables": {
        "searchCriteria": {
            "searchTerms": {
                "param1": "IL",
                "param2": "Chicago"
            }
        },
        "first": 652,  # Change this to change the number of agents returned
        "after": "",
        "queryId": "0.5045631892889886"
    },
    "query": "query searchAgentsQuery($searchCriteria: AgentSearchCriteriaInput, $first: Int, $after: String) {\n  SearchAgentQuery(searchCriteria: $searchCriteria) {\n    result {\n      agents(first: $first, after: $after) {\n        edges {\n          node {\n            ...AgentProfileFragment\n            __typename\n          }\n          __typename\n        }\n        pageInfo {\n          ...PageInfoFragment\n          __typename\n        }\n        totalCount\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment PageInfoFragment on PageInfo {\n  endCursor\n  hasNextPage\n  __typename\n}\n\nfragment AgentProfileFragment on AgentProfileType {\n  id\n  name {\n    full\n    given\n    initials\n    __typename\n  }\n  image\n  location {\n    address {\n      state\n      city\n      __typename\n    }\n    __typename\n  }\n  realEstateEntity {\n    name\n    __typename\n  }\n  specialties\n  languages\n  isAgentLuxuryEnabled\n  phone {\n    entries {\n      ... on ContactSetEntryMobile {\n        number\n        __typename\n      }\n      ... on ContactSetEntryEmail {\n        email\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n  agentLicenses {\n    licenseNumber\n    state\n    __typename\n  }\n  marketCenter {\n    market_center_name\n    market_center_address1\n    market_center_address2\n    __typename\n  }\n  __typename\n}\n"
}

headers = {
    "x-datadog-trace-id":"3293080049736028717",
    "x-datadog-origin":"rum",
    "x-datadog-sampling-priority":"1",
    "x-datadog-parent-id":"8148745964431063043",
    "x-shared-secret":"MjFydHQ0dndjM3ZAI0ZHQCQkI0BHIyM=",
    "x-datadog-sampled":"1"
}


result = requests.post(url, json=payload, headers=headers)
obj = json.loads(result.text)
first_edge = obj['data']['SearchAgentQuery']['result']['agents']['edges'][0]
print(first_edge['node']['name']['full'])

This code grabs all 652 agents at once and prints the full name of the first agent. One agent looks like this in the response:

"node": {
        "id": "UPA-6762861760017670144-8",
        "name": {
            "full": "Maryam Abdi",
            "given": "Maryam",
            "initials": "MA",
            "__typename": "BasePersonNameType"
        },
        "image": "https://storage.googleapis.com/attachment-prod-e2ad/754026/c7grro22skjb1jt7u73g.jpg",
        "location": {
            "address": {
                "state": "IL",
                "city": "Chicago",
                "__typename": "AddressType"
            },
            "__typename": "LocatorType"
        },
        "realEstateEntity": null,
        "specialties": [],
        "languages": [
            "English"
        ],
        "isAgentLuxuryEnabled": false,
        "phone": {
            "entries": [
                {
                    "__typename": "ContactSetEntryLandline"
                },
                {
                    "number": "4253753652",
                    "__typename": "ContactSetEntryMobile"
                },
                {
                    "email": "maryam.abdi@kw.com",
                    "__typename": "ContactSetEntryEmail"
                }
            ],
            "__typename": "ContactSetType"
        },
        "agentLicenses": [
            {
                "licenseNumber": "PENDING",
                "state": "IL",
                "__typename": "AgentLicense"
            }
        ],
        "marketCenter": {
            "market_center_name": "KW ONEChicago ",
            "market_center_address1": "2211 N. Elston Suite 104",
            "market_center_address2": "Chicago IL 60614",
            "__typename": "MarketingProfileMarketCenterType"
        },
        "__typename": "AgentProfileType"
    },
    "__typename": "AgentProfileEdge"
}
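Note that the entries list under phone mixes landline, mobile, and email objects, which you can tell apart by their __typename. Here is a minimal sketch of pulling the mobile number and email out of the sample node above:

```python
# Sample node trimmed down from the response shown above.
node = {
    "id": "UPA-6762861760017670144-8",
    "name": {"full": "Maryam Abdi"},
    "phone": {
        "entries": [
            {"__typename": "ContactSetEntryLandline"},
            {"number": "4253753652", "__typename": "ContactSetEntryMobile"},
            {"email": "maryam.abdi@kw.com", "__typename": "ContactSetEntryEmail"},
        ]
    },
}

# Pick the first entry of each type, falling back to None if it is absent.
mobile = next(
    (e["number"] for e in node["phone"]["entries"]
     if e["__typename"] == "ContactSetEntryMobile"),
    None,
)
email = next(
    (e["email"] for e in node["phone"]["entries"]
     if e["__typename"] == "ContactSetEntryEmail"),
    None,
)
print(mobile, email)  # 4253753652 maryam.abdi@kw.com
```

Using a default of None keeps the extraction from crashing on agents that have no mobile number or email listed.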

To find out which requests your browser makes, open the Network tab of your browser's developer tools and filter by Fetch/XHR. One of the requests is called graphql, which is the important one.

I can recommend Postman and its associated browser extension.

If you don't want to grab all agents at once, or you don't know how many there are: every response from the graphql endpoint contains a pageInfo object with the fields endCursor and hasNextPage.

"pageInfo": {
    "endCursor": "49",
    "hasNextPage": true,
    "__typename": "PageInfo"
}

Now you can check whether hasNextPage is true; if so, make another request. In each subsequent request, the after field in your payload must be set to the endCursor value of the previous response.

For example:

result = requests.post(url, json=payload, headers=headers)
obj = json.loads(result.text)
pageInfo = obj['data']['SearchAgentQuery']['result']['agents']['pageInfo']
while pageInfo['hasNextPage']:
    # do something with the data
    # For example print full names:
    edges = obj['data']['SearchAgentQuery']['result']['agents']['edges']
    for edge in edges:
        print(edge['node']['name']['full'])
    # update the after value
    after = pageInfo['endCursor']
    payload['variables']['after'] = after
    # new request
    result = requests.post(url, json=payload, headers=headers)
    obj = json.loads(result.text)
    pageInfo = obj['data']['SearchAgentQuery']['result']['agents']['pageInfo']

# do something with the data of the last request

Obviously, you need to change the first field in the payload to a smaller number, e.g. 50. Output:

Maryam Abdi
Ariana Abercrumbie
STEVE ACOBA
McKinley Adams
Lenora Adds
Maruf Adeyemo
Angela Akins
Henriette Akins
Melanie Alcaraz
Kira Alexander
Biekhal Alkhalifa
Maria Allen
Stephanie Allen
Alejandro Almaraz
Mark Altamore
Adam Alvanos
Peter Ambrosino
...

You can extract all the names from the page by iterating over the agent cards.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://kw.com/agent/search/IL/Chicago'
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"

s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
driver.implicitly_wait(10)  # the agent cards are rendered by JavaScript, so wait for them
driver.get(url)
agent_names = driver.find_elements(By.CLASS_NAME, 'AgentCard')
for agent in agent_names:
    print(agent.find_element(By.CLASS_NAME, 'AgentCard__name').text)

This gives us

Maryam Abdi
Ariana Abercrumbie
STEVE ACOBA
McKinley Adams
Lenora Adds
Maruf Adeyemo
Angela Akins
Henriette Akins
Melanie Alcaraz
Kira Alexander
Biekhal Alkhalifa
Maria Allen
Stephanie Allen
Alejandro Almaraz
Mark Altamore
Adam Alvanos
Peter Ambrosino
Miguel Amesquita
Deval Amin
Steve Anderson
Alicia Andrade
Brittany Andrews
Sandy Andros
Christopher Anthony
Joel Anthony
Perry Apawu
Mark Apel
Niko Apostal
Vashti Araia
Joe Ariano
Charlie Arnold
Harjit Ashta
John P. Astorina
Sharron Atterberry
Dave Auffarth
David Augustyn
Joseph Avola
Ravid Ayala
Shavell Banks
Julie Bapst
Maria Barba
Christina Barbaro
Deonte Barbee
Adam Bartosic
Stephanie Basa
Bart Basinski
Nicole Basso
Peter Batarseh
Nicholas Batsakis
Steven Battista

The requests module fetches only the raw HTML; it does not execute the JavaScript that renders the agent links. So you'll need to use selenium instead.

For instance:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium.webdriver.common.by import By

url = 'https://kw.com/agent/search/IL/Chicago'

chrome_options = Options()
chrome_options.add_argument('--headless')

chrome_path = which('chromedriver')

# executable_path is deprecated in Selenium 4; pass a Service instead
driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)

driver.get(url)

links = driver.find_elements(By.TAG_NAME, value="a")
for link in links:
    print(link.get_attribute('href'))

Actually, all the required data is generated by an API. Each agent link/URL contains a unique id, and this id combined with the domain name forms the agent's detail-page link.

Example:

import requests

api_url = "https://api-endpoint.cons-prod-us-central1.kw.com/graphql"
data={"operationName":"searchAgentsQuery","variables":{"searchCriteria":{"searchTerms":{"param1":"IL","param2":"Chicago"}},"first":50,"after":"99","queryId":"0.8691595723322416"},"query":"query searchAgentsQuery($searchCriteria: AgentSearchCriteriaInput, $first: Int, $after: String) {\n  SearchAgentQuery(searchCriteria: $searchCriteria) {\n    result {\n      agents(first: $first, after: $after) {\n        edges {\n          node {\n            ...AgentProfileFragment\n            __typename\n          }\n          __typename\n        }\n        pageInfo {\n          ...PageInfoFragment\n          __typename\n        }\n        totalCount\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment PageInfoFragment on PageInfo {\n  endCursor\n  hasNextPage\n  __typename\n}\n\nfragment AgentProfileFragment on AgentProfileType {\n  id\n  name {\n    full\n    given\n    initials\n    __typename\n  }\n  image\n  location {\n    address {\n      state\n      city\n      __typename\n    }\n    __typename\n  }\n  realEstateEntity {\n    name\n    __typename\n  }\n  specialties\n  languages\n  isAgentLuxuryEnabled\n  phone {\n    entries {\n      ... on ContactSetEntryMobile {\n        number\n        __typename\n      }\n      ... on ContactSetEntryEmail {\n        email\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n  agentLicenses {\n    licenseNumber\n    state\n    __typename\n  }\n  marketCenter {\n    market_center_name\n    market_center_address1\n    market_center_address2\n    __typename\n  }\n  __typename\n}\n"}
headers={
        'content-type': 'application/json',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'x-datadog-origin': 'rum',
        'x-datadog-parent-id': '5420198475190660541',
        'x-datadog-sampled': '1',
        'x-datadog-sampling-priority': '1',
        'x-datadog-trace-id': '1837163169752685118',
        'x-shared-secret': 'MjFydHQ0dndjM3ZAI0ZHQCQkI0BHIyM='
}

res = requests.post(api_url, headers=headers, json=data)
edges = res.json()['data']['SearchAgentQuery']['result']['agents']['edges']

for item in edges:
    link = 'https://kw.com/agent/' + item['node']['id']
    print(link)

Output:

https://kw.com/agent/UPA-6587385404419399681-8
https://kw.com/agent/UPA-6587385313789222917-3
https://kw.com/agent/UPA-6704789234247561216-6
https://kw.com/agent/UPA-6587385427490459656-4
https://kw.com/agent/UPA-6587385454284918792-0
https://kw.com/agent/UPA-6882009464351350784-8
https://kw.com/agent/UPA-6937439716674322432-5
https://kw.com/agent/UPA-6587385379476373510-1
https://kw.com/agent/UPA-6853411032351416320-2
https://kw.com/agent/UPA-6587385065789456390-4
https://kw.com/agent/UPA-6587385175436890114-3
https://kw.com/agent/UPA-6942951019140222976-1
https://kw.com/agent/UPA-6808491123018551296-7
https://kw.com/agent/UPA-6587385273946116100-8
https://kw.com/agent/UPA-6587385281007677447-9
https://kw.com/agent/UPA-6592268954554945544-5
https://kw.com/agent/UPA-6587385270364864517-7
https://kw.com/agent/UPA-6856325267405185024-3
https://kw.com/agent/UPA-6804158392167718912-3
https://kw.com/agent/UPA-6638843865929490435-1
https://kw.com/agent/UPA-6587384999272361984-6
https://kw.com/agent/UPA-6592267095708119045-4
https://kw.com/agent/UPA-6587385271389274119-4
https://kw.com/agent/UPA-6587385271385079815-8
https://kw.com/agent/UPA-6587385288161681409-1
https://kw.com/agent/UPA-6587385375965011973-7
https://kw.com/agent/UPA-6587385274994008066-1
https://kw.com/agent/UPA-6913250263682408448-6
https://kw.com/agent/UPA-6587385272597565443-9
https://kw.com/agent/UPA-6859526404702093312-9
https://kw.com/agent/UPA-6587385390518407175-2
https://kw.com/agent/UPA-6587385436077776899-8
https://kw.com/agent/UPA-6587384956740640770-9
https://kw.com/agent/UPA-6587385297339674632-1
https://kw.com/agent/UPA-6587385390593904641-1
https://kw.com/agent/UPA-6811013526642786304-3
https://kw.com/agent/UPA-6932834317516042240-9
https://kw.com/agent/UPA-6587385437068947458-5
https://kw.com/agent/UPA-6587385380989808647-6
https://kw.com/agent/UPA-6892926376478015488-5
https://kw.com/agent/UPA-6905262704995926016-2
https://kw.com/agent/UPA-6592947303925784578-6
https://kw.com/agent/UPA-6587385393920495624-5
https://kw.com/agent/UPA-6783788552269369344-7
https://kw.com/agent/UPA-6710285049427382272-8
https://kw.com/agent/UPA-6844700377378430976-0
https://kw.com/agent/UPA-6934540598372548608-6
https://kw.com/agent/UPA-6711387287014834176-1
https://kw.com/agent/UPA-6587385367301132290-0
https://kw.com/agent/UPA-6714648183099023360-3
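The two answers above can be combined: paginate with first/after as described earlier, and build the agent URL from each node's id on every page. A minimal sketch, assuming the same payload and headers shown in the answers above:

```python
import requests

API_URL = "https://api-endpoint.cons-prod-us-central1.kw.com/graphql"


def agent_links(edges):
    """Build profile URLs from one page of edges, as in the last answer."""
    return ["https://kw.com/agent/" + e["node"]["id"] for e in edges]


def fetch_all_links(payload, headers):
    """Follow pageInfo.endCursor until hasNextPage is False, collecting links.

    Expects the same GraphQL payload and headers shown in the answers above.
    """
    links = []
    while True:
        obj = requests.post(API_URL, json=payload, headers=headers).json()
        agents = obj["data"]["SearchAgentQuery"]["result"]["agents"]
        links.extend(agent_links(agents["edges"]))
        page_info = agents["pageInfo"]
        if not page_info["hasNextPage"]:
            return links
        # Feed the cursor of this page into the next request.
        payload["variables"]["after"] = page_info["endCursor"]
```

Putting the last page inside the loop (instead of handling it after the loop, as in the earlier example) means every page is processed the same way.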
