Can't scrape a graphql page using requests

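I'm trying to scrape the names of the firms and their corresponding links from a webpage using the requests module.

Although the content is highly dynamic, I can see that it is available within the curly braces next to window.props.

So I thought I'd dig out that portion and process it with json, but I see \u0022 characters instead of quotes. This is what I mean: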

{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, \u0022name\u0022:

I've tried with:

import re
import json
import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    items = re.findall(r'window.props[^"]+(.*?);',r.text)[0].strip('"').replace('\u0022', '\'')
    print(items)

How can I scrape the names and links of the different firms from that webpage using requests?
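On the \u0022 characters: they are Unicode escapes for the double quote (and \u002D for the hyphen), so the text looks like JSON that was escaped for embedding in a JavaScript string. Below is a minimal sketch of decoding such text with the standard json module. The helper name decode_escaped_json is made up for illustration, and the sample string is hypothetical: it begins like the snippet in the question, but the firm name and the closing brackets are invented so that it parses.

```python
import json


def decode_escaped_json(raw: str) -> dict:
    """Decode a JSON document whose special characters are written as \\uXXXX escapes."""
    # Wrapping the text in double quotes makes it a valid JSON string literal,
    # so the first json.loads() resolves escapes such as \u0022 and \u002D.
    unescaped = json.loads(f'"{raw}"')
    # The second json.loads() parses the now-readable JSON text.
    return json.loads(unescaped)


# Hypothetical sample: starts like the snippet in the question, but the firm
# name and the closing brackets are invented so the example is complete.
sample = r'{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, \u0022name\u0022: \u0022Example Firm\u0022}]}'

props = decode_escaped_json(sample)
for firm in props["firms"]:
    print(firm["name"], firm["slug"])
```

This only works when the extracted text contains no literal double quotes or raw control characters, which holds for the snippet shown. Pulling that text out of the page still needs a pattern matched to the actual markup, which the question does not show in full.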

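Well, that was interesting.

You're dealing with a page powered by GraphQL, so you have to emulate the request correctly.

Also, they want you to send a Referer header and a CSRF token. The token can easily be extracted from the initial HTML and reused in the subsequent requests.

Here's my take: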

import time

import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'
query = """{ allFirmsWithProjects( first: 6, after: "6", firmType: "Architecture / Design Firm", firmName: "All Firm Names", projectType: "All Project Types", projectLocation: "All Project Locations", firmLocation: "All Firm Locations", orderBy: "recently-featured", affiliationSlug: "", ) { firms: edges { cursor node { index id: firmId slug: firmSlug name: firmName projectsCount: firmProjectsCount lastProjectDate: firmLastProjectDate media: firmLogoUrl projects { edges { node { slug: slug media: heroUrl mediaId: heroId isHiddenFromListings } } } } } pageInfo { hasNextPage endCursor } totalCount } }"""


def query_graphql(cursor: str = "6") -> dict:
    # Swap the placeholder cursor in the query for the requested one.
    q = query.replace('after: "6"', f'after: "{cursor}"')
    return s.post(
        "https://architizer.com/api/v3.0/graphql",
        json={"query": q},
    ).json()


def has_next_page(graphql_response: dict) -> bool:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["hasNextPage"]


def get_next_page(graphql_response: dict) -> str:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["endCursor"]


def get_firms_data(graphql_response: dict) -> list:
    return graphql_response["data"]["allFirmsWithProjects"]["firms"]


def parse_firms_data(firms: list) -> str:
    return "\n".join(firm["node"]["name"] for firm in firms)


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


with requests.Session() as s:
    s.headers["user-agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    s.headers["referer"] = "https://architizer.com/firms/"

    csrf_token = BeautifulSoup(
        s.get(link).text, "html.parser"
    ).find("input", {"name": "csrfmiddlewaretoken"})["value"]

    s.headers.update({"x-csrftoken": csrf_token})

    response = query_graphql()
    while True:
        # Print the current page before checking for more, so the last page is not skipped.
        print(parse_firms_data(get_firms_data(response)))
        if not has_next_page(response):
            break
        wait_a_bit()
        response = query_graphql(get_next_page(response))

This should output, just for the sake of this example, the names of the firms:

Brooks + Scarpa Architects
Studio Saxe
NiMa Design
Best Practice Architecture
Gensler
Inca Hernandez
kaa studio
Taller Sintesis
Coryn Kempster and Julia Jamrozik
Franklin Azzi Architecture
Wittman Estes
Masfernandez Arquitectos
MATIAS LOPEZ LLOVET
SRG Partnership, Inc.
GANA Arquitectura
Meyer & Associates Architects, Urban Designers
Steyn Studio
BGLA architecture | urban design

and so on ...
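The question also asked for links, and the query returns a slug alongside each name. Here is a rough sketch of pairing the two. It assumes firm pages live at https://architizer.com/firms/<slug>/, which is a guess at the URL pattern and not something confirmed above. The helper parse_firm_links and the sample slugs are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: firm profile pages live under this base URL, keyed by slug.
FIRMS_BASE_URL = "https://architizer.com/firms/"


def parse_firm_links(firms: list) -> list:
    """Return (name, url) pairs for each edge in the "firms" list."""
    pairs = []
    for firm in firms:
        node = firm["node"]
        # The trailing slash is part of the assumed URL pattern.
        pairs.append((node["name"], urljoin(FIRMS_BASE_URL, node["slug"] + "/")))
    return pairs


# Hypothetical edges shaped like the aliases used in the query (name, slug).
sample_firms = [
    {"node": {"name": "Studio Saxe", "slug": "studio-saxe"}},
    {"node": {"name": "Gensler", "slug": "gensler"}},
]

for name, url in parse_firm_links(sample_firms):
    print(f"{name}: {url}")
```

In the script above, passing get_firms_data(response) to this helper in place of parse_firms_data would give the pairs for each page.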

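Can you help me with a similar one? I'm trying to scrape the page https://www.homedepot.com/b/Appliances-Washers-Dryers-Washing-Machines-Portable-Washing-Machines/N-5yc1vZc496

I tried sending a POST request to "https://www.homedepot.com/federation-gateway/graphql?opname=searchModel"

but I always get a 403 Forbidden error.

They also use GraphQL, and I tried to apply the same principle, but I couldn't find the token in the HTML. I also noticed that the cookies change every time I refresh the page.

I hope you can take some time to look into this and give me some guidance. Thank you, Ahmed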

