Failed to scrape tabular content from a webpage using requests module
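I am trying to scrape tabular content from a webpage using the requests module. The content of that page is highly dynamic; however, according to dev tools, it can be accessed through an API. I tried to mimic the same POST request with the appropriate parameters, but I always get status 403.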
import requests
from pprint import pprint
start_url = 'https://opensea.io/rankings'
link = 'https://api.opensea.io/graphql/'
payload = {"id":"rankingsQuery","query":"query rankingsQuery(\n $chain: [ChainScalar!]\n $count: Int!\n $cursor: String\n $sortBy: CollectionSort\n $parents: [CollectionSlug!]\n $createdAfter: DateTime\n) {\n ...rankings_collections\n}\n\nfragment rankings_collections on Query {\n collections(after: $cursor, chains: $chain, first: $count, sortBy: $sortBy, parents: $parents, createdAfter: $createdAfter, sortAscending: false, includeHidden: true, excludeZeroVolume: true) {\n edges {\n node {\n createdDate\n name\n slug\n logo\n stats {\n floorPrice\n marketCap\n numOwners\n totalSupply\n sevenDayChange\n sevenDayVolume\n oneDayChange\n oneDayVolume\n thirtyDayChange\n thirtyDayVolume\n totalVolume\n id\n }\n id\n __typename\n }\n cursor\n }\n pageInfo {\n endCursor\n hasNextPage\n }\n }\n}\n","variables":{"chain":None,"count":100,"cursor":"YXJyYXljb25uZWN0aW9uOjk5","sortBy":"SEVEN_DAY_VOLUME","parents":None,"createdAfter":None}}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    s.headers['x-api-key'] = '2f6f419a083c46de9d83ce3dbe7db601'
    s.headers['x-build-id'] = 'cplNDIqD8Uy8MvANX90r9'
    s.headers['referer'] = 'https://opensea.io/'
    res = s.post(link,json=payload)
    pprint(res.status_code)
    print(res.json())
How can I scrape the tabular content from that webpage using the requests module?
You can regex the data out of a script tag and then reconstruct the table. There is some column formatting left to do.
import requests, re, json
import pandas as pd
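# Fetch the page; the table data is embedded as JSON inside a script tag
r = requests.get('https://opensea.io/rankings')
# Capture everything assigned to window.__wired__ up to the next '<'
data = json.loads(re.search(r'window\.__wired__=([^<]*)', r.text).group(1))
# Keep only the collection records and their stats records
items = [v for v in data['records'].values() if v['__typename'] in ['CollectionType', 'CollectionStatsType']]
# Pair each collection with the stats record that follows it, keyed by name
d = {i['name']:j for i, j in zip(items[::2], items[1::2])}
df = pd.DataFrame.from_dict(d, orient='index')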
print(df)
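As an offline illustration of the approach above, the sketch below runs the same regex-and-pair steps on a tiny made-up page, so it works without network access. The sample HTML, the record keys, and the `extract_rows` helper are assumptions for illustration only; the real `window.__wired__` payload may be shaped differently and may change without notice.

```python
import json
import re

# Hypothetical, heavily simplified stand-in for the page source.
SAMPLE_HTML = (
    '<script>window.__wired__={"records": {'
    '"a": {"__typename": "CollectionType", "name": "Alpha"}, '
    '"b": {"__typename": "CollectionStatsType", "floorPrice": 1.5, "numOwners": 10}, '
    '"c": {"__typename": "OtherType"}, '
    '"d": {"__typename": "CollectionType", "name": "Beta"}, '
    '"e": {"__typename": "CollectionStatsType", "floorPrice": 0.2, "numOwners": 3}'
    '}}</script>'
)

WANTED = ("CollectionType", "CollectionStatsType")


def extract_rows(html):
    """Return one dict per collection, merging its name with its stats."""
    match = re.search(r'window\.__wired__=([^<]*)', html)
    if match is None:
        # The page layout changed or the request was blocked.
        return []
    data = json.loads(match.group(1))
    items = [v for v in data["records"].values() if v["__typename"] in WANTED]
    rows = []
    # Assumes records alternate: collection, stats, collection, stats, ...
    for collection, stats in zip(items[::2], items[1::2]):
        row = {"name": collection["name"]}
        row.update({k: v for k, v in stats.items() if k != "__typename"})
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Returning an empty list when the pattern is missing avoids the `AttributeError` that `.group(1)` would raise on `None`. A list of dicts like `rows` can also be passed straight to `pd.DataFrame(rows)` if a DataFrame is preferred.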
I don't think the GraphQL query is what you want. There is a GET query there that returns data.
Try this:
res = s.get('https://api.opensea.io/tokens/?limit=100')
I think OpenSea uses Cloudflare to protect its API. Try sending your requests through ScrapeNinja or Puppeteer; that approach seems to work.
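Before switching tools, it may help to check whether the 403 really comes from Cloudflare. The sketch below looks for the `Server: cloudflare` and `CF-RAY` response headers that Cloudflare-proxied responses typically carry. `looks_like_cloudflare_block` is a hypothetical helper and the header values are made up; treat the result as a heuristic, not proof.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a 403/503 response carries Cloudflare markers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}
    from_cloudflare = lowered.get("server") == "cloudflare" or "cf-ray" in lowered
    return status_code in (403, 503) and from_cloudflare


# Made-up header sets for illustration.
blocked = looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "abc123-XYZ"})
plain = looks_like_cloudflare_block(403, {"Server": "nginx"})
print(blocked, plain)
```

With the session from the question, this could be called as `looks_like_cloudflare_block(res.status_code, res.headers)`.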