Failed to scrape tabular content from a webpage using requests module

I'm trying to scrape tabular content from a webpage using the requests module. The page's content is heavily dynamic; however, according to dev tools, it can be accessed via an API. I'm trying to mimic that call by issuing a POST request with the appropriate parameters, but I always get status 403.

import requests
from pprint import pprint

start_url = 'https://opensea.io/rankings'
link = 'https://api.opensea.io/graphql/'
payload = {"id":"rankingsQuery","query":"query rankingsQuery(\n  $chain: [ChainScalar!]\n  $count: Int!\n  $cursor: String\n  $sortBy: CollectionSort\n  $parents: [CollectionSlug!]\n  $createdAfter: DateTime\n) {\n  ...rankings_collections\n}\n\nfragment rankings_collections on Query {\n  collections(after: $cursor, chains: $chain, first: $count, sortBy: $sortBy, parents: $parents, createdAfter: $createdAfter, sortAscending: false, includeHidden: true, excludeZeroVolume: true) {\n    edges {\n      node {\n        createdDate\n        name\n        slug\n        logo\n        stats {\n          floorPrice\n          marketCap\n          numOwners\n          totalSupply\n          sevenDayChange\n          sevenDayVolume\n          oneDayChange\n          oneDayVolume\n          thirtyDayChange\n          thirtyDayVolume\n          totalVolume\n          id\n        }\n        id\n        __typename\n      }\n      cursor\n    }\n    pageInfo {\n      endCursor\n      hasNextPage\n    }\n  }\n}\n","variables":{"chain":None,"count":100,"cursor":"YXJyYXljb25uZWN0aW9uOjk5","sortBy":"SEVEN_DAY_VOLUME","parents":None,"createdAfter":None}}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    s.headers['x-api-key'] = '2f6f419a083c46de9d83ce3dbe7db601'
    s.headers['x-build-id'] = 'cplNDIqD8Uy8MvANX90r9'
    s.headers['referer'] = 'https://opensea.io/'
    res = s.post(link,json=payload)
    pprint(res.status_code)
    print(res.json())

How can I scrape the tabular content from that webpage using the requests module?

You can pull the data out of a script tag with a regex and then reconstruct the table. Some column formatting is still left to do.

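import json
import re
import pandas as pd
import requests

r = requests.get('https://opensea.io/rankings')
# The page state is JSON assigned to window.__wired__ inside a script tag
data = json.loads(re.search(r'window\.__wired__=([^<]*)', r.text).group(1))
# Keep only the collection records and their stats records
items = [v for v in data['records'].values() if v['__typename'] in ['CollectionType', 'CollectionStatsType']]
# Pair by position: this relies on each collection being followed by its stats
d = {i['name']: j for i, j in zip(items[::2], items[1::2])}
df = pd.DataFrame.from_dict(d, orient='index')
print(df)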
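
A slightly more defensive variant of the same idea, as a rough sketch: the regex in the snippet above stops at the first `<`, so it would cut the JSON short if any value contained that character, and it raises if the marker is missing. The helper name `extract_wired_state` and the sample HTML are made up for illustration; the real page may be laid out differently.

```python
import json
import re


def extract_wired_state(html, name="__wired__"):
    """Return the JSON value assigned to window.<name> in html, or None."""
    match = re.search(r"window\." + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        state, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state


# Hypothetical sample standing in for the real page source
sample_html = (
    '<html><script>window.__wired__={"records": {'
    '"a": {"__typename": "CollectionType", "name": "Demo <One>"},'
    '"b": {"__typename": "CollectionStatsType", "floorPrice": 1.5},'
    '"c": {"__typename": "Other"}'
    "}};</script></html>"
)

state = extract_wired_state(sample_html)
wanted = ("CollectionType", "CollectionStatsType")
items = [v for v in state["records"].values() if v.get("__typename") in wanted]
# Same positional pairing as above: collection record, then its stats record
table = {c["name"]: s for c, s in zip(items[::2], items[1::2])}
print(table)
```

With the live page you would pass `r.text` instead of `sample_html` and check for `None` before indexing into the result.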

I don't think that GraphQL query is the one you want. There is a GET request in the network traffic that returns the data.

res = s.get('https://api.opensea.io/tokens/?limit=100')

I think OpenSea uses Cloudflare to protect its API. Try sending your request through ScrapeNinja or Puppeteer; that approach seems to work.
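
One way to check that guess before switching tools is to look at the headers of the 403 response. This is only a heuristic sketch: the function name `looks_like_cloudflare_block` is invented here, and it assumes that responses served through Cloudflare carry a `cf-ray` header or `Server: cloudflare`, which does not prove the block came from its bot protection.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a 403/503 response appears to be served by Cloudflare."""
    # Header names are case-insensitive, so normalise them first
    lowered = {k.lower(): v for k, v in headers.items()}
    via_cloudflare = (
        "cf-ray" in lowered
        or lowered.get("server", "").lower() == "cloudflare"
    )
    return status_code in (403, 503) and via_cloudflare


# Hypothetical header sets for illustration
blocked = {"Server": "cloudflare", "CF-RAY": "0000000000000000-XXX"}
plain = {"Server": "nginx"}
print(looks_like_cloudflare_block(403, blocked))
print(looks_like_cloudflare_block(403, plain))
```

With the session from the question, the call would be `looks_like_cloudflare_block(res.status_code, res.headers)`.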

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 