繁体   English   中英

无法使用请求模块从网页中抓取表格内容

[英]Failed to scrape tabular content from a webpage using requests module

我正在尝试使用请求模块从网页中抓取表格内容。 该页面的内容是高度动态的但是,根据开发工具,可以通过 api 访问它。 我试图用适当的参数模仿相同的发布请求,但我总是得到状态403

import requests
from pprint import pprint

start_url = 'https://opensea.io/rankings'
link = 'https://api.opensea.io/graphql/'
payload = {"id":"rankingsQuery","query":"query rankingsQuery(\n  $chain: [ChainScalar!]\n  $count: Int!\n  $cursor: String\n  $sortBy: CollectionSort\n  $parents: [CollectionSlug!]\n  $createdAfter: DateTime\n) {\n  ...rankings_collections\n}\n\nfragment rankings_collections on Query {\n  collections(after: $cursor, chains: $chain, first: $count, sortBy: $sortBy, parents: $parents, createdAfter: $createdAfter, sortAscending: false, includeHidden: true, excludeZeroVolume: true) {\n    edges {\n      node {\n        createdDate\n        name\n        slug\n        logo\n        stats {\n          floorPrice\n          marketCap\n          numOwners\n          totalSupply\n          sevenDayChange\n          sevenDayVolume\n          oneDayChange\n          oneDayVolume\n          thirtyDayChange\n          thirtyDayVolume\n          totalVolume\n          id\n        }\n        id\n        __typename\n      }\n      cursor\n    }\n    pageInfo {\n      endCursor\n      hasNextPage\n    }\n  }\n}\n","variables":{"chain":None,"count":100,"cursor":"YXJyYXljb25uZWN0aW9uOjk5","sortBy":"SEVEN_DAY_VOLUME","parents":None,"createdAfter":None}}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    s.headers['x-api-key'] = '2f6f419a083c46de9d83ce3dbe7db601'
    s.headers['x-build-id'] = 'cplNDIqD8Uy8MvANX90r9'
    s.headers['referer'] = 'https://opensea.io/'
    res = s.post(link,json=payload)
    pprint(res.status_code)
    print(res.json())

如何使用请求模块从该网页中抓取表格内容?

您可以从脚本标签中对其进行正则表达式,然后重建表。 有一些列格式要做。

import requests, re, json
import pandas as pd

r = requests.get('https://opensea.io/rankings')
data = json.loads(re.search(r'window\.__wired__=([^<]*)', r.text).group(1))
items = [v for v in data['records'].values() if v['__typename'] in ['CollectionType', 'CollectionStatsType']]
d = {i['name']:j for i, j in zip(items[::2], items[1::2])}
df = pd.DataFrame.from_dict(d, orient='index')      
print(df)

我不认为 graphql 查询是你想要的。 那里有一个返回数据的 GET 查询。

试试吧

res = s.get('https://api.opensea.io/tokens/?limit=100')

我认为 opensea 使用 CloudFlare 来保护其 API .. 尝试通过ScrapeNinja或 Puppeteer 启动您的请求 - 这种方式似乎可以正常工作。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM