
Failed to scrape tabular content from a webpage using requests module

I'm trying to scrape tabular content from a webpage using the requests module. The content of that page is heavily dynamic. However, according to dev tools, it can be accessed via an API. I'm trying to mimic that call by issuing a POST request with the appropriate parameters, but I always get status 403.

import requests
from pprint import pprint

start_url = 'https://opensea.io/rankings'
link = 'https://api.opensea.io/graphql/'
payload = {"id":"rankingsQuery","query":"query rankingsQuery(\n  $chain: [ChainScalar!]\n  $count: Int!\n  $cursor: String\n  $sortBy: CollectionSort\n  $parents: [CollectionSlug!]\n  $createdAfter: DateTime\n) {\n  ...rankings_collections\n}\n\nfragment rankings_collections on Query {\n  collections(after: $cursor, chains: $chain, first: $count, sortBy: $sortBy, parents: $parents, createdAfter: $createdAfter, sortAscending: false, includeHidden: true, excludeZeroVolume: true) {\n    edges {\n      node {\n        createdDate\n        name\n        slug\n        logo\n        stats {\n          floorPrice\n          marketCap\n          numOwners\n          totalSupply\n          sevenDayChange\n          sevenDayVolume\n          oneDayChange\n          oneDayVolume\n          thirtyDayChange\n          thirtyDayVolume\n          totalVolume\n          id\n        }\n        id\n        __typename\n      }\n      cursor\n    }\n    pageInfo {\n      endCursor\n      hasNextPage\n    }\n  }\n}\n","variables":{"chain":None,"count":100,"cursor":"YXJyYXljb25uZWN0aW9uOjk5","sortBy":"SEVEN_DAY_VOLUME","parents":None,"createdAfter":None}}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    s.headers['x-api-key'] = '2f6f419a083c46de9d83ce3dbe7db601'
    s.headers['x-build-id'] = 'cplNDIqD8Uy8MvANX90r9'
    s.headers['referer'] = 'https://opensea.io/'
    res = s.post(link,json=payload)
    pprint(res.status_code)
    print(res.json())

How can I scrape tabular content from that webpage using the requests module?

You can pull the data out of a script tag with a regex and then reconstruct the table. Some column formatting is still needed afterwards.

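import json
import re

import pandas as pd
import requests

# The rankings data is embedded in the page source as JSON assigned to window.__wired__
r = requests.get('https://opensea.io/rankings')
data = json.loads(re.search(r'window\.__wired__=([^<]*)', r.text).group(1))
# Keep only the collection records and the stats records
items = [v for v in data['records'].values() if v['__typename'] in ['CollectionType', 'CollectionStatsType']]
# Assumes the records alternate collection, stats; key each stats row by collection name
d = {i['name']: j for i, j in zip(items[::2], items[1::2])}
df = pd.DataFrame.from_dict(d, orient='index')
print(df)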
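
The extract-and-pair step can be tried offline before pointing it at the live page. The sketch below is only an illustration of that approach: SAMPLE_HTML is a made-up, heavily trimmed stand-in for the page source, extract_rows is a hypothetical helper name, and the real payload is much larger and may use different field names. It also returns an empty list when the script tag is missing instead of raising.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for the page source (not real data).
SAMPLE_HTML = (
    '<script>window.__wired__={"records": {'
    '"c1": {"__typename": "CollectionType", "name": "Alpha", "slug": "alpha"},'
    '"s1": {"__typename": "CollectionStatsType", "floorPrice": 1.5, "numOwners": 10},'
    '"x1": {"__typename": "OtherType", "foo": 1},'
    '"c2": {"__typename": "CollectionType", "name": "Beta", "slug": "beta"},'
    '"s2": {"__typename": "CollectionStatsType", "floorPrice": 0.2, "numOwners": 3}'
    '}}</script>'
)

WANTED = ("CollectionType", "CollectionStatsType")


def extract_rows(html):
    """Return one dict per collection: its name plus the fields of its stats record."""
    match = re.search(r'window\.__wired__=([^<]*)', html)
    if match is None:
        # The blob is absent, e.g. because a block page was served instead.
        return []
    data = json.loads(match.group(1))
    items = [v for v in data["records"].values() if v.get("__typename") in WANTED]
    rows = []
    # Same positional pairing as above: assumes collection and stats records alternate.
    for collection, stats in zip(items[::2], items[1::2]):
        row = {"name": collection["name"]}
        row.update({k: v for k, v in stats.items() if k != "__typename"})
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows) if a DataFrame is wanted.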

I don't think that GraphQL query is the one you want. There is a GET request there that returns the data.

Try this instead:

res = s.get('https://api.opensea.io/tokens/?limit=100')

I think OpenSea uses Cloudflare to protect its API. Try sending your requests through ScrapeNinja or Puppeteer; that approach seems to work.
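
One rough way to check the Cloudflare theory is to look at the headers on the 403 response. The sketch below is a heuristic only, and looks_like_cloudflare_block is a hypothetical helper name. It relies on the fact that responses passing through Cloudflare normally carry a "Server: cloudflare" header and a "CF-RAY" header, which shows that Cloudflare handled the request but not why it was refused.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True when a 403/503 response carries Cloudflare's usual headers."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    served_by_cloudflare = lowered.get("server", "") == "cloudflare" or "cf-ray" in lowered
    return status_code in (403, 503) and served_by_cloudflare


# Illustrative header values, not captured from a real response.
blocked = looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "abc123"})
print(blocked)
```

With the session from the question this would be called as looks_like_cloudflare_block(res.status_code, res.headers). Checking the status first also avoids calling res.json() on a block page that is not JSON.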
