
Web Scraping Table from 'Dune.com' with Python3 and bs4

I am trying to scrape table data from Dune.com ( https://dune.com/queries/1144723 ). When I inspect the page in the browser, I can clearly see the <table></table> element, but when I run the following code, soup.find('table') returns None.

import bs4
import requests

data = []

r = requests.get('https://dune.com/queries/1144723/1954237')
soup = bs4.BeautifulSoup(r.text, "html5lib")

table = soup.find('table')

How can I successfully find this table data?

The page loads its data with JavaScript, so the table is not part of the HTML that requests receives. This example calls the site's API endpoint instead and loads the data into a DataFrame:

import requests
import pandas as pd
from bs4 import BeautifulSoup


api_url = "https://app-api.dune.com/v1/graphql"

payload = {
    "operationName": "GetExecution",
    "query": "query GetExecution($execution_id: String!, $query_id: Int!, $parameters: [Parameter!]!) {\n  get_execution(\n    execution_id: $execution_id\n    query_id: $query_id\n    parameters: $parameters\n  ) {\n    execution_queued {\n      execution_id\n      execution_user_id\n      position\n      execution_type\n      created_at\n      __typename\n    }\n    execution_running {\n      execution_id\n      execution_user_id\n      execution_type\n      started_at\n      created_at\n      __typename\n    }\n    execution_succeeded {\n      execution_id\n      runtime_seconds\n      generated_at\n      columns\n      data\n      __typename\n    }\n    execution_failed {\n      execution_id\n      type\n      message\n      metadata {\n        line\n        column\n        hint\n        __typename\n      }\n      runtime_seconds\n      generated_at\n      __typename\n    }\n    __typename\n  }\n}\n",
    "variables": {
        "execution_id": "01GN7GTHF62FY5DYYSQ5MSEG2H",
        "parameters": [],
        "query_id": 1144723,
    },
}


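# POST the GraphQL query and decode the JSON response
data = requests.post(api_url, json=payload).json()

# build the DataFrame from the rows of the successful execution
df = pd.DataFrame(data["data"]["get_execution"]["execution_succeeded"]["data"])
df["total_pnl"] = df["total_pnl"].astype(str)
# "account" holds an HTML <a> tag: split it into the link text and the href
df[["account", "link"]] = df.apply(
    func=lambda x: (
        (s := BeautifulSoup(x["account"], "html.parser")).text,
        s.a["href"],
    ),
    result_type="expand",
    axis=1,
)
print(df.head(10))  # <-- print sample data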

Prints:

                                      account           last_traded rankings           total_pnl          traded_since                                                                               link
0  0xff33f5653e547a0b54b86b35a45e8b1c9abd1c46  2022-02-01T13:57:01Z     🥇 #1   1591196.831211874  2021-11-20T18:04:19Z  https://www.gmx.house/arbitrum/account/0xff33f5653e547a0b54b86b35a45e8b1c9abd1c46
1  0xcb696fd8e239dd68337c70f542c2e38686849e90  2022-11-23T18:26:04Z     🥈 #2  1367359.0616298981  2022-10-26T06:45:14Z  https://www.gmx.house/arbitrum/account/0xcb696fd8e239dd68337c70f542c2e38686849e90
2                                  190416.eth  2022-12-20T20:30:09Z     🥉 #3   864694.6695150969  2022-09-06T03:07:03Z  https://www.gmx.house/arbitrum/account/0xa688bc5e676325cc5fc891ac48fe442f6298a432
3  0x1729f93e3c3c74b503b8130516984ced70bf47d9  2021-09-24T07:30:51Z       #4   801075.4878765604  2021-09-22T00:16:43Z  https://www.gmx.house/arbitrum/account/0x1729f93e3c3c74b503b8130516984ced70bf47d9
4  0x83b13abab6ec323fff3af6d18a8fd1646ea39477  2022-12-12T21:36:25Z       #5     682459.02019836  2022-04-18T14:19:56Z  https://www.gmx.house/arbitrum/account/0x83b13abab6ec323fff3af6d18a8fd1646ea39477
5  0x9fc3b6191927b044ef709addd163b15c933ee205  2022-12-03T00:05:33Z       #6   652673.6605261166  2022-11-02T18:26:18Z  https://www.gmx.house/arbitrum/account/0x9fc3b6191927b044ef709addd163b15c933ee205
6  0xe8c19db00287e3536075114b2576c70773e039bd  2022-12-23T08:59:38Z       #7    644020.503240131  2022-10-06T07:20:44Z  https://www.gmx.house/arbitrum/account/0xe8c19db00287e3536075114b2576c70773e039bd
7  0x75a34444581f563680003f2ba05ea0c890a10934  2022-11-10T18:08:50Z       #8   639684.0495719836  2022-03-06T23:20:41Z  https://www.gmx.house/arbitrum/account/0x75a34444581f563680003f2ba05ea0c890a10934
8                               omarazhar.eth  2022-09-16T00:27:22Z       #9   536522.3114796011  2022-04-11T20:44:42Z  https://www.gmx.house/arbitrum/account/0x204495da23507be4e1281c32fb1b82d9d4289826
9  0x023cb9f0662c6612e830b37a82f41125a4c117e1  2022-09-06T01:10:28Z      #10   496922.9880152336  2022-04-12T22:31:47Z  https://www.gmx.house/arbitrum/account/0x023cb9f0662c6612e830b37a82f41125a4c117e1
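
The query above also asks for execution_queued, execution_running and execution_failed, so the response may not always contain execution_succeeded. Here is a minimal sketch of a more defensive lookup. It assumes the states that do not apply come back as null or are missing, which I have not verified against the API. The helper name extract_rows and the sample responses are made up for illustration:

```python
def extract_rows(response: dict) -> list:
    """Return the result rows, or raise if the execution did not succeed."""
    execution = (response.get("data") or {}).get("get_execution") or {}

    succeeded = execution.get("execution_succeeded")
    if succeeded:
        return succeeded["data"]

    failed = execution.get("execution_failed")
    if failed:
        raise RuntimeError(f"execution failed: {failed.get('message')}")

    if execution.get("execution_queued") or execution.get("execution_running"):
        raise RuntimeError("execution has not finished yet")

    raise RuntimeError("no execution state in response")


# Hypothetical responses, shaped after the fields requested in the query.
ok_response = {
    "data": {
        "get_execution": {
            "execution_queued": None,
            "execution_running": None,
            "execution_succeeded": {"columns": ["account"], "data": [{"account": "a"}]},
            "execution_failed": None,
        }
    }
}
failed_response = {
    "data": {
        "get_execution": {
            "execution_succeeded": None,
            "execution_failed": {"message": "syntax error"},
        }
    }
}

rows = extract_rows(ok_response)
print(rows)
```

With this in place, pd.DataFrame(extract_rows(data)) would replace the direct indexing, and a stale execution_id would surface as a readable error instead of a TypeError on None.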