
Trying to scrape a table from a website with &lt;div&gt; tags

I am trying to scrape this table: https://momentranks.com/topshot/account/mariodustice?limit=250

I have tried this:

import requests
from bs4 import BeautifulSoup
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})

But it returns an empty list. Can someone give advice on how to approach this?

Selenium is a bit of overkill when there is an available API. Just get the data directly:

import requests
import pandas as pd

url = 'https://momentranks.com/api/account/details'

rows = []
page = 0
while True:
    
    payload = {
        'filters': {'page': '%s' %page, 'limit': "250", 'type': "moments"},
        'flowAddress': "f64f1763e61e4087"}
    
    jsonData = requests.post(url, json=payload).json()
    
    data = jsonData['data']
    rows += data
    
    print('%s of %s' %(len(rows),jsonData['totalCount'] ))
    if len(rows) == jsonData['totalCount']:
        break
    
    page += 1

df = pd.DataFrame(rows)

Output:

print(df)
                           _id    flowId  ...  challenges priceFloor
0     619d2f82fda908ecbe74b607  24001245  ...         NaN        NaN
1     61ba30837c1f070eadc0f8e4  25651781  ...         NaN        NaN
2     618d87b290209c5a51128516  21958292  ...         NaN        NaN
3     61aea763fda908ecbe9e8fbf  25201655  ...         NaN        NaN
4     60c38188e245f89daf7c4383  15153366  ...         NaN        NaN
                       ...       ...  ...         ...        ...
1787  61d0a2c37c1f070ead6b10a8  27014524  ...         NaN        NaN
1788  61d0a2c37c1f070ead6b10a8  27025557  ...         NaN        NaN
1789  61e9fafcd8acfcf57792dc5d  28711771  ...         NaN        NaN
1790  61ef40fcd8acfcf577273709  28723650  ...         NaN        NaN
1791  616a6dcb14bfee6c9aba30f9  18394076  ...         NaN        NaN

[1792 rows x 40 columns]
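Several of the returned columns hold nested objects (the moment details, for example), which show up in the DataFrame as raw dicts. If you want those flattened into their own columns, pandas can do it directly. A minimal sketch using a synthetic two-row sample; the field names are only illustrative of the API's shape, which has around 40 columns in reality:

```python
import pandas as pd

# Synthetic rows mimicking the nested shape of the API response
# (field names here are illustrative, not the full real schema).
rows = [
    {"_id": "a1", "flowId": 24001245, "moment": {"playerName": "Trae Young"}},
    {"_id": "b2", "flowId": 25651781, "moment": {"playerName": "Ja Morant"}},
]

# json_normalize expands nested dicts into dotted column names.
df = pd.json_normalize(rows)
print(df.columns.tolist())
# ['_id', 'flowId', 'moment.playerName']
```

Dotted columns like `moment.playerName` are usually easier to filter and sort on than a column containing raw dicts.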

The data is rendered into the page by JavaScript, so you can't use requests alone; you can, however, use Selenium. Keep in mind that Selenium's driver.get doesn't wait for the page to completely load, which means you need to wait yourself.

Here is something to get you started with Selenium:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
driver.get(url)
time.sleep(5)  # adjust this wait (in seconds) depending on your case
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})

The source HTML you see in your browser is rendered using JavaScript. When you use requests this does not happen, which is why your script is not working. If you print the HTML that is returned, it will not contain the information you wanted.

All of the information is, though, available via the API that your browser calls to build the page. You will need to take a detailed look at the JSON data structure returned to decide which information you wish to extract.
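One quick way to survey an unfamiliar response is to walk it recursively and list every key path with its value type. A minimal sketch, run here against a trimmed, hypothetical sample of the response shape (the real response has many more fields):

```python
import json

# A trimmed, hypothetical sample of the API response shape.
sample = json.loads(
    '{"data": [{"MRvalue": 672.38, "moment": {"playerName": "Scottie Barnes"}}],'
    ' "totalCount": 1}'
)

def key_paths(obj, prefix=""):
    """Recursively collect 'path: type' strings for a nested JSON object."""
    if isinstance(obj, dict):
        paths = []
        for k, v in obj.items():
            paths += key_paths(v, f"{prefix}.{k}" if prefix else k)
        return paths
    if isinstance(obj, list) and obj:
        return key_paths(obj[0], prefix + "[0]")
    return [f"{prefix}: {type(obj).__name__}"]

for path in key_paths(sample):
    print(path)
# data[0].MRvalue: float
# data[0].moment.playerName: str
# totalCount: int
```

Running this against the real `jsonData` from the API shows at a glance which nested fields exist before you commit to extracting any of them.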

The following example shows how to get a list of the names and MRvalue of each player:

import requests
from bs4 import BeautifulSoup
import json

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
req_main = s.get(url, headers=headers)
soup = BeautifulSoup(req_main.content, 'lxml')
data = soup.find('script', id='__NEXT_DATA__')
json_data = json.loads(data.string)
account = json_data['props']['pageProps']['account']['flowAddress']
post_data = {"flowAddress" : account, "filters" : {"page" : 0, "limit" : "250", "type" : "moments"}}
req_json = s.post('https://momentranks.com/api/account/details', headers=headers, json=post_data)
player_data = req_json.json()

for player in player_data['data']:
    name = player['moment']['playerName']
    mrvalue = player['MRvalue']
    print(f"{name:30} ${mrvalue:.02f}")

Giving you output starting:

Scottie Barnes                 $672.38
Cade Cunningham                $549.00
Josh Giddey                    $527.11
Franz Wagner                   $439.26
Trae Young                     $429.51
A'ja Wilson                    $387.07
Ja Morant                      $386.00

The flowAddress is needed from the first page request to allow the API to be used correctly. This happens to be embedded in a <script> section at the bottom of the HTML.
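The extraction itself boils down to "find the script tag, parse its JSON". A self-contained sketch against a minimal stand-in for the page; the real `__NEXT_DATA__` blob is far larger, and `html.parser` is used here instead of `lxml` only to avoid the extra dependency:

```python
import json
from bs4 import BeautifulSoup

# Minimal stand-in for the page's HTML; the real __NEXT_DATA__ blob is
# much larger, but the flowAddress lives at the same key path.
html = '''<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"account": {"flowAddress": "f64f1763e61e4087"}}}}
</script>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
blob = json.loads(soup.find('script', id='__NEXT_DATA__').string)
flow_address = blob['props']['pageProps']['account']['flowAddress']
print(flow_address)  # f64f1763e61e4087
```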

All of this was worked out by using the browser's network tools to watch how the actual webpage made requests to the server to build its page.
