
Beautifulsoup object does not contain full table from webpage, instead grabs first 100 rows

I am attempting to scrape tables from the website spotrac.com and save the data to a pandas dataframe. For whatever reason, if the table I am scraping is over 100 rows, the BeautifulSoup object only appears to grab the first 100 rows of the table. If you run my code below, you'll see that the resulting dataframe has only 100 rows and ends with "David Montgomery." If you visit the webpage ( https://www.spotrac.com/nfl/rankings/2019/base/running-back/ ) and press Ctrl+F for "David Montgomery", you'll see that there are additional rows. If you change the URL in the get line of the code to "https://www.spotrac.com/nfl/rankings/2019/base/wide-receiver/", the same thing happens: only the first 100 rows end up in the BeautifulSoup object and in the dataframe.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Begin requests session
with requests.Session() as s:
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')

    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]

I have read that changing the parser can help, so I tried different parsers by replacing the BeautifulSoup line with the following:

soup = BeautifulSoup(r.content,'html5lib')
soup = BeautifulSoup(r.content,'html.parser')

Neither of these changes worked. I have run "pip install html5lib" and "pip install lxml" and confirmed that both were already installed.
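The parser choice makes no difference here, because any parser can only report the rows that are actually present in the HTML the server sent. A minimal sketch (using a made-up two-row table, not real spotrac data) showing that BeautifulSoup and pd.read_html both return exactly the rows in the markup:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# A tiny stand-in for the initial HTML a server might return
# (illustrative names only, not real spotrac data).
html = """
<table>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>100</td></tr>
  <tr><td>Player B</td><td>200</td></tr>
</table>
"""

# BeautifulSoup reports exactly the rows present in the markup,
# regardless of which parser backend is used.
soup = BeautifulSoup(html, 'html.parser')
data_rows = soup.find('table').find_all('tr')[1:]  # skip the header row
print(len(data_rows))  # 2

# pandas.read_html parses the same markup and agrees.
df = pd.read_html(StringIO(html))[0]
print(len(df))  # 2
```

So if only 100 rows come back, the remaining rows were never in the response to begin with.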

This page uses JavaScript to load the extra data.

In DevTools in Firefox/Chrome you can see that it sends a POST request with the extra form data {'ajax': True, 'mobile': False}:

import pandas as pd
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    # POST with the form data the page's JavaScript sends
    r = s.post('https://www.spotrac.com/nfl/rankings/2019/base/running-back/',
               data={'ajax': True, 'mobile': False})

    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
    print(df)
    

I suggest you use requests-html, which can render the page's JavaScript before you parse it:

import pandas as pd
from bs4 import BeautifulSoup
from requests_html import HTMLSession


if __name__ == "__main__":
    # Begin requests session
    s = HTMLSession()
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    r.html.render()
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.html.html, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]

Then you will get all 140 rows.
