
Scraping table (several pages) to Pandas Dataframe

I am trying to transfer the data of a long table (24 pages) into a Pandas DataFrame, but I am facing (I think) some problems with the for-loop code.

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://scrapethissite.com/pages/forms/?page_num={}'
res = requests.get(base_url.format('1'))
soup = BeautifulSoup(res.text, 'lxml')

table = soup.select('table.table')[0]
columns = table.find('tr').find_all('th')
columns_names = [str(c.get_text()).strip() for c in columns]
table_rows = table.find_all('tr', class_='team')

l = []
for n in range(1, 25):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = BeautifulSoup(res.text, 'lxml')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [str(tr.get_text()).strip() for tr in td]
        l.append(row)

df = pd.DataFrame(l, columns=columns_names)

The DataFrame ends up containing only repetitions of the first page, rather than a copy of all the data in the table.

I agree with @mxbi.

Try it:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://scrapethissite.com/pages/forms/?page_num={}'

l = []
for n in range(1, 25):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = BeautifulSoup(res.text, 'lxml')

    # parse the table from the page that was just fetched
    table = soup.select('table.table')[0]
    columns = table.find('tr').find_all('th')
    columns_names = [str(c.get_text()).strip() for c in columns]
    table_rows = table.find_all('tr', class_='team')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [str(cell.get_text()).strip() for cell in td]
        l.append(row)

df = pd.DataFrame(l, columns=columns_names)
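
The key change: table, columns, and table_rows are rebuilt from each page's soup inside the loop, so the inner loop always iterates over the rows of the page just fetched. In your version those variables were built once from page 1 before the loop, so each of the 24 iterations appended the same page-1 rows again.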

requests is needed because the server requires a user-agent header, and pandas read_html does not let you set one. Since you still want to use pandas to generate the dataframe, you can gain some efficiency by handling the requests with multiprocessing, extracting the table of interest in a user-defined function and passing it to read_html. You get back a list of dataframes that can be combined with pandas concat.

Note: this cannot be run in Jupyter, as it will block.

import pandas as pd
from multiprocessing import Pool, cpu_count
import requests
from bs4 import BeautifulSoup as bs

def get_table(url: str) -> pd.DataFrame:
    # requests sends a user-agent header; read_html then only has to parse the markup
    soup = bs(requests.get(url).text, 'lxml')
    df = pd.read_html(str(soup.select_one('.table')))[0]
    df['page_num'] = url.split("=")[-1]
    return df

if __name__ == '__main__':
    
    urls = [f'https://scrapethissite.com/pages/forms/?page_num={i}' for i in range(1, 25)]

    # distribute the page requests across worker processes
    with Pool(cpu_count()-1) as p:
        results = p.map(get_table, urls)

    # stack the per-page dataframes into one
    final = pd.concat(results)
    print(final)
    # final.to_csv('data.csv', index = False, encoding = 'utf-8-sig')
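
If you are working in a notebook, a thread pool sidesteps the multiprocessing limitation noted above; since the work here is I/O-bound, threads give a similar speed-up. A minimal sketch (not part of the original answer), reusing the same get_table function with concurrent.futures from the standard library:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from concurrent.futures import ThreadPoolExecutor

def get_table(url: str) -> pd.DataFrame:
    soup = bs(requests.get(url).text, 'lxml')
    df = pd.read_html(str(soup.select_one('.table')))[0]
    df['page_num'] = url.split("=")[-1]
    return df

urls = [f'https://scrapethissite.com/pages/forms/?page_num={i}' for i in range(1, 25)]

# threads share the interpreter, so no __main__ guard is needed and this also runs in Jupyter;
# max_workers is a tuning choice, 8 is just an example value
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(get_table, urls))

final = pd.concat(results)
print(final)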
