
How do I scrape data from URLs in a python-scraped list of URLs?

I am trying to use BeautifulSoup4 in Orange to scrape data from a list of URLs that were themselves scraped from the same website.

When I set the URL manually, I managed to scrape the data from a single page:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Find the standings table and print each of its child elements
rank = soup.find("table", class_="table-standings-body")
for child in rank.children:
    print(url, child)

I have also been able to scrape the list of URLs I need:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Print the href of every link inside the content section
link = soup.find('div', class_='contentSection')
for url_list in link.find_all('a'):
    print(url_list.get('href'))

But so far I have not been able to combine the two so that the data is scraped from each URL in that list. Is nesting for loops the only way to do this, and if so, how? Or is there a better approach?

Apologies if this is a silly question, but I only started experimenting with Python and web scraping yesterday, and I could not work this out by consulting similar topics.

Assuming that both of your code blocks work:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Visit each link in the content section and scrape its standings table
link = soup.find('div', class_='contentSection')
for url_list in link.find_all('a'):
    url = url_list.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    rank = soup.find("table", class_="table-standings-body")
    for child in rank.children:
        print(url, child)

This should work. However, I don't see a table element with the class table-standings-body in the page's DOM, so you may need to correct that element selector.
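
If that selector turns out to be wrong, one quick way to find the right one is to list every table on the page together with its class attribute. This is a minimal diagnostic sketch, not part of the original answer; it simply reuses the standings URL from the question:

import requests
from bs4 import BeautifulSoup

# One of the standings pages from the question, used here only as an example
url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Print the class attribute of every table on the page, so the
# correct selector can be picked by inspection
for table in soup.find_all("table"):
    print(table.get("class"))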

Try:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# get all links; the raw hrefs contain "&section=...", and the HTML parser
# decodes the "&sect" entity to "§", so restore the literal text before requesting
url_list = []
for a in soup.find("div", class_="contentSection").find_all("a"):
    url_list.append(a["href"].replace("§", "&sect"))

# get all data from URLs
all_data = []
for url in url_list:
    print(url)

    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")

    h2 = soup.h2
    sub = h2.find_next("p")

    for tr in soup.select("tr:has(td)"):
        all_data.append(
            [
                h2.get_text(strip=True),
                sub.get_text(strip=True),
                *[td.get_text(strip=True) for td in tr.select("td")],
            ]
        )

# save data to CSV
df = pd.DataFrame(
    all_data,
    columns=[
        "title",
        "sub_title",
        "Rank",
        "Horse / Owner",
        "Points",
        "Total Comps",
    ],
)
print(df)
df.to_csv("data.csv", index=None)

This iterates over all of the URLs and saves the combined data to data.csv.
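
Two details of this answer are worth noting: the CSS selector tr:has(td) (supported by BeautifulSoup's select() via soupsieve) matches only rows that contain td data cells, so header rows built from th elements are skipped; that is why the column names are supplied by hand when building the DataFrame.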

