
How to scrape additional pages of the main page in a website

I'm new to Python. With some help, I've written some code to scrape data from a webpage. However, with this code I can only scrape the first page of each link.

At the moment, the code below lets me scrape the record links for each year's data based on that year's first page ( https://aviation-safety.net/database/dblist.php?Year=1949 ).

However, for some years there are additional pages of data, e.g. ( https://aviation-safety.net/database/dblist.php?Year=1949&lang=&page=2 ) and ( https://aviation-safety.net/database/dblist.php?Year=1949&lang=&page=3 ).

I was wondering whether it's possible to also retrieve the record links from those additional pages for each year.

#get the additional links within each Year Link
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
main_url = "https://aviation-safety.net/database/"

def get_and_parse_url(main_url):
    result = requests.get(main_url)
    soup = BeautifulSoup(result.content, 'html.parser')
    data_table = [main_url + i['href'] for i in soup.select('[href*=Year]')]
    return data_table

with requests.Session() as s:
    data_table = get_and_parse_url(main_url)
    df = pd.DataFrame(data_table, columns=['url'])
    datatable2 = [] #create outside so can append to it

    for anker in df.url:
        result = s.get(anker, headers = headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable2.append(['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')])

#flatten list of lists
datatable2 = [i for sublist in datatable2 for i in sublist]
df2 = pd.DataFrame(datatable2, columns=['add_url'])
for i in df2.add_url:
    print(i)

Any kind of help is greatly appreciated, thanks!

For each of the initial year pages you can determine the additional child pages by collecting the a tags inside the element with class pagenumbers (restricting to a single occurrence of that element by adding nth-of-type, since the page-number bar appears more than once on the page); do this in a list comprehension that generates the actual additional-page URLs, then collect from those pages with an extra loop. At the time of writing this yielded 22,629 distinct links.

import requests
from bs4 import BeautifulSoup as bs

base = 'https://aviation-safety.net/database/'
headers = {'User-Agent':'Mozilla/5.0'}
inner_links = []

def get_soup(url):
    # s is the requests.Session created below; get_soup is only called inside the with block
    r = s.get(url, headers=headers)
    soup = bs(r.text, 'lxml')
    return soup

with requests.Session() as s:
    soup = get_soup('https://aviation-safety.net/database/')
    # one dblist link per year on the database index page
    initial_links = [base + i['href'] for i in soup.select('[href*="Year="]')]

    for link in initial_links:
        soup = get_soup(link)
        # record links on the first page for this year
        inner_links += ['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')]
        # build the urls of the additional pages from the page-number links
        pages = [f'{link}&lang=&page={i.text}' for i in soup.select('.pagenumbers:nth-of-type(2) a')]

        for page in pages:
            soup = get_soup(page)
            # record links on each additional page
            inner_links += ['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')]
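
If, as in the question, the end goal is a pandas DataFrame of record URLs, the collected inner_links can be de-duplicated and wrapped up afterwards. A minimal follow-up sketch, not part of the original answer (the names unique_links and df_records are illustrative):

import pandas as pd

# drop duplicate record URLs before building the DataFrame (illustrative follow-up, not from the answer)
unique_links = sorted(set(inner_links))
df_records = pd.DataFrame(unique_links, columns=['record_url'])
print(len(df_records))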
