
How to save all links from all pages to csv using python beautiful soup

I am trying to save all the links collected from several paginated pages to a csv. From print(links) I can see every link I want to save from the multiple pages, but unfortunately, when I open the csv file, only a single URL has been saved. How can I save all the URLs I see in the terminal (print(links)) to the csv?

Below is my code:

import csv
import time
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

def scrape_pages(url) -> None:
    #max_pages = 10
    max_pages = 5  # doing 3 pages for examples sake
    current_page = 1

    # Loop through all pages dynamically and build the url using the page number suffix the website uses
    while current_page <= max_pages:
        print(f'{url}page/{current_page}')

        # Get each page's html
        raw_html1 = requests.get(f'{url}page/{current_page}')
        soup1 = BeautifulSoup(raw_html1.text, 'html.parser')
        current_page += 1

        # Find all table rows and from each table row get the needed data
        #root = 'https://www.myjobmag.com'
        for link1 in soup1.find_all('li', {'class': 'mag-b'}):
            link2 = link1.find('a', href=True)
            link3 = 'https://www.myjobmag.com' + (link2['href'])

        links = []
        [links.append(link3) for link2 in link1]

        for link2 in links:
            raw_html = urlopen(link3)
            soup = BeautifulSoup(raw_html.read(), 'html.parser')

        def getTitle(soup):
            return soup.find('h2', class_="mag-b").text.strip()

        def getCompany(soup):
            return soup.find('li', class_="job-industry").text.strip()

        def getInfo(soup):
            return soup.find('ul', class_="job-key-info").text.strip()

        def getDescription(soup):
            return soup.find('div', class_="job-details").text.strip()

        def getApplication(soup):
            return soup.find('div', class_="mag-b bm-b-30").text.strip()

        with open('output.csv', 'w', encoding='utf8', newline='') as f_output:
            csv_output = csv.writer(f_output)
            csv_output.writerow(['Title', 'Info', 'Desc', 'Application'])
            row = [getTitle(soup), getCompany(soup), getInfo(soup),
                   getDescription(soup), getApplication(soup)]
            print(row)
            for f_output in row:
                csv_output.writerow(row)

        # print(product, row, Title, Company, Info, Description, Application)

        time.sleep(5)  # sleep before scraping next page to not send too many requests at once
        print('\n\n')  # Clearing console up

def main() -> int:
    URL = 'https://www.myjobmag.com/'
    scrape_pages(URL)
    return 0

if __name__ == '__main__':
    exit(main())
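The reason only one row survives is visible in the code above: output.csv is reopened with mode 'w' on every pass through the while loop, so each page truncates whatever the previous page wrote, and the list comprehension [links.append(link3) for link2 in link1] only ever re-appends the last link3 found on the page. Below is a minimal sketch of one way around this, reusing the question's own selectors (not verified against the live site) and writing the file once after all pages have been collected:

import csv
import time
import requests
from bs4 import BeautifulSoup

def scrape_pages(url, max_pages=5) -> None:
    rows = []
    for page in range(1, max_pages + 1):
        listing = requests.get(f'{url}page/{page}')
        soup1 = BeautifulSoup(listing.text, 'html.parser')

        # Collect every job link on this listing page before fetching the details
        links = ['https://www.myjobmag.com' + a['href']
                 for li in soup1.find_all('li', {'class': 'mag-b'})
                 for a in li.find_all('a', href=True)]

        for link in links:
            detail = requests.get(link)
            soup = BeautifulSoup(detail.text, 'html.parser')
            # Same fields as the question's CSV header; the company field could be added the same way
            rows.append([
                soup.find('h2', class_='mag-b').text.strip(),
                soup.find('ul', class_='job-key-info').text.strip(),
                soup.find('div', class_='job-details').text.strip(),
                soup.find('div', class_='mag-b bm-b-30').text.strip(),
            ])
            time.sleep(5)  # be polite between requests

    # Open the file once, after all pages are scraped, so earlier rows are not overwritten
    with open('output.csv', 'w', encoding='utf8', newline='') as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['Title', 'Info', 'Desc', 'Application'])
        csv_output.writerows(rows)

The posted answer below takes a different route: it gathers every job into a list of tuples using a requests.Session and writes everything out once at the end with pandas.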

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time as t

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

# Reuse one session so the User-Agent header is sent with every request
s = requests.Session()
s.headers.update(headers)

links_list = []
for x in range(1, 3):
    # Fetch each listing page and collect the individual job entries
    r = s.get(f'https://www.myjobmag.com/page/{x}')
    soup = BeautifulSoup(r.text, 'html.parser')
    links = soup.select_one('ul.job-list').select('li.job-list-li')
    for link in links:
        try:
            title = link.select_one('h2').text.strip()
            url = link.select_one('h2').select_one('a').get('href')
            # Follow the job link and pull the details from the job page
            r = s.get(f'https://www.myjobmag.com{url}')
            soup = BeautifulSoup(r.text, 'html.parser')
            key_info = soup.select_one('ul.job-key-info').text.strip()
            description = soup.select_one('div.job-details').text.strip()
            application_method = soup.select_one('div.mag-b.bm-b-30').text.strip()

            links_list.append((title, key_info, description, application_method, url))
            print(f'done {title} -- {url}')
            t.sleep(5)  # pause between requests to avoid hammering the site
        except Exception as e:
            print(e)

# Build a dataframe from the collected rows and save everything in one go
df = pd.DataFrame(links_list, columns = ['title', 'key_info', 'description', 'application_method', 'url'])
df.to_csv('my_wonderful_jobs_list.csv')

This returns a csv file containing the job title, key info, description, application method, and URL.
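As a quick sanity check, the resulting file can be read back with pandas; the filename and column names below are the ones used in the answer, and passing index=False to to_csv would drop the extra index column pandas writes by default.

import pandas as pd

# Load the CSV written by the answer and preview a couple of columns
df = pd.read_csv('my_wonderful_jobs_list.csv')
print(df[['title', 'url']].head())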
