
Web scraping with Beautiful Soup: entering all links and getting information

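I am trying to open every company listed on StackOverflow Companies and get specific information from each one (for example, the whole description). Is there an easy way to do this with Beautiful Soup? For now I am only getting the links to the companies on the first page.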

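import requests
from bs4 import BeautifulSoup

# Fetch the first page of the company listing.
r = requests.get('https://stackoverflow.com/jobs/companies')
src = r.content
soup = BeautifulSoup(src, 'lxml')
urls = []

# The company links are expected inside <h2> tags; skip any <h2> without a link.
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    if a_tag is not None and a_tag.has_attr('href'):
        urls.append(a_tag.attrs['href'])

print(urls)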

import requests
from bs4 import BeautifulSoup as bsoup

base_url = "https://stackoverflow.com"

# Walk listing pages 1 to 5 and follow every company link found on each.
for i in range(1, 6):
    site_source = requests.get(
        f"{base_url}/jobs/companies?pg={i}"
    ).content
    soup = bsoup(site_source, "html.parser")
    company_list = soup.find("div", class_="company-list")
    if company_list is None:
        continue  # no listing container on this page, nothing to follow
    company_block = company_list.find_all("div", class_="grid--cell fl1 text")
    for company in company_block:
        link = company.find("a")
        if link:
            # The href is site-relative, so prepend the base URL.
            company_url = link.attrs["href"]
            company_source = requests.get(base_url + company_url).content
            company_soup = bsoup(company_source, "html.parser")
            company_info = company_soup.find("div", id="company-name-tagline")
            print("Name: ", company_info.find("h1").text)
            print("Info: ", company_info.find("p").text)
            print()

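I'm basically looping through pages 1 to 5, getting the link for each company, then opening each company's page and printing its name and description.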

My output:

Name:  BigCommerce
Info:  Think BIG

Name:  Facebook
Info:  Our mission is to give people the power to build community and bring the world closer together.   

Name:  trivago N.V.
Info:  A diverse team of talents that make a blazing fast accommodation search powered by cutting-edge tech and entrepreneurial innovation. 

Name:  General Dynamics UK
Info:  General Dynamics UK is one of the UK’s leading defence companies, and an important supplier to the UK Ministry of Defence (MoD).   

Name:  EDF
Info:  EDF is leading the transition to a cleaner, low emission electric future, tackling climate change and helping Britain reach net zero.

Name:  Radix DLT
Info:  Delivering Scalable Trust.
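
The loop above fetches and parses in one place, which makes it hard to check the parsing without hitting the site. Below is a minimal sketch that pulls the parsing into two functions and runs them on inline HTML, so no network access is needed. The function names `extract_company_urls` and `extract_company_info` and the sample HTML are invented for illustration; the selectors (`h2` links, `company-name-tagline`) are the ones used in the post and may not match the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://stackoverflow.com"


def extract_company_urls(listing_html, base_url=BASE_URL):
    """Return absolute URLs for every <h2> link on a listing page (hypothetical helper)."""
    soup = BeautifulSoup(listing_html, "html.parser")
    urls = []
    for h2_tag in soup.find_all("h2"):
        a_tag = h2_tag.find("a", href=True)
        if a_tag is not None:  # some headings may carry no link
            urls.append(urljoin(base_url, a_tag["href"]))
    return urls


def extract_company_info(company_html):
    """Return the name and tagline from a company page, or None if the block is missing."""
    soup = BeautifulSoup(company_html, "html.parser")
    header = soup.find("div", id="company-name-tagline")
    if header is None:
        return None
    name = header.find("h1")
    tagline = header.find("p")
    return {
        "name": name.get_text(strip=True) if name else "",
        "info": tagline.get_text(strip=True) if tagline else "",
    }


# Invented sample markup, shaped like what the selectors in the post expect.
SAMPLE_LISTING = (
    '<div class="company-list">'
    '<h2><a href="/jobs/companies/example-co">Example Co</a></h2>'
    "<h2>No link here</h2>"
    "</div>"
)
SAMPLE_COMPANY = (
    '<div id="company-name-tagline">'
    "<h1> Example Co </h1><p>An invented tagline. </p>"
    "</div>"
)

print(extract_company_urls(SAMPLE_LISTING))
print(extract_company_info(SAMPLE_COMPANY))
```

With the parsing isolated like this, the fetching loop only has to call `requests.get(url).content` and hand the result to these functions, and a missing block shows up as `None` rather than an `AttributeError`.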

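Yes. You can go through the first page, then move to the second page by clicking the page-two button with Selenium, passing the page source to Beautiful Soup each time. I think this should work.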
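
A rough sketch of that idea follows, assuming the pagination links are plain anchors whose visible text is the page number (this is a guess about the page, not something the post confirms). The helper name `collect_page_sources` and the fixed `delay` are hypothetical. The function only needs an object with Selenium's `get`, `page_source` and `find_elements`, so it can be tried with any WebDriver.

```python
import time


def collect_page_sources(driver, start_url, max_pages=5, delay=1.0):
    """Return the HTML of up to max_pages pages, clicking numbered pagination links."""
    driver.get(start_url)
    sources = [driver.page_source]
    for page_number in range(2, max_pages + 1):
        # "link text" is the locator string behind Selenium's By.LINK_TEXT.
        links = driver.find_elements("link text", str(page_number))
        if not links:
            break  # no link labelled with the next page number, so stop
        links[0].click()
        time.sleep(delay)  # crude pause; an explicit WebDriverWait is more robust
        sources.append(driver.page_source)
    return sources


def main():
    """Run against a real browser (requires selenium and a Chrome driver)."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        pages = collect_page_sources(driver, "https://stackoverflow.com/jobs/companies")
        for html in pages:
            print(len(html))  # each html string can be passed to BeautifulSoup
    finally:
        driver.quit()
```

Each string in the returned list can be given to `BeautifulSoup(html, "html.parser")` just like `requests.get(...).content` in the code above. If the page number can be set in the URL, as with `?pg=` in the other answer, plain `requests` is simpler and a browser is not needed.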

