Unable to scrape different company names from a static webpage using the requests module
I've created a script to collect the different company names from this website using the requests module, but when I run it, it ends up with nothing. I looked for the company names in the page source and found them there, so they appear to be static.
import requests
from bs4 import BeautifulSoup
link = 'https://clutch.co/agencies/digital-marketing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("h3.company_info > a"):
        print(item.text)
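Before switching tools, it helps to confirm that the selector itself is sound. A minimal sketch against a hand-written HTML snippet (hypothetical markup mimicking the page structure, not the live page source) shows the parsing logic works; the empty result therefore comes from Cloudflare blocking the request (checking `res.status_code` will typically show 403 rather than 200), not from the selector:

```python
from bs4 import BeautifulSoup

# Hand-written snippet mimicking the structure the selector targets.
sample = """
<h3 class="company_info"><a href="/profile/webfx">WebFX</a></h3>
<h3 class="company_info"><a href="/profile/socialseo">SocialSEO</a></h3>
"""

soup = BeautifulSoup(sample, "html.parser")
# Same CSS selector as in the question: anchors that are direct children
# of <h3 class="company_info"> elements.
names = [a.get_text(strip=True) for a in soup.select("h3.company_info > a")]
print(names)  # ['WebFX', 'SocialSEO']
```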
Since the website is protected by Cloudflare, there is a Python module called cloudscraper that attempts to bypass Cloudflare's anti-bot page.
Using that module, you can get the data you want.
For example:
import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
scraper = cloudscraper.create_scraper()
source_html = scraper.get("https://clutch.co/agencies/digital-marketing").text
soup = BeautifulSoup(source_html, "lxml")
company_data = [
    [item.getText(strip=True), f"https://clutch.co{item['href']}"]
    for item in soup.select("h3.company_info > a")
]
df = pd.DataFrame(company_data, columns=["Company", "URL"])
print(tabulate(df, headers="keys", tablefmt="github", showindex=False))
This should print:
| Company | URL |
|----------------------------------|------------------------------------------------------------|
| WebFX | https://clutch.co/profile/webfx |
| Ignite Visibility | https://clutch.co/profile/ignite-visibility |
| SocialSEO | https://clutch.co/profile/socialseo |
| Lilo Social | https://clutch.co/profile/lilo-social |
| Favoured | https://clutch.co/profile/favoured |
| Power Digital | https://clutch.co/profile/power-digital |
| Belkins | https://clutch.co/profile/belkins |
| SmartSites | https://clutch.co/profile/smartsites |
| Straight North | https://clutch.co/profile/straight-north |
| Victorious | https://clutch.co/profile/victorious |
| Uplers | https://clutch.co/profile/uplers |
| Daniel Brian Advertising | https://clutch.co/profile/daniel-brian-advertising |
| Thrive Internet Marketing Agency | https://clutch.co/profile/thrive-internet-marketing-agency |
| Big Leap | https://clutch.co/profile/big-leap |
| Mad Fish Digital | https://clutch.co/profile/mad-fish-digital |
| Razor Rank | https://clutch.co/profile/razor-rank |
| Brolik | https://clutch.co/profile/brolik |
| Search Berg | https://clutch.co/profile/search-berg |
| Socialfix Media | https://clutch.co/profile/socialfix-media |
| Kanbar Digital, LLC | https://clutch.co/profile/kanbar-digital |
| NextLeft | https://clutch.co/profile/nextleft |
| Fruition | https://clutch.co/profile/fruition |
| Impactable | https://clutch.co/profile/impactable |
| Lets Tok | https://clutch.co/profile/lets-tok |
| Pyxl | https://clutch.co/profile/pyxl |
| Sagefrog Marketing Group | https://clutch.co/profile/sagefrog-marketing-group |
| Foreignerds INC. | https://clutch.co/profile/foreignerds |
| Social Driver | https://clutch.co/profile/social-driver |
| 3 Media Web | https://clutch.co/profile/3-media-web |
| Brand Vision | https://clutch.co/profile/brand-vision-1 |
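If you want to keep the scraped rows instead of only printing them, the same DataFrame can be written out with pandas' `to_csv`. A small sketch, using hypothetical sample rows in the same `[Company, URL]` shape as `company_data` above:

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped company_data.
rows = [
    ["WebFX", "https://clutch.co/profile/webfx"],
    ["SocialSEO", "https://clutch.co/profile/socialseo"],
]

df = pd.DataFrame(rows, columns=["Company", "URL"])
# index=False drops the row-number column so the file has only the two columns.
df.to_csv("companies.csv", index=False)
```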
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
link = 'https://clutch.co/agencies/digital-marketing'
driver = webdriver.Chrome()
# Go to the website
driver.get(link)
# Wait for the page to load
time.sleep(5)
# Get the page source
html = driver.page_source
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# Select the anchor inside each "company_info" heading and print its text
for item in soup.select("h3.company_info > a"):
print(item.text)
# Close the browser
driver.quit()