I'm facing a problem with my web scraping code and I don't really know what the problem is. Can any of you help me, please? I use this code to scrape data from a job site. I'm using Python and some libraries, such as BeautifulSoup.
import csv
from itertools import zip_longest

import requests
from bs4 import BeautifulSoup

job_titles = []
company_names = []
locations = []
links = []
salaries = []
# using requests to fetch the URL
result = requests.get('https://wuzzuf.net/search/jobs/?q=python&a=hpb')
# saving the page's content/markup
src = result.content
# create a soup object to parse the content
soup = BeautifulSoup(src, 'lxml')
#print(soup)
# Now we're looking for the elements that contain the info we need (job title, job skills, company name, location)
job_title = soup.find_all("h2",{"class":"css-m604qf"})
company_name = soup.find_all("a", {"class": "css-17s97q8"})
location = soup.find_all("span", {"class": "css-5wys0k"})
# Making a loop over the returned lists to extract the needed info into other lists
for i in range(len(job_title)):
    job_titles.append(job_title[i].text)
    links.append(job_title[i].find("a").attrs['href'])
    company_names.append(company_name[i].text)
    locations.append(location[i].text)
for link in links:
    results = requests.get(link)
    src = results.content
    soup = BeautifulSoup(src, 'lxml')
    salary = soup.find("a", {"class": "css-4xky9y"})
    salaries.append(salary.text)
# Creating a CSV file to store our values
file_list = [job_titles, company_names, locations, links, salaries]
exported = zip_longest(*file_list)
with open("C:\\Users\\NOUFEL\\Desktop\\scraping\\wazzuf\\jobs.csv", "w") as myfile:
    wr = csv.writer(myfile)
    wr.writerow(["job title", "company name", "location", "links", "salaries"])
    wr.writerows(exported)
The problem is:

PS C:\Users\NOUFEL> & C:/Users/NOUFEL/AppData/Local/Microsoft/WindowsApps/python3.10.exe c:/Users/NOUFEL/Desktop/ScrapeWuzzuf.py
Traceback (most recent call last):
  File "c:\Users\NOUFEL\Desktop\ScrapeWuzzuf.py", line 33, in <module>
    results = requests.get(link)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\sessions.py", line 515, in request
    prep = self.prepare_request(req)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\sessions.py", line 443, in prepare_request
    p.prepare(
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\models.py", line 318, in prepare
    self.prepare_url(url, params)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\models.py", line 392, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer-Virtual-Worker-Now-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb': No scheme supplied. Perhaps you meant http:///jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer-Virtual-Worker-Now-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb?
Thanks in advance.
If you read the error message

requests.exceptions.MissingSchema: Invalid URL '/jobs/p/1XOMELtShdah-Flask-Python-Backend-

or if you displayed `link`, then you would see that you get a relative link like `/jobs/p/1XOMELtShdah-Flask-Python-...`, and you have to add `https://wuzzuf.net` at the beginning to get an absolute link.

results = requests.get("https://wuzzuf.net" + link)
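As a more general alternative to string concatenation, the standard library's `urllib.parse.urljoin` can resolve the relative `href` against the page it came from (a sketch; the path here is a shortened illustrative value, not a real job URL):

```python
from urllib.parse import urljoin

# the page the link was scraped from
base = "https://wuzzuf.net/search/jobs/?q=python&a=hpb"
# a root-relative href, as returned by soup's attrs['href']
relative = "/jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer"

# urljoin replaces the path of the base URL with the relative one
absolute = urljoin(base, relative)
print(absolute)  # → https://wuzzuf.net/jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer
```

This also handles the edge cases that plain concatenation misses, such as hrefs that are already absolute.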
To get the data you want, you need to find a "container" selector that holds all of the information we need about a job as child elements. In our case, this is the `.css-1gatmva` selector. Have a look at the SelectorGadget Chrome extension, which lets you pick selectors by clicking on the desired element in the browser (it's not always perfect).
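The container idea can be seen on a small inline snippet: iterate over each job "card" first, then select children relative to that card, so the fields of one job never get mixed up with another's (a sketch; the class names mirror the ones used above, the HTML is made up):

```python
from bs4 import BeautifulSoup

# two minimal job "cards", mimicking the structure of the real page
html = """
<div class="css-1gatmva"><h2 class="css-m604qf"><a href="/jobs/p/1">Python Developer</a></h2></div>
<div class="css-1gatmva"><h2 class="css-m604qf"><a href="/jobs/p/2">Data Engineer</a></h2></div>
"""

soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the lxml dependency

# select each container, then look up children inside it
titles = [card.select_one(".css-m604qf a").text for card in soup.select(".css-1gatmva")]
print(titles)  # → ['Python Developer', 'Data Engineer']
```

Compared with separate `find_all` calls per field, this avoids misaligned lists when one card is missing a field.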
Parsing the site can run into a problem: when you request it, the site may decide you are a bot and refuse to respond. To prevent that, you need to send `headers` containing a `user-agent` with your request; the site will then assume you are a real user and show the information.
Requests may be blocked (when using the `requests` library) because the default `user-agent` in `requests` is `python-requests`. An additional step could be to rotate the `user-agent`, for example switching between PC, mobile, and tablet strings, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
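Rotation can be as simple as picking a random string from a small pool on every request (a minimal sketch; the user-agent strings are illustrative values, and the pool would normally be larger and kept up to date):

```python
import random

# a small pool of desktop and mobile user-agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

def random_headers():
    """Build a headers dict with a randomly chosen user-agent for each request."""
    return {"user-agent": random.choice(USER_AGENTS)}

# would be passed as: requests.get(url, headers=random_headers())
print(random_headers())
```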
Check the code in the online IDE.
import requests, lxml, json
from bs4 import BeautifulSoup
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python"  # query
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36"
}
html = requests.get("https://wuzzuf.net/search/jobs/", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
data = []

for result in soup.select(".css-1gatmva"):
    title = result.select_one(".css-m604qf .css-o171kl").text
    company_name = result.select_one(".css-17s97q8").text
    adding_time = result.select_one(".css-4c4ojb, .css-do6t5g").text
    location = result.select_one(".css-5wys0k").text
    employment = result.select_one(".css-1lh32fc").text
    snippet = result.select_one(".css-1lh32fc + div").text

    data.append({
        "title": title,
        "company_name": company_name,
        "adding_time": adding_time,
        "location": location,
        "employment": employment,
        "snippet": snippet
    })

print(json.dumps(data, indent=2))
Example output:
[
{
"title": "Python Developer For Job portal",
"company_name": "Fekra Technology Solutions and Construction -",
"adding_time": "24 days ago",
"location": "Dokki, Giza, Egypt ",
"employment": "Full TimeWork From Home",
"snippet": "Experienced \u00b7 4+ Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 backend \u00b7 Computer Science \u00b7 Django \u00b7 Flask \u00b7 Git \u00b7 Information Technology (IT) \u00b7 postgres \u00b7 Python"
},
{
"title": "Senior Python Linux Engineer",
"company_name": "El-Sewedy Electrometer -",
"adding_time": "1 month ago",
"location": "6th of October, Giza, Egypt ",
"employment": "Full Time",
"snippet": "Experienced \u00b7 3 - 5 Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 Software Development \u00b7 Python \u00b7 C++ \u00b7 Information Technology (IT) \u00b7 Computer Science \u00b7 SQL \u00b7 Programming \u00b7 Electronics"
}
]
[
{
"title": "Senior Python Developer",
"company_name": "Trufla -",
"adding_time": "2 days ago",
"location": "Heliopolis, Cairo, Egypt ",
"employment": "Full Time",
"snippet": "Experienced \u00b7 4+ Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 Agile \u00b7 APIs \u00b7 AWS \u00b7 Computer Science \u00b7 Git \u00b7 Linux \u00b7 Python \u00b7 REST"
},
# ...
]
You need to use `results = requests.get("https://wuzzuf.net" + link)`:
for link in links:
    results = requests.get("https://wuzzuf.net" + link)
    src = results.content
    soup = BeautifulSoup(src, 'lxml')
    salary = soup.find("a", {"class": "css-4xky9y"})
    salaries.append(salary.text)
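One more caveat, separate from the URL fix: `soup.find()` returns `None` when a job page has no element with that class, so `salary.text` would then raise `AttributeError`. A hedged sketch of a guard (the class name is taken from the code above; the placeholder value is an assumption):

```python
from bs4 import BeautifulSoup

def extract_salary(page_html):
    """Return the salary text from a job page, or "N/A" when the element is missing."""
    soup = BeautifulSoup(page_html, "html.parser")
    salary = soup.find("a", {"class": "css-4xky9y"})
    return salary.text if salary is not None else "N/A"

print(extract_salary('<a class="css-4xky9y">Confidential</a>'))  # → Confidential
print(extract_salary('<div>no salary element here</div>'))       # → N/A
```

With this guard, one job page without a salary element no longer crashes the whole scrape.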