
I'm facing a problem with my web scraping code and I don't really know what the problem is

I'm facing a problem with my web scraping code and I don't really know what the problem is. Could anyone help me, please? This code is used to scrape data from a jobs website. I used Python and some libraries such as BeautifulSoup.

import requests
import csv
from itertools import zip_longest
from bs4 import BeautifulSoup

job_titles = []
company_names = []
locations = []
links = []
salaries = []
#using requests to fetch the URL :
result = requests.get('https://wuzzuf.net/search/jobs/?q=python&a=hpb')

#saving page's content/markup :
src = result.content

#create soup object to parse content 
soup = BeautifulSoup(src, 'lxml')
#print(soup)

#Now we're looking for the elements that contain the info we need (job title, job skills, company name, location)
job_title = soup.find_all("h2",{"class":"css-m604qf"})
company_name = soup.find_all("a", {"class": "css-17s97q8"})
location = soup.find_all("span", {"class": "css-5wys0k"})

#Making a loop over returned lists to extract needed info into other lists 
for i in range(len(job_title)):
    job_titles.append(job_title[i].text)
    links.append(job_title[i].find("a").attrs['href'])
    company_names.append(company_name[i].text)
    locations.append(location[i].text)
for link in links :
    results = requests.get(link)
    src = results.content
    soup = BeautifulSoup(src, 'lxml')
    salary = soup.find("a", {"class": "css-4xky9y"})
    salaries.append(salary.text)
#Creating a CSV file to store our values 
file_list = [job_titles, company_names, locations, links, salaries]
exported = zip_longest(*file_list)
with open("C:\\Users\\NOUFEL\\Desktop\\scraping\\wazzuf\\jobs.csv", "w") as myfile :
    wr = csv.writer(myfile)
    wr.writerow(["job title", "company name", "location", "links", "salaries"])
    wr.writerows(exported)

the problem is:

PS C:\Users\NOUFEL> & C:/Users/NOUFEL/AppData/Local/Microsoft/WindowsApps/python3.10.exe c:/Users/NOUFEL/Desktop/ScrapeWuzzuf.py
Traceback (most recent call last):
  File "c:\Users\NOUFEL\Desktop\ScrapeWuzzuf.py", line 33, in <module>
    results = requests.get(link)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\sessions.py", line 515, in request
    prep = self.prepare_request(req)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\sessions.py", line 443, in prepare_request
    p.prepare(
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\models.py", line 318, in prepare
    self.prepare_url(url, params)
  File "C:\Users\NOUFEL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\requests\models.py", line 392, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer-Virtual-Worker-Now-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb': No scheme supplied. Perhaps you meant http:///jobs/p/1XOMELtShdah-Flask-Python-Backend-Developer-Virtual-Worker-Now-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb?

thanks in advance

If you read the error message:

requests.exceptions.MissingSchema: Invalid URL '/jobs/p/1XOMELtShdah-Flask-Python-Backend-

or if you display link, then you will see that you get a relative link like /jobs/p/1XOMELtShdah-Flask-Python-... and you have to add https://wuzzuf.net at the beginning to get an absolute link:

results = requests.get(  "https://wuzzuf.net" + link )
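A slightly more robust variant (a sketch, not part of the original answer) is to let urllib.parse.urljoin build the absolute URL, since it resolves relative links against the base and leaves already-absolute links untouched:

from urllib.parse import urljoin

BASE_URL = "https://wuzzuf.net"

# Works whether the href is "/jobs/p/..." or already a full
# "https://wuzzuf.net/..." URL.
absolute_link = urljoin(BASE_URL, link)
results = requests.get(absolute_link)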

To get the required data, you need to find a "container" selector that holds all the information about the job we need as child elements. In our case, this is the .css-1gatmva selector. Have a look at the SelectorGadget Chrome extension to easily pick selectors by clicking on the desired element in your browser (it doesn't always work perfectly).

Problems with parsing the site may arise because, when you request it, the site may decide that the request comes from a bot. To prevent this, you need to send headers that contain a user-agent with the request; the site will then assume that you're a user and display the information.

The request might also be blocked outright, since the default user-agent in the requests library is python-requests. An additional step could be to rotate the user-agent, for example, to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on; a sketch of this idea follows below.
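For illustration only, here is a minimal sketch of user-agent rotation; the pool of strings is an assumed example, and in practice you would keep a larger, up-to-date list:

import random
import requests

# Hypothetical pool of user-agent strings covering different OSes and browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0",
]

# Pick a random user-agent for each request so consecutive requests
# don't all present the same browser fingerprint.
headers = {"user-agent": random.choice(USER_AGENTS)}
response = requests.get("https://wuzzuf.net/search/jobs/", params={"q": "python"}, headers=headers, timeout=30)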

Check the code in an online IDE.

import requests, lxml, json
from bs4 import BeautifulSoup

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python"   # query
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36"
}

html = requests.get("https://wuzzuf.net/search/jobs/", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

data = []

for result in soup.select(".css-1gatmva"):
    # each .css-1gatmva node is one job "card"; pull its child elements
    title = result.select_one(".css-m604qf .css-o171kl").text
    company_name = result.select_one(".css-17s97q8").text
    # the posting date uses one of two classes depending on how recent it is
    adding_time = result.select_one(".css-4c4ojb, .css-do6t5g").text
    location = result.select_one(".css-5wys0k").text
    employment = result.select_one(".css-1lh32fc").text
    # the skills/summary line is the div immediately after the employment tags
    snippet = result.select_one(".css-1lh32fc+ div").text

    data.append({
      "title": title,
      "company_name": company_name,
      "adding_time": adding_time,
      "location": location,
      "employment": employment,
      "snippet": snippet
    })

print(json.dumps(data, indent=2))

Example output

[
  {
    "title": "Python Developer For Job portal",
    "company_name": "Fekra Technology Solutions and Construction -",
    "adding_time": "24 days ago",
    "location": "Dokki, Giza, Egypt ",
    "employment": "Full TimeWork From Home",
    "snippet": "Experienced \u00b7 4+ Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 backend \u00b7 Computer Science \u00b7 Django \u00b7 Flask \u00b7 Git \u00b7 Information Technology (IT) \u00b7 postgres \u00b7 Python"
  },
  {
    "title": "Senior Python Linux Engineer",
    "company_name": "El-Sewedy Electrometer -",
    "adding_time": "1 month ago",
    "location": "6th of October, Giza, Egypt ",
    "employment": "Full Time",
    "snippet": "Experienced \u00b7 3 - 5 Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 Software Development \u00b7 Python \u00b7 C++ \u00b7 Information Technology (IT) \u00b7 Computer Science \u00b7 SQL \u00b7 Programming \u00b7 Electronics"
  },
  {
    "title": "Senior Python Developer",
    "company_name": "Trufla -",
    "adding_time": "2 days ago",
    "location": "Heliopolis, Cairo, Egypt ",
    "employment": "Full Time",
    "snippet": "Experienced \u00b7 4+ Yrs of Exp \u00b7 IT/Software Development \u00b7 Engineering - Telecom/Technology \u00b7 Agile \u00b7 APIs \u00b7 AWS \u00b7 Computer Science \u00b7 Git \u00b7 Linux \u00b7 Python \u00b7 REST"
  },
  # ...
]

You need to use results = requests.get("https://wuzzuf.net" + link):

for link in links :
    results = requests.get("https://wuzzuf.net"+link)
    src = results.content
    soup = BeautifulSoup(src, 'lxml')
    salary = soup.find("a", {"class": "css-4xky9y"})
    salaries.append(salary.text)
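Note that find returns None when a job page has no element with that class, so salary.text would raise an AttributeError. A defensive variant of the loop (a sketch; the "N/A" placeholder is an assumption, not part of the original answer):

for link in links:
    results = requests.get("https://wuzzuf.net" + link)
    soup = BeautifulSoup(results.content, 'lxml')
    salary = soup.find("a", {"class": "css-4xky9y"})
    # Fall back to a placeholder when the element is absent.
    salaries.append(salary.text if salary is not None else "N/A")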
