Issues with data scraping using BeautifulSoup4
So basically I'm trying to scrape a jobs website; my goal is to retrieve the job title, company, salary, and location, which I plan to put into a CSV file so I can do some plotting with it. My current code is:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cvbankas.lt/?miestas=Vilnius&padalinys%5B0%5D=76&page=1'
#Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#HTML parser
page_soup = soup(page_html, 'html.parser')
# grabs each product
containers = page_soup.findAll('div',{'class':'list_a_wrapper'})
container = containers[0]
print(container.h3)
And this returns:
<h3 class="list_h3" lang="en">Senior Talent Manager</h3>
If I ask for container.h3['class'], this returns ['list_h3'].
If I ask for container.h3['lang'], I get en.
But I can't retrieve Senior Talent Manager.
Here is one of the job ads' HTML code:
<div class="list_a_wrapper">
<div class="list_cell">
<h3 class="list_h3" lang="en">Senior Talent Manager</h3>
<span class="heading_secondary">
<span class="dib mt5">UAB „Omnisend“</span></span>
</div>
<div class="list_cell jobadlist_list_cell_salary">
<span class="salary_c">
<span class="salary_bl salary_bl_gross">
<span class="salary_inner">
<span class="salary_text">
<span class="salary_amount">2300-3300</span>
<span class="salary_period">€/mėn.</span>
</span>
<span class="salary_calculation">Neatskaičius mokesčių</span>
</span>
</span>
<div class="salary_calculate_bl js_salary_calculate_a" data-href="https://www.cvbankas.lt/perskaiciuoti-skelbimo-atlyginima-6732785">
<div class="button_action">Skaičiuoti »</div>
<div class="salary_calculate_text">Į rankas per mėn.</div>
</div>
</span> </div>
<div class="list_cell list_ads_c_last">
<span class="txt_list_1" lang="lt"><span class="list_city">Vilniuje</span></span>
<span class="txt_list_2">prieš 4 d.</span>
</div>
</div>
So what approach would be best to scrape the title (in the h3), dib mt5, salary_amount, salary_calculation, and list_city?
You can retrieve the text inside a tag with:
title = tag.get_text()
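For example, applied to just the <h3> from your question: subscripting a tag accesses its attributes, while get_text() (or .text) returns the inner text. A minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

# Self-contained snippet using only the <h3> shown in the question
html = '<h3 class="list_h3" lang="en">Senior Talent Manager</h3>'
h3 = BeautifulSoup(html, 'html.parser').h3

print(h3['class'])    # attribute access -> ['list_h3']
print(h3.get_text())  # text content -> Senior Talent Manager
```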
This script will get the job title, company, salary, and location from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://www.cvbankas.lt/?miestas=Vilnius&padalinys%5B0%5D=76&page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each h3.list_h3 is a job title; the related fields follow it in document order
for h3 in soup.select('h3.list_h3'):
    job_title = h3.get_text(strip=True)
    company = h3.find_next(class_="heading_secondary").get_text(strip=True)
    salary = h3.find_next(class_="salary_amount").get_text(strip=True)
    location = h3.find_next(class_="list_city").get_text(strip=True)
    print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
Prints:
UAB „Omnisend“ 2300-3300 Vilniuje Senior Talent Manager
UAB „BALTIC VIRTUAL ASSISTANTS“ Nuo 2700 Vilniuje SENIOR .NET C# DEVELOPER
UAB „Lexita“ 1200-2500 Vilniuje IT PROJEKTŲ VADOVAS (-Ė)
UAB „Nordcode technology“ 1200-2000 Vilniuje PHP developer (mid-level)
UAB „Nordcurrent Group“ Nuo 2300 Vilniuje SENIOR VAIZDO ŽAIDIMŲ TESTUOTOJAS
UAB „Inlusion Netforms“ 1500-3500 Vilniuje Senior C++ Programmer to work with Unreal (UE4) game engine
UAB „Solitera“ 1200-2800 Vilniuje Java(Spring Boot) Developer
UAB „Metso Lithuania“ Nuo 1300 Vilniuje BI DATA ANALYST
UAB „Atticae“ 1000-1500 Vilniuje PHP programuotojas (-a)
UAB „EIS Group Lietuva“ 2000-7000 Vilniuje SYSTEM ARCHITECT
UAB GF Bankas Nuo 1200 Vilniuje HelpDesk specialistas (-ė)
Tesonet 1000-3000 Vilniuje Swift Developer (Security Product)
UAB „Mark ID“ 1000-3000 Vilniuje Full Stack programuotojas
...and so on.
EDIT: To save as CSV, you can use this script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

all_data = []
for page in range(1, 9):
    url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=' + str(page)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    for h3 in soup.select('h3.list_h3'):
        job_title = h3.get_text(strip=True)
        company = h3.find_next(class_="heading_secondary").get_text(strip=True)
        # Some ads don't list a salary, so guard against a missing tag
        salary = h3.find_next(class_="salary_amount")
        salary = salary.get_text(strip=True) if salary else '-'
        location = h3.find_next(class_="list_city").get_text(strip=True)
        print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))

        all_data.append({
            'Job Title': job_title,
            'Company': company,
            'Salary': salary,
            'Location': location
        })

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
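If you'd rather not pull in pandas just for the export, the same all_data rows can be written with the standard-library csv module. A sketch, with a hypothetical one-row sample standing in for the scraped data:

```python
import csv

# Hypothetical sample row, in the same shape as the all_data dicts above
all_data = [
    {'Job Title': 'Senior Talent Manager', 'Company': 'UAB „Omnisend“',
     'Salary': '2300-3300', 'Location': 'Vilniuje'},
]

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Job Title', 'Company', 'Salary', 'Location'])
    writer.writeheader()        # header row: Job Title,Company,Salary,Location
    writer.writerows(all_data)  # one CSV row per dict
```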
Saves data.csv (screenshot from LibreOffice not shown).
Instead of:
containers = page_soup.findAll('div',{'class':'list_a_wrapper'})
Try this:
results = []
for i in page_soup.find_all('div', {'class': 'list_a_wrapper'}):
    results.append(i.text)