How do I use soup.find and soup.find_all?
Here is my code and its output:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
job = soup.find("div", class_ = "relative inline-flex flex-col w-full text-sm font-normal pt-2")
company_name = job.find('a[href*="jobs"]')
print(company_name)
Output:
None
However, when I use the select method I get the result I want, but I cannot call .text on it:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
job = soup.find("div", class_ = "relative inline-flex flex-col w-full text-sm font-normal pt-2")
company_name = job.select('a[href*="jobs"]').text
print(company_name)
Output:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Change your selection strategy. The main issue here is that not all company names are linked:
job.find('div',{'class':'search-result__job-meta'}).text.strip()
or
job.select_one('.search-result__job-meta').text.strip()
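To make the difference between the three calls concrete, here is a small sketch that runs against an inline HTML snippet instead of the live site. The markup and the "job" class are made up for illustration; only the search-result__job-meta class comes from the answer above. find() treats its first argument as a tag name rather than a CSS selector, select() returns a list-like ResultSet, and select_one() returns a single element (or None).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on one job card
html = """
<div class="job">
  <h3>Restaurant Manager</h3>
  <div class="search-result__job-meta"> Balkaan Employments service </div>
  <a href="/jobs?q=balkaan" rel="nofollow">Balkaan Employments service</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
job = soup.find("div", class_="job")

# find() looks for a tag literally named 'a[href*="jobs"]', so nothing matches
as_tag_name = job.find('a[href*="jobs"]')

# select() understands CSS selectors but returns a ResultSet (list-like)
matches = job.select('a[href*="jobs"]')

# select_one() returns the first matching Tag, so .text is available
first = job.select_one('a[href*="jobs"]')

# The strategy from the answer: read the meta element instead of the link
company = job.select_one(".search-result__job-meta").text.strip()

print(as_tag_name, len(matches), first.text, company)
```

With a ResultSet, .text has to be read per element, for example matches[0].text or a loop over matches.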
Also store your information in a structured way for post-processing:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
data = []
for job in soup.select('div:has(>.search-result__body)'):
    data.append({
        'job': job.h3.text,
        'company': job.select_one('.search-result__job-meta').text.strip()
    })
data
[{'job': 'Restaurant Manager', 'company': 'Balkaan Employments service'},
{'job': 'Executive Assistant', 'company': 'Nolla Fresh & Frozen ltd'},
{'job': 'Portfolio Manager/Instructor 1', 'company': 'Fun Science World'},
{'job': 'Microbiologist', 'company': "NEIMETH INT'L PHARMACEUTICALS PLC"},
{'job': 'Data Entry Officer', 'company': 'Nkoyo Pharmaceuticals Ltd.'},
{'job': 'Chemical Analyst', 'company': "NEIMETH INT'L PHARMACEUTICALS PLC"},
{'job': 'Senior Front-End Engineer', 'company': 'Salvo Agency'},...]
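If a job card happens to lack one of the elements, select_one() returns None and calling .text on it raises an AttributeError. The following is a minimal sketch of a more defensive version of the loop above. It assumes a simplified, made-up page structure (the "card" class and the inline HTML are illustrative), and text_or_none is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card has no company element
html = """
<section>
  <div class="card">
    <div class="search-result__body">
      <h3>Restaurant Manager</h3>
      <div class="search-result__job-meta"> Balkaan Employments service </div>
    </div>
  </div>
  <div class="card">
    <div class="search-result__body">
      <h3>Executive Assistant</h3>
    </div>
  </div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


data = []
for job in soup.select("div:has(> .search-result__body)"):
    data.append({
        "job": text_or_none(job, "h3"),
        "company": text_or_none(job, ".search-result__job-meta"),
    })

print(data)
```

A list of dictionaries like this can be passed on as-is to later processing steps, with None marking the fields that were missing on the page.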
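Earlier comments and answers have already covered the problems with your search strategy. Here is an alternative solution that uses the regular expression library together with a find_all() call: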
import requests
from bs4 import BeautifulSoup
import re
res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
# Raw string so that \? reaches the regex engine as an escaped question mark
company_name = soup.find_all("a", href=re.compile(r"/jobs\?"), rel="nofollow")
for link in company_name:
    print(link.text)
Output:
GRATIAS DEI NIGERIA LIMITED
Balkaan Employments service
Fun Science World
NEIMETH INT'L PHARMACEUTICALS PLC
Nkoyo Pharmaceuticals Ltd.
...
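The filter logic can be checked offline as well. The sketch below applies the same find_all() call to a few made-up links (the hrefs and the non-company link texts are illustrative assumptions) to show that only anchors whose href contains "/jobs?" and whose rel is "nofollow" are kept.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links: only the first and third satisfy both filters
html = """
<a href="/jobs?q=gratias" rel="nofollow">GRATIAS DEI NIGERIA LIMITED</a>
<a href="/jobs/accounting">Accounting jobs</a>
<a href="/jobs?q=fun" rel="nofollow">Fun Science World</a>
<a href="/about" rel="nofollow">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The compiled pattern is searched within each href value;
# rel="nofollow" must match as well
links = soup.find_all("a", href=re.compile(r"/jobs\?"), rel="nofollow")
names = [link.get_text(strip=True) for link in links]

print(names)
```

Because find_all() returns a list of elements, the text is collected per element here, which is the same reason .text failed on the ResultSet in the question.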