Scraping job hunting website using BeautifulSoup
I am trying to scrape all the complete job descriptions from this website, but I got stuck: https://www.seek.co.nz/data-analyst-jobs/full-time?daterange=31&salaryrange=70000-999999&salarytype=annual
My logic is to find all the job links on one page first, then loop through the next pages.
My code looks like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.seek.co.nz/data-analyst-jobs/full-time?daterange=31&salaryrange=70000-999999&salarytype=annual'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

link_list = []
for a in soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True):
    link_list.append('https://www.seek.co.nz/' + a['href'])

print(link_list)
The above code looks fine. I can print a list of job links and put them in a list, but the following code only printed out 2 paragraphs and then it threw an error:
for link in link_list:
    response = requests.get(link, 'lxml')
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div', attrs={'data-automation': 'jobDescription'})
    for x in table:
        print(x.find('p').text)
AttributeError Traceback (most recent call last)
<ipython-input-41-8afe949a9497> in <module>()
4 table = sp.find_all('div',attrs={'data-automation': 'jobDescription'})
5 for x in table:
----> 6 print(x.find('p').text)
AttributeError: 'NoneType' object has no attribute 'text'
Can someone tell me why it didn't work and how to make it right? I am using Python 3 and bs4. Thank you!
Going through the webpage, it is clear that the first two posts you'll get through soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True) are advertisements or promoted posts, and consequently the webpages these two lead to are different from those of the other job postings. For the latter, perhaps soup.find('div', attrs={'class': 'templatetext'}) will do the job.

Hope that helps!
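As a sketch of how both page layouts could be handled together: try the regular container first, then fall back to the promoted-post one. The 'templatetext' class name is taken from this answer and may not match the live site; the inline HTML snippets below are made-up stand-ins for pages that would normally come from requests.get(link).text.

```python
from bs4 import BeautifulSoup

def job_description(html):
    """Return the job description text from a posting page, or None."""
    sp = BeautifulSoup(html, 'html.parser')
    # Regular job postings use this container...
    div = sp.find('div', attrs={'data-automation': 'jobDescription'})
    if div is None:
        # ...while promoted/ad postings seem to use this one (an assumption).
        div = sp.find('div', attrs={'class': 'templatetext'})
    return div.get_text(separator='\n', strip=True) if div else None

# Made-up stand-ins for the two page types:
regular = '<div data-automation="jobDescription"><p>Analyse data.</p></div>'
promoted = '<div class="templatetext"><p>Promoted role.</p></div>'
print(job_description(regular))   # Analyse data.
print(job_description(promoted))  # Promoted role.
```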
You have to use find_all:
for link in link_list:
    response = requests.get(link)
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div', attrs={'data-automation': 'jobDescription'})
    for x in table:
        for para in x.find_all('p'):
            print(para.text)
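If you just want the whole description and don't care which tags it is wrapped in, calling get_text() on the container sidesteps the AttributeError entirely, since it never returns None. The inline HTML below is a made-up stand-in for a real posting page:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text of one job posting page.
html = ('<div data-automation="jobDescription">'
        'Intro<p>First paragraph</p><p>Second paragraph</p></div>')
sp = BeautifulSoup(html, 'html.parser')
for x in sp.find_all('div', attrs={'data-automation': 'jobDescription'}):
    # get_text collects all nested text, so a missing <p> cannot crash the loop
    print(x.get_text(separator='\n', strip=True))
```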