
Scraping job hunting website using BeautifulSoup

I am trying to scrape all the complete job descriptions from this website, but I got stuck: https://www.seek.co.nz/data-analyst-jobs/full-time?daterange=31&salaryrange=70000-999999&salarytype=annual

My logic is to find all the job links on one page first, then loop through the next pages.

My code looks like this:

from bs4 import BeautifulSoup
import requests

url = ('https://www.seek.co.nz/data-analyst-jobs/full-time'
       '?daterange=31&salaryrange=70000-999999&salarytype=annual')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
link_list = []

for a in soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True):
    link_list.append('https://www.seek.co.nz/' + a['href'])
print(link_list)

The above code looks fine: I can collect the job links into a list and print them. But the following code only printed two paragraphs and then threw an error:

for link in link_list:
    response = requests.get(link, 'lxml')
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div',attrs={'data-automation': 'jobDescription'})
    for x in table:
        print(x.find('p').text)


AttributeError                            Traceback (most recent call last)
<ipython-input-41-8afe949a9497> in <module>()
      4     table = sp.find_all('div',attrs={'data-automation': 'jobDescription'})
      5     for x in table:
----> 6         print(x.find('p').text)

AttributeError: 'NoneType' object has no attribute 'text'

Can someone tell me why it didn't work and how to fix it? I am using Python 3 and bs4. Thank you!

Going through the webpage, it is clear that the first two posts returned by soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True) are advertisements or promoted posts, so the pages those two links lead to are structured differently from the regular job postings. For the promoted pages, perhaps soup.find('div', attrs={'class': 'templatetext'}) will do the job.
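One way to handle both layouts is to try the regular selector first and fall back to the promoted one, returning None instead of crashing when neither matches. A sketch (the `templatetext` class is the one suggested above; the exact markup on the live site may differ):

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the job description text, or None if no known layout matches."""
    sp = BeautifulSoup(html, 'html.parser')
    # Regular postings use the jobDescription container...
    container = sp.find('div', attrs={'data-automation': 'jobDescription'})
    if container is None:
        # ...while promoted/advertised postings may use a different class.
        container = sp.find('div', attrs={'class': 'templatetext'})
    if container is None:
        return None  # unrecognised layout: skip the page instead of crashing
    return container.get_text(separator='\n', strip=True)

# Works for both layouts:
regular = '<div data-automation="jobDescription"><p>Analyse data.</p></div>'
promoted = '<div class="templatetext"><p>Promoted role.</p></div>'
print(extract_description(regular))   # Analyse data.
print(extract_description(promoted))  # Promoted role.
```

Using `get_text()` on the whole container also sidesteps the question of whether the text is wrapped in `<p>` tags at all.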

Hope that helps!

You have to use find_all:

for link in link_list:
    # Note: the original requests.get(link, 'lxml') passed 'lxml' as the
    # params argument by mistake; the parser name belongs to BeautifulSoup.
    response = requests.get(link)
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div', attrs={'data-automation': 'jobDescription'})
    for x in table:
        # find_all returns a (possibly empty) list, so this loop simply
        # skips divs with no <p> instead of raising an AttributeError.
        for para in x.find_all('p'):
            print(para.text)
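The original loop crashed because `x.find('p')` returns None when a description div contains no direct `<p>` tag (as on the promoted pages), and `None.text` raises AttributeError. `find_all` avoids this by returning an empty list instead. A minimal demonstration:

```python
from bs4 import BeautifulSoup

# A description container whose text is not wrapped in <p> tags,
# like the promoted listings on the site.
html = '<div data-automation="jobDescription"><span>No paragraphs here</span></div>'
div = BeautifulSoup(html, 'html.parser').div

print(div.find('p'))                        # None -> .text here would crash
print([p.text for p in div.find_all('p')])  # [] -> the loop just does nothing
```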
