
Scraping job hunting website using BeautifulSoup

I am trying to scrape all the complete job descriptions from this website, but I got stuck: https://www.seek.co.nz/data-analyst-jobs/full-time?daterange=31&salaryrange=70000-999999&salarytype=annual

My logic is to find all the job links on one page first, then loop through the next pages.

My code looks like this:

from bs4 import BeautifulSoup
import requests

url = ('https://www.seek.co.nz/data-analyst-jobs/full-time'
       '?daterange=31&salaryrange=70000-999999&salarytype=annual')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
link_list = []

for a in soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True):
    link_list.append('https://www.seek.co.nz/' + a['href'])
print(link_list)

The above code looks fine: I can collect the job links into a list and print them. But the following code only printed two paragraphs and then threw an error:

for link in link_list:
    response = requests.get(link, 'lxml')
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div',attrs={'data-automation': 'jobDescription'})
    for x in table:
        print(x.find('p').text)


AttributeError                            Traceback (most recent call last)
<ipython-input-41-8afe949a9497> in <module>()
      4     table = sp.find_all('div',attrs={'data-automation': 'jobDescription'})
      5     for x in table:
----> 6         print(x.find('p').text)

AttributeError: 'NoneType' object has no attribute 'text'

Can someone tell me why it didn't work and how to fix it? I am using Python 3 and bs4. Thank you!

Going through the webpage, it is clear that the first two posts returned by soup.find_all('a', attrs={'data-automation': 'jobTitle'}, href=True) are advertisements or promoted posts, so the pages those two links lead to are structured differently from the regular job postings. For the promoted pages, perhaps soup.find('div', attrs={'class': 'templatetext'}) will do the job.
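One way to handle both layouts is to try the regular selector first and fall back to the promoted one, returning None instead of crashing when neither matches. A sketch (the `templatetext` class is the one suggested above; the exact markup on the live site may differ):

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the job description text, or None if no known layout matches."""
    sp = BeautifulSoup(html, 'html.parser')
    # Regular postings use the jobDescription container...
    container = sp.find('div', attrs={'data-automation': 'jobDescription'})
    if container is None:
        # ...while promoted/advertised postings may use a different class.
        container = sp.find('div', attrs={'class': 'templatetext'})
    if container is None:
        return None  # unrecognised layout: skip the page instead of crashing
    return container.get_text(separator='\n', strip=True)

# Works for both layouts:
regular = '<div data-automation="jobDescription"><p>Analyse data.</p></div>'
promoted = '<div class="templatetext"><p>Promoted role.</p></div>'
print(extract_description(regular))   # Analyse data.
print(extract_description(promoted))  # Promoted role.
```

Using `get_text()` on the whole container also sidesteps the question of whether the text is wrapped in `<p>` tags at all.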

Hope that helps!

You have to use find_all:

for link in link_list:
    # Note: the original requests.get(link, 'lxml') passed 'lxml' as the
    # params argument by mistake; the parser name belongs to BeautifulSoup.
    response = requests.get(link)
    sp = BeautifulSoup(response.text, 'html.parser')
    table = sp.find_all('div', attrs={'data-automation': 'jobDescription'})
    for x in table:
        # find_all returns a (possibly empty) list, so this loop simply
        # skips divs with no <p> instead of raising an AttributeError.
        for para in x.find_all('p'):
            print(para.text)
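The original loop crashed because `x.find('p')` returns None when a description div contains no direct `<p>` tag (as on the promoted pages), and `None.text` raises AttributeError. `find_all` avoids this by returning an empty list instead. A minimal demonstration:

```python
from bs4 import BeautifulSoup

# A description container whose text is not wrapped in <p> tags,
# like the promoted listings on the site.
html = '<div data-automation="jobDescription"><span>No paragraphs here</span></div>'
div = BeautifulSoup(html, 'html.parser').div

print(div.find('p'))                        # None -> .text here would crash
print([p.text for p in div.find_all('p')])  # [] -> the loop just does nothing
```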
