
How to loop through each page of a website for web scraping with BeautifulSoup

Original post: 2017-09-20 23:04:25 · Tags: python, html, web-scraping, beautifulsoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to either update the url on each iteration or somehow click the next button, and then run my existing code on each page. I appreciate any solutions.

url: https://jobs.utcaerospacesystems.com/search-jobs

1 Solution

First, BeautifulSoup has nothing to do with GETting web pages: you fetch the page yourself, then feed it to bs4 for processing.

The problem with the page you linked is that its content is built by JavaScript: it only renders correctly in a browser (or any other JavaScript VM).

@Fabricator is on the right track: you'll need to watch the developer console and see which AJAX requests the JavaScript sends to the server. In this case, also take a look at the query string params, which include one called CurrentPage. That's probably the one you want to focus on.
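Here is a rough sketch of what the loop could look like once you have found that request. Only the CurrentPage param comes from the page itself. The `results_url` argument, the `extra_params` dict, the `"a"` selector, and the "stop at the first page with no matches" rule are all assumptions for illustration, and I am also assuming the endpoint returns plain HTML (if it returns JSON wrapping the HTML, unwrap it before parsing). Swap in whatever the developer console actually shows.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(results_url, page, extra_params=None):
    """GET one page of results. Only CurrentPage is known from the site;
    results_url and extra_params are placeholders you must fill in."""
    params = dict(extra_params or {})
    params["CurrentPage"] = page
    response = requests.get(results_url, params=params, timeout=30)
    response.raise_for_status()
    return response.text


def parse_jobs(html, selector="a"):
    """Return the text of every element matching the selector.
    The default selector is hypothetical; use the one from your own code."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


def scrape_all_pages(fetch, max_pages=50, selector="a"):
    """Call fetch(page) for page 1, 2, ... and collect the parsed jobs.
    Stops at the first page with no matches (an assumed stopping rule)
    or after max_pages, whichever comes first."""
    jobs = []
    for page in range(1, max_pages + 1):
        page_jobs = parse_jobs(fetch(page), selector)
        if not page_jobs:
            break
        jobs.extend(page_jobs)
    return jobs


# Example usage (not run here, since it needs the real request URL):
# results_url = "<request URL copied from the developer console>"
# jobs = scrape_all_pages(lambda page: fetch_page(results_url, page))
```

Passing the fetch step in as a function keeps the paging loop separate from the HTTP call, so you can try the loop on saved HTML before pointing it at the live site.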