
Python crawling with BeautifulSoup: how to crawl several pages?

Please help. I want to get all the company names from each page, and there are 12 pages.

http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2 -- this website only changes the page number.

So here is my code so far. Can I get just the title (company name) from all 12 pages? Thank you in advance.

from bs4 import BeautifulSoup
import requests

maximum = 0
page = 1

URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')

whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text

soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")

for company in find_company:
    print(company.text)

---------Output of one page---------

---------page source :)---------

So, you want to remove all the headers and get only the string of the company name? Basically, you can use soup.findAll to find the list of companies in a format like this:

 <strong class="company"><span>중소기업진흥공단</span></strong> 

Then you use the .find function to extract information from the <span> tag:

 <span>중소기업진흥공단</span> 

After that, you use the .contents attribute to get the string from the <span> tag:

'중소기업진흥공단'
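Putting those three steps together, here is a minimal standalone sketch that parses just the fragment above (rather than the live page) to show the findAll → find → contents chain:

from bs4 import BeautifulSoup

# Sample fragment from above, parsed on its own for illustration
html = '<strong class="company"><span>중소기업진흥공단</span></strong>'
soup = BeautifulSoup(html, 'html.parser')

for entry in soup.findAll('strong', attrs={'class': 'company'}):
    span = entry.find('span')   # <span>중소기업진흥공단</span>
    print(span.contents[0])     # 중소기업진흥공단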

So you write a loop to do the same for each page, and make a list called company_list to store the results from each page, appending them together.

Here's the code:

from bs4 import BeautifulSoup
import requests

maximum = 12

company_list = [] # List for result storing
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number) 
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.findAll('strong', attrs={'class': 'company'}): # Finding all company names in the page
        company_list.append(entry.find('span').contents[0]) # Extracting name from the result

The company_list will give you all the company names you want.
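As a quick sanity check, you could print how many names were collected and look at the first few entries (a minimal sketch, assuming the loop above has already run):

print(len(company_list))        # total number of names scraped across the 12 pages
for name in company_list[:5]:   # first few entries as a spot check
    print(name)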

I figured it out eventually. Thank you for your answer though!

[image: code captured in a Jupyter notebook]

Here is my final code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

company_list = []
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n+1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage, 'html.parser', from_encoding='utf-8')
    companys = source.findAll('strong', {'class': 'company'})

    for company in companys:
        # Strip surrounding whitespace and embedded newlines/tabs from each name
        company_list.append(company.get_text().strip().replace('\n', '').replace('\t', '').replace('\r', ''))

file = open('company_name1.txt', 'w', encoding='utf-8')

for company in company_list:
    file.write(company + '\n')

file.close()
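As a side note, the file writing at the end could also use a with block, which closes the file automatically even if an error occurs mid-write; a minimal sketch of the same step:

# Same output file as above, written via a context manager
with open('company_name1.txt', 'w', encoding='utf-8') as f:
    for company in company_list:
        f.write(company + '\n')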
