Parsing All Pages of HTML using BeautifulSoup

My code works perfectly with one page, but when I try to parse all 28 pages, it parses only the first one and skips the other 27.

The idea is to parse the data from the URL below, which has 28 pages in total. I wrote a for loop so that BeautifulSoup would parse every page. However, it parses only the first page and none of the others.

I would appreciate any recommendations on how to make it work.

Code:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')

    titles = [i.text for i in soup.select('.results-i-title')]
    #print(titles)
    companies = [i.text for i in soup.select('.results-i-company')]
    #print(companies)
    summaries = [i.text for i in soup.select('.results-i-summary')]

df = pd.DataFrame(list(zip(titles, companies, summaries)), columns = ['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig',index = False )

You are overwriting titles, companies and summaries with every iteration of the loop, so only the results of the final iteration are left when the loop ends. Initialize the three lists before the loop and change titles = ... to titles += ... (and likewise for the other two):

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

titles = []
companies = []
summaries = []

for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')

    titles += [i.text for i in soup.select('.results-i-title')]
    companies += [i.text for i in soup.select('.results-i-company')]
    summaries += [i.text for i in soup.select('.results-i-summary')]

df = pd.DataFrame(list(zip(titles, companies, summaries)), columns = ['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig',index = False )
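
To see the difference between = and += in isolation, here is a minimal sketch that runs without any network access. FAKE_PAGES, scrape_page, collect_overwriting and collect_accumulating are hypothetical names made up for this illustration; they stand in for the real requests and BeautifulSoup calls and are not part of either library.

```python
# Hypothetical stand-in for the site: page number -> titles found on that page.
FAKE_PAGES = {
    1: ["Accountant", "Driver"],
    2: ["Designer"],
    3: ["Engineer", "Analyst"],
}


def scrape_page(page):
    """Pretend to fetch and parse one page (assumption: no real HTTP request)."""
    return FAKE_PAGES[page]


def collect_overwriting(pages):
    """Mirrors the question's loop: the name is rebound on every iteration."""
    titles = []
    for page in pages:
        titles = scrape_page(page)  # earlier pages are discarded here
    return titles


def collect_accumulating(pages):
    """Mirrors the answer's loop: each page is appended to one list."""
    titles = []
    for page in pages:
        titles += scrape_page(page)  # extends the existing list in place
    return titles


print(collect_overwriting([1, 2, 3]))   # ['Engineer', 'Analyst']
print(collect_accumulating([1, 2, 3]))  # all five titles, in page order
```

The overwriting version returns only what the last iteration produced, while the accumulating version keeps everything. The same reasoning applies to companies and summaries in the code above.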
