BeautifulSoup: extracting data by searching for a category

Question

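The code below extracts the Arbeitsatmosphäre and Stadt values from the reviews on the kununu page it requests. The extraction is index-based, though, so if we want Image (rating_tags[12]) instead of Arbeitsatmosphäre, we get an error, because some reviews have only 2 or 3 ratings.

I would like to update this code to produce the output below. If a review has no Image rating, use 0 or n/a.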

     Arbeitsatmosphäre | Stadt     | Image
1.   4.00              | Berlin    | 4.00
2.   5.00              | Frankfurt | 3.00
3.   3.00              | Munich    | 3.00
4.   5.00              | Berlin    | 2.00
5.   4.00              | Berlin    | 5.00
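
The "0 or n/a" fallback requested here can be pictured as a small conversion step. The following is only a sketch: `parse_rating` is a hypothetical helper name, and it assumes ratings arrive as strings with a decimal comma (for example '4,00'), which may not hold for every page.

```python
def parse_rating(text, default=0.0):
    """Convert a rating string such as '4,00' to a float.

    Returns `default` when the value is missing (None) or cannot be parsed.
    Hypothetical helper; assumes a decimal comma in the input.
    """
    if text is None:
        return default
    try:
        # Swap the decimal comma for a dot so float() can parse it.
        return float(text.strip().replace(",", "."))
    except ValueError:
        return default


print(parse_rating("4,00"))                 # 4.0
print(parse_rating(None))                   # 0.0
print(parse_rating("n/a", default="n/a"))   # n/a
```

Passing `default="n/a"` keeps the column readable for display, while the numeric default of 0 keeps it usable for arithmetic; which one fits depends on what the table is for.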

My code is below.

import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            rating_tags = article.find_all('span', {'class' : 'rating-badge'})

            arbeit.append(rating_tags[0].text.strip())


            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)

        page += 1

        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)

Answer 1

You can use try/except.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

arbeit = []
stadt = []
image = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            rating_tags = article.find_all('span', {'class' : 'rating-badge'})

            arbeit.append(rating_tags[0].text.strip())

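            try:
                # Find the span labelled "Image", then read the next span (its rating).
                # `string=` is the current name of the older `text=` argument.
                image_text = article.find('span', string=re.compile(r'Image')).find_next('span').text.strip()
                image.append(image_text)
            except AttributeError:
                # find() or find_next() returned None: this review has no Image rating.
                image.append('N/A')
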
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)

        page += 1

        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt, 'Image': image})
print(df)

Output:

Processing page 1..
Number of articles: 10
Processing page 2..
Number of articles: 10
Processing page 3..
Number of articles: 10
Processing page 4..
Number of articles: 4
   Arbeitsatmosphäre      Stadt Image
0               5,00  Wolfsburg  4,00
1               5,00  Wolfsburg  4,00
2               5,00  Wolfsburg  5,00
3               5,00  Wolfsburg  4,00
4               2,00  Wolfsburg  2,00
5               5,00  Wolfsburg  5,00
6               5,00  Wolfsburg  5,00
7               5,00  Wolfsburg  4,00
8               5,00  Wolfsburg  4,00
9               5,00  Wolfsburg  5,00
10              5,00  Wolfsburg  4,00
11              5,00  Wolfsburg  5,00
12              5,00  Wolfsburg  5,00
13              4,00  Wolfsburg  4,00
14              4,00  Wolfsburg  4,00
15              4,00  Wolfsburg  4,00
16              5,00  Wolfsburg  5,00
17              3,00  Wolfsburg  5,00
18              5,00  Wolfsburg  4,00
19              5,00  Wolfsburg  5,00
20              5,00  Wolfsburg  4,00
21              4,00  Wolfsburg  2,00
22              5,00  Wolfsburg  5,00
23              4,00  Wolfsburg   N/A
24              4,00  Wolfsburg  4,00
25              4,00  Wolfsburg  4,50
26              5,00  Wolfsburg  5,00
27              2,33  Wolfsburg  2,00
28              5,00  Wolfsburg  5,00
29              2,00  Wolfsburg  1,00
30              4,00  Wolfsburg  3,00
31              5,00  Wolfsburg  5,00
32              5,00  Wolfsburg  4,00
33              4,00  Wolfsburg  4,00
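
The same label-based lookup can also be written without try/except by checking for `None` explicitly. This is a minimal sketch, not the answer's method: `rating_for` is a hypothetical helper name, and `SAMPLE_HTML` is invented markup that only assumes each label `<span>` is followed by its rating `<span>`, as the `find_next('span')` call in the answer implies.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be structured differently.
SAMPLE_HTML = """
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">4,00</span>
  <span>Image</span><span class="rating-badge">3,00</span>
</article>
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">5,00</span>
</article>
"""


def rating_for(article, label, default="N/A"):
    """Return the text of the first <span> after the span whose text matches `label`."""
    label_span = article.find("span", string=re.compile(re.escape(label)))
    if label_span is None:
        return default  # this review has no rating with that label
    value_span = label_span.find_next("span")
    if value_span is None:
        return default  # label found, but no span follows it
    return value_span.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
images = [rating_for(article, "Image") for article in soup.find_all("article")]
print(images)  # ['3,00', 'N/A']
```

Note that `find_next` keeps searching past the end of the current `<article>`, so if a label were ever the last span in a review, the value returned would come from the following review. That limitation applies to the try/except version as well.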

Answer 1 (score 2, accepted, 2020-01-07 16:11:05)