BeautifulSoup: extract data by searched class
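The code below extracts the Arbeitsatmosphäre and Stadt values from the reviews on the website referenced in the code. The extraction is index-based, though, so if we want Image (rating_tags[12]) instead of Arbeitsatmosphäre, we get an IndexError, because some reviews contain only 2 or 3 ratings.
I would like to update this code to produce the output below. If a review has no Image rating, use 0 or n/a.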
   Arbeitsatmosphäre | Stadt     | Image
1. 4.00              | Berlin    | 4.00
2. 5.00              | Frankfurt | 3.00
3. 3.00              | Munich    | 3.00
4. 5.00              | Berlin    | 2.00
5. 4.00              | Berlin    | 5.00
My code is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []

with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class' : 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)
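You can use try/except: search for the span whose text is Image, take the span that follows it, and fall back to 'N/A' when the review has no Image rating.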
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

arbeit = []
stadt = []
image = []

with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class' : 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            try:
                # Find the "Image" label, then read the value span that follows it.
                imageText = article.find('span', string=re.compile(r'Image')).find_next('span').text.strip()
                image.append(imageText)
            except AttributeError:
                # find() returned None: this review has no Image rating.
                image.append('N/A')
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        # Stop once the page no longer contains a pagination control.
        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt, 'Image': image})
print(df)
Output:
Processing page 1..
Number of articles: 10
Processing page 2..
Number of articles: 10
Processing page 3..
Number of articles: 10
Processing page 4..
Number of articles: 4
Arbeitsatmosphäre Stadt Image
0 5,00 Wolfsburg 4,00
1 5,00 Wolfsburg 4,00
2 5,00 Wolfsburg 5,00
3 5,00 Wolfsburg 4,00
4 2,00 Wolfsburg 2,00
5 5,00 Wolfsburg 5,00
6 5,00 Wolfsburg 5,00
7 5,00 Wolfsburg 4,00
8 5,00 Wolfsburg 4,00
9 5,00 Wolfsburg 5,00
10 5,00 Wolfsburg 4,00
11 5,00 Wolfsburg 5,00
12 5,00 Wolfsburg 5,00
13 4,00 Wolfsburg 4,00
14 4,00 Wolfsburg 4,00
15 4,00 Wolfsburg 4,00
16 5,00 Wolfsburg 5,00
17 3,00 Wolfsburg 5,00
18 5,00 Wolfsburg 4,00
19 5,00 Wolfsburg 5,00
20 5,00 Wolfsburg 4,00
21 4,00 Wolfsburg 2,00
22 5,00 Wolfsburg 5,00
23 4,00 Wolfsburg N/A
24 4,00 Wolfsburg 4,00
25 4,00 Wolfsburg 4,50
26 5,00 Wolfsburg 5,00
27 2,33 Wolfsburg 2,00
28 5,00 Wolfsburg 5,00
29 2,00 Wolfsburg 1,00
30 4,00 Wolfsburg 3,00
31 5,00 Wolfsburg 5,00
32 5,00 Wolfsburg 4,00
33 4,00 Wolfsburg 4,00
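
As an alternative to try/except, the label lookup can be wrapped in a small helper that returns a default when the label is missing. The sketch below is only an illustration, not the answer's code: the inline HTML and the rating-title class are made up for the example (the real kununu markup may differ), and rating_by_label and to_number are hypothetical helper names. It also shows one possible way to turn comma-decimal strings such as 4,00 into numbers.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page):
# each rating is a label <span> followed by a value <span>.
SAMPLE_HTML = """
<article>
  <span class="rating-title">Arbeitsatmosphäre</span><span class="rating-badge">4,00</span>
  <span class="rating-title">Image</span><span class="rating-badge">3,50</span>
</article>
<article>
  <span class="rating-title">Arbeitsatmosphäre</span><span class="rating-badge">5,00</span>
</article>
"""


def rating_by_label(article, label, default="N/A"):
    """Return the text of the span following the span labelled `label`."""
    label_tag = article.find("span", string=re.compile(re.escape(label)))
    if label_tag is None:
        # The review has no rating with this label.
        return default
    value_tag = label_tag.find_next("span")
    if value_tag is None:
        # The label exists but no value span follows it.
        return default
    return value_tag.text.strip()


def to_number(text, default=None):
    """Convert a comma-decimal string such as '4,00' to float, else `default`."""
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "Arbeitsatmosphäre": rating_by_label(article, "Arbeitsatmosphäre"),
        "Image": rating_by_label(article, "Image"),
    }
    for article in soup.find_all("article")
]
print(rows)
print([to_number(row["Image"]) for row in rows])
```

Passing default="0" to rating_by_label gives the zero fallback mentioned in the question instead of N/A.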