BeautifulSoup: extract data by searched class
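The code below extracts the Arbeitsatmosphäre and Stadt values from the reviews on the following website. The extraction is index-based, however, so if we want Image (rating_tags[12]) instead of Arbeitsatmosphäre, we get an error, because sometimes a review has only 2 or 3 items.
I would like to update this code to get the following output. If there is no Image, use 0 or n/a.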
   Arbeitsatmosphäre | Stadt     | Image |
1. 4.00              | Berlin    | 4.00  |
2. 5.00              | Frankfurt | 3.00  |
3. 3.00              | Munich    | 3.00  |
4. 5.00              | Berlin    | 2.00  |
5. 4.00              | Berlin    | 5.00  |
My code is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class' : 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break
df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)
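You can use try/except: find the span whose text is "Image", read the rating from the span that follows it, and append 'N/A' when that lookup fails.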
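To see the lookup in isolation, here is a minimal offline sketch. The HTML is made up for illustration (a label span followed by a rating span) and is only an assumption about the page structure, not kununu's actual markup; `rating_for` is a hypothetical helper name. It uses `string=`, which is the current name of the argument that the script below spells `text=`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each rating is a label <span> followed by a value <span>.
# The second review has no "Image" rating at all.
SAMPLE_HTML = """
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">4,00</span>
  <span>Image</span><span class="rating-badge">3,00</span>
</article>
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">5,00</span>
</article>
"""

def rating_for(article, label, default="N/A"):
    """Return the text of the span that follows the span labelled `label`."""
    try:
        label_span = article.find("span", string=re.compile(label))
        return label_span.find_next("span").text.strip()
    except AttributeError:
        # find() or find_next() returned None: the label is not in this review.
        return default

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
images = [rating_for(article, "Image") for article in soup.find_all("article")]
print(images)  # ['3,00', 'N/A']
```

Because the rating is found by its label rather than by position, a review with fewer ratings no longer shifts the indices. The full script below applies the same lookup inside the page loop.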
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

arbeit = []
stadt = []
image = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            # The first rating badge of each review is Arbeitsatmosphäre.
            rating_tags = article.find_all('span', {'class' : 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            # Look up the Image rating by its label instead of by index.
            try:
                imageText = article.find('span', text=re.compile(r'Image')).find_next('span').text.strip()
                image.append(imageText)
            except AttributeError:
                # find() returned None: this review has no Image rating.
                image.append('N/A')
            # The city is the second <div> of the second <li> in the details block.
            detail_div = article.find_all('div', {'class' : 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        # Stop when the page has no pagination control.
        pagination = soup.find_all('div', {'class' : 'paginationControl'})
        if not pagination:
            break
df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt, 'Image': image})
print(df)
Output:
Processing page 1..
Number of articles: 10
Processing page 2..
Number of articles: 10
Processing page 3..
Number of articles: 10
Processing page 4..
Number of articles: 4
    Arbeitsatmosphäre      Stadt  Image
0                5,00  Wolfsburg   4,00
1                5,00  Wolfsburg   4,00
2                5,00  Wolfsburg   5,00
3                5,00  Wolfsburg   4,00
4                2,00  Wolfsburg   2,00
5                5,00  Wolfsburg   5,00
6                5,00  Wolfsburg   5,00
7                5,00  Wolfsburg   4,00
8                5,00  Wolfsburg   4,00
9                5,00  Wolfsburg   5,00
10               5,00  Wolfsburg   4,00
11               5,00  Wolfsburg   5,00
12               5,00  Wolfsburg   5,00
13               4,00  Wolfsburg   4,00
14               4,00  Wolfsburg   4,00
15               4,00  Wolfsburg   4,00
16               5,00  Wolfsburg   5,00
17               3,00  Wolfsburg   5,00
18               5,00  Wolfsburg   4,00
19               5,00  Wolfsburg   5,00
20               5,00  Wolfsburg   4,00
21               4,00  Wolfsburg   2,00
22               5,00  Wolfsburg   5,00
23               4,00  Wolfsburg    N/A
24               4,00  Wolfsburg   4,00
25               4,00  Wolfsburg   4,50
26               5,00  Wolfsburg   5,00
27               2,33  Wolfsburg   2,00
28               5,00  Wolfsburg   5,00
29               2,00  Wolfsburg   1,00
30               4,00  Wolfsburg   3,00
31               5,00  Wolfsburg   5,00
32               5,00  Wolfsburg   4,00
33               4,00  Wolfsburg   4,00
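The ratings in the output above are strings with a German decimal comma ('4,00'), plus 'N/A' for missing values. If numbers are needed, for example to average the ratings, one possible conversion is sketched below. The sample frame and the column names `Image_num` and `Image_or_zero` are made up for illustration.

```python
import pandas as pd

# Sample values in the same string format as the scraped output.
df = pd.DataFrame({"Image": ["4,00", "N/A", "4,50"]})

# Swap the decimal comma for a point, then parse.
# errors="coerce" turns anything unparseable (here 'N/A') into NaN.
df["Image_num"] = pd.to_numeric(
    df["Image"].str.replace(",", ".", regex=False), errors="coerce"
)

# Optional: use 0 for missing ratings, as the question suggests.
df["Image_or_zero"] = df["Image_num"].fillna(0)
print(df)
```

Keeping NaN is usually the safer default, since pandas skips NaN when computing a mean, whereas a 0 would count as a real rating and pull the average down.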