beautifulsoup extract data by searched class
The code below extracts the Arbeitsatmosphäre and Stadt data from the reviews on the website below. But the extraction is index-based, so if we want to extract not Arbeitsatmosphäre but Image (rating_tags[12]), we get an error, because sometimes a review has only 2 or 3 rating items.
I would like to update this code to get the output below. If a review has no Image rating, use 0 or N/A.
Arbeitsatmosphare | Stadt | Image |
1. 4.00 | Berlin | 4.00 |
2. 5.00 | Frankfurt | 3.00 |
3. 3.00 | Munich | 3.00 |
4. 5.00 | Berlin | 2.00 |
5. 4.00 | Berlin | 5.00 |
My code is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
stadt = []

with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class': 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            detail_div = article.find_all('div', {'class': 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre': arbeit, 'Stadt': stadt})
print(df)
You can use try/except:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

arbeit = []
stadt = []
image = []

with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            rating_tags = article.find_all('span', {'class': 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            try:
                # look up the rating by its "Image" label instead of by index
                imageText = article.find('span', text=re.compile(r'Image')).find_next('span').text.strip()
                image.append(imageText)
            except AttributeError:
                # no "Image" label in this review: find() returned None
                image.append('N/A')
            detail_div = article.find_all('div', {'class': 'review-details'})[0]
            nodes = detail_div.find_all('li')
            stadt_node = nodes[1]
            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre': arbeit, 'Stadt': stadt, 'Image': image})
print(df)
Output:
Processing page 1..
Number of articles: 10
Processing page 2..
Number of articles: 10
Processing page 3..
Number of articles: 10
Processing page 4..
Number of articles: 4
Arbeitsatmosphäre Stadt Image
0 5,00 Wolfsburg 4,00
1 5,00 Wolfsburg 4,00
2 5,00 Wolfsburg 5,00
3 5,00 Wolfsburg 4,00
4 2,00 Wolfsburg 2,00
5 5,00 Wolfsburg 5,00
6 5,00 Wolfsburg 5,00
7 5,00 Wolfsburg 4,00
8 5,00 Wolfsburg 4,00
9 5,00 Wolfsburg 5,00
10 5,00 Wolfsburg 4,00
11 5,00 Wolfsburg 5,00
12 5,00 Wolfsburg 5,00
13 4,00 Wolfsburg 4,00
14 4,00 Wolfsburg 4,00
15 4,00 Wolfsburg 4,00
16 5,00 Wolfsburg 5,00
17 3,00 Wolfsburg 5,00
18 5,00 Wolfsburg 4,00
19 5,00 Wolfsburg 5,00
20 5,00 Wolfsburg 4,00
21 4,00 Wolfsburg 2,00
22 5,00 Wolfsburg 5,00
23 4,00 Wolfsburg N/A
24 4,00 Wolfsburg 4,00
25 4,00 Wolfsburg 4,50
26 5,00 Wolfsburg 5,00
27 2,33 Wolfsburg 2,00
28 5,00 Wolfsburg 5,00
29 2,00 Wolfsburg 1,00
30 4,00 Wolfsburg 3,00
31 5,00 Wolfsburg 5,00
32 5,00 Wolfsburg 4,00
33 4,00 Wolfsburg 4,00
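If you prefer to avoid try/except, the same label-based lookup can be wrapped in a small helper that returns 'N/A' when a label is missing. A minimal sketch on hypothetical markup (the real kununu pages may pair label and value spans differently, so treat the HTML below as an assumption):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical review markup: each rating is a label span followed by a value span.
html = """
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">4,00</span>
  <span>Image</span><span class="rating-badge">3,00</span>
</article>
<article>
  <span>Arbeitsatmosphäre</span><span class="rating-badge">5,00</span>
</article>
"""

def rating_for(article, label):
    """Return the rating value that follows the given label span, or 'N/A'."""
    tag = article.find('span', string=re.compile(label))
    return tag.find_next('span').text.strip() if tag else 'N/A'

soup = BeautifulSoup(html, 'html.parser')
for article in soup.find_all('article'):
    print(rating_for(article, 'Image'))
```

Note the `string=` keyword, which is the current BeautifulSoup spelling of the older `text=` argument; both match the tag's text content.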