
BeautifulSoup class value extraction in Python

I'm trying to extract values from an HTML page using BeautifulSoup.

I updated Jack's code, and it now extracts the ratings from the reviews. But I have two issues:

1. It extracts ratings only from the first 10 reviews.
2. I would also like to extract a third column, the date, which is located in the upper left of each review.

Could you please help me?

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.kununu.com/de/allianz-deutschland/kommentare'
page = requests.get(url)

soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="col-xs-12 col-lg-12")

titles = []  # rating titles, in page order
badges = []  # rating values, in page order
for item in divs[0].find_all('span', class_="rating-title"):
    titles.append(item.text.strip())
for item in divs[0].find_all('span', class_="rating-badge"):
    badges.append(item.text.strip())

# Pair each title with its badge and build the DataFrame
my_list = list(zip(titles, badges))
df = pd.DataFrame(my_list, columns=['rating-title', 'rating-badge'])
print(df)

Output
                    rating-title rating-badge
0              Arbeitsatmosphäre         5,00
1          Vorgesetztenverhalten         2,00
2           Kollegenzusammenhalt         5,00
3          Interessante Aufgaben         4,00
4                  Kommunikation         3,00
..                           ...          ...
125    Gehalt / Sozialleistungen         4,00
126           Arbeitsbedingungen         4,00
127  Umwelt- / Sozialbewusstsein         3,00
128            Work-Life-Balance         5,00
129                        Image         4,00

[130 rows x 2 columns]
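
For the two open issues, one possible approach is to walk each review block separately, so that its date stays attached to its ratings, and to repeat that for every page. The sketch below is only an illustration: the `article`/`review`/`review-date` markup, the `/kommentare/<page>` URL pattern, and the helper names `parse_reviews` and `scrape_pages` are all assumptions, not taken from the real site. It runs on an inline sample instead of the network.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real kununu tag and class names may differ.
SAMPLE_PAGE = """
<article class="review">
  <span class="review-date">Mai 2020</span>
  <span class="rating-title">Arbeitsatmosphäre</span><span class="rating-badge">5,00</span>
  <span class="rating-title">Image</span><span class="rating-badge">4,00</span>
</article>
<article class="review">
  <span class="review-date">April 2020</span>
  <span class="rating-title">Arbeitsatmosphäre</span><span class="rating-badge">2,00</span>
</article>
"""


def parse_reviews(html):
    """Return one (date, title, badge) row per rating, grouped by review."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for review in soup.find_all("article", class_="review"):
        date_tag = review.find("span", class_="review-date")
        date = date_tag.get_text(strip=True) if date_tag else None
        titles = review.find_all("span", class_="rating-title")
        badges = review.find_all("span", class_="rating-badge")
        for title, badge in zip(titles, badges):
            rows.append((date, title.get_text(strip=True), badge.get_text(strip=True)))
    return rows


def scrape_pages(fetch, base_url, last_page):
    """Collect rows from pages 1..last_page; `fetch` maps a URL to HTML text."""
    rows = []
    for page in range(1, last_page + 1):
        # Assumed URL pattern: the page number appended to the base URL.
        rows.extend(parse_reviews(fetch(f"{base_url}/{page}")))
    return rows


# Stand-in fetcher that returns the sample for every page.
rows = scrape_pages(
    lambda url: SAMPLE_PAGE,
    "https://www.kununu.com/de/allianz-deutschland/kommentare",
    2,
)
for row in rows:
    print(row)
```

Against the live site, `fetch` could be `lambda url: requests.get(url).text`, and `pd.DataFrame(rows, columns=['date', 'rating-title', 'rating-badge'])` would give the three-column table. If the site loads further reviews with JavaScript, plain page requests may not be enough.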

You haven't gone into the nested elements. You just grabbed and printed the parent element.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.kununu.com/de/allianz-deutschland/kommentare'


page = requests.get(url).text

soup = BeautifulSoup(page, 'html.parser')
div = soup.find(class_="col-md-9 col-sm-12 col-xs-12 flex-left")

row = div.find('div', {'class':'row'})

titles = [ x.text.strip() for x in row.find_all('span', {'class':'rating-title'}) ] 
ratings = [ x.text.strip() for x in row.find_all('div', {'class':'rating-stars'}) ]
data_tuples = list(zip(titles,ratings))

df = pd.DataFrame(data_tuples, columns=['title', 'ratings'])
print(df)

Output:

                          title ratings
0             Arbeitsatmosphäre    3,62
1         Vorgesetztenverhalten    3,49
2          Kollegenzusammenhalt    3,92
3         Interessante Aufgaben    3,78
4                 Kommunikation    3,44
5            Arbeitsbedingungen    3,70
6   Umwelt- / Sozialbewusstsein    3,76
7             Work-Life-Balance    3,54
8            Gleichberechtigung    3,94
9   Umgang mit älteren Kollegen    3,88
10     Karriere / Weiterbildung    3,52
11    Gehalt / Sozialleistungen    3,60
12                        Image    3,80
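
The point of this answer, that the values sit in nested elements rather than in the parent, can be seen on a small made-up snippet. The HTML below is invented for illustration and is not the real page markup.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the real kununu markup.
html = """
<div class="row">
  <div class="rating"><span class="rating-title">Image</span><span class="rating-badge">3,80</span></div>
  <div class="rating"><span class="rating-title">Kommunikation</span><span class="rating-badge">3,44</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="row")

# The parent's text is every nested string run together.
parent_text = row.get_text(strip=True)

# Searching inside the parent returns the nested values one by one.
titles = [tag.get_text(strip=True) for tag in row.find_all("span", class_="rating-title")]
badges = [tag.get_text(strip=True) for tag in row.find_all("span", class_="rating-badge")]

print(parent_text)
print(list(zip(titles, badges)))
```

Printing the parent gives one undivided string, while `find_all` on the parent yields separate title and badge values that can be paired.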

The following should get your data into a pandas DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.kununu.com/de/allianz-deutschland/kommentare'
page = requests.get(url)

soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="col-md-9 col-sm-12 col-xs-12 flex-left")

titles = [] #this initializes a list of titles
badges = [] #this initializes a list of badges
for item in divs[0].find_all('span',class_="rating-title"):
    titles.append(item.text.strip())
for item in divs[0].find_all('span',class_="rating-badge"):
    badges.append(item.text.strip())

my_list = list(zip(titles, badges)) #this takes the two lists, zips them and converts the zip element back to a list
df = pd.DataFrame(my_list, columns = ['rating-title', 'rating-badge']) 
df

Output:

    rating-title    rating-badge
0   Arbeitsatmosphäre   3,62
1   Vorgesetztenverhalten   3,49
2   Kollegenzusammenhalt    3,92

etc.
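
One caveat with collecting titles and badges in two separate passes: `zip` stops at the shorter list without complaint, so a single missing badge would shift or drop pairs unnoticed. A minimal guard, assuming a hypothetical helper named `pair_up`, might look like this:

```python
def pair_up(titles, badges):
    """Zip two scraped lists, refusing to pair them if the lengths differ."""
    if len(titles) != len(badges):
        raise ValueError(f"{len(titles)} titles but {len(badges)} badges")
    return list(zip(titles, badges))


pairs = pair_up(["Arbeitsatmosphäre", "Image"], ["3,62", "3,80"])
print(pairs)
```

A length mismatch then fails loudly instead of producing a misaligned DataFrame.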

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.kununu.com/de/allianz-deutschland/kommentare')
soup = BeautifulSoup(r.text, 'html.parser')

rates = []
stars = []
# Each matching column holds rating titles and their badges
for block in soup.find_all('div', class_='col-lg-6 col-md-12 col-sm-12 col-xs-12'):
    for item in block.find_all('span', class_='rating-title'):
        rates.append(item.text.strip())
    for item in block.find_all('span', class_='rating-badge'):
        stars.append(item.text.strip())

for a, b in zip(rates, stars):
    print("Name: {:<30} Stars: {:>5}".format(a, b))

Output:

Name: Arbeitsatmosphäre              Stars:  3,62
Name: Vorgesetztenverhalten          Stars:  3,49
Name: Kollegenzusammenhalt           Stars:  3,92
Name: Interessante Aufgaben          Stars:  3,78
Name: Kommunikation                  Stars:  3,44
Name: Arbeitsbedingungen             Stars:  3,70
Name: Umwelt- / Sozialbewusstsein    Stars:  3,76
Name: Work-Life-Balance              Stars:  3,54
Name: Gleichberechtigung             Stars:  3,94
Name: Umgang mit älteren Kollegen    Stars:  3,88
Name: Karriere / Weiterbildung       Stars:  3,52
Name: Gehalt / Sozialleistungen      Stars:  3,60
Name: Image                          Stars:  3,80
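
The scraped ratings are strings with a decimal comma, so they need converting before any arithmetic. A small sketch, using a hypothetical helper `to_float`:

```python
def to_float(rating):
    """Convert a rating such as '3,62' (decimal comma) to the float 3.62."""
    return float(rating.strip().replace(",", "."))


stars = ["3,62", "3,49", "3,92"]
values = [to_float(s) for s in stars]
average = sum(values) / len(values)
print(values, round(average, 2))
```

With pandas, the same conversion could be applied to a whole column via `df['rating-badge'].str.replace(',', '.').astype(float)`.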
