Scrape values inside span class HTML with beautifulsoup python

I am trying to scrape data inside a span class and put that data into a DataFrame using BeautifulSoup. So far, I've managed to get to the right place on the web page, but I can't seem to scrape the keywords and numbers next to "Happiness" and "Sadness".

<span class="text-border tooltips" data-original-title="Happiness 84%
 Sadness 80%
 " data-placement="left" data-toggle="tooltip">More stats</span>,
 <span class="text-border tooltips" data-original-title="Happiness 70%
 Sadness 59%
 " data-placement="left" data-toggle="tooltip">More stats</span>

It would be super helpful if someone could help me figure out how to scrape all the numbers next to Happiness and Sadness and have them as columns in a pandas DataFrame.

Thanks a lot.

You could do something like this:

from bs4 import BeautifulSoup

s = """
<span class="text-border tooltips" data-original-title="Happiness 84%
 Sadness 80%
 " data-placement="left" data-toggle="tooltip">More stats</span>,
 <span class="text-border tooltips" data-original-title="Happiness 70%
 Sadness 59%
 " data-placement="left" data-toggle="tooltip">More stats</span>
"""

soup = BeautifulSoup(s, "lxml")
spans = soup.find_all("span") #get all spans
for span in spans:
    # read the attribute, split it into lines, and drop whitespace-only lines
    data = [line.strip() for line in span["data-original-title"].splitlines() if line.strip()]
    happiness = data[0].replace("Happiness ", "").rstrip("%")  # remove the label and the % sign
    sadness = data[1].replace("Sadness ", "").rstrip("%")
    print("{} {}".format(happiness, sadness))
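
The loop above only prints the values, while the question asks for DataFrame columns. A minimal sketch of that last step follows, assuming every tooltip entry has the form "<label> <integer>%". The names `PAIR` and `tooltip_to_row` are made up for this illustration, and the standard-library "html.parser" is swapped in for "lxml" only so the example has one less dependency.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

html = """
<span class="text-border tooltips" data-original-title="Happiness 84%
 Sadness 80%
 " data-placement="left" data-toggle="tooltip">More stats</span>,
 <span class="text-border tooltips" data-original-title="Happiness 70%
 Sadness 59%
 " data-placement="left" data-toggle="tooltip">More stats</span>
"""

# Assumption: each tooltip entry looks like "<Label> <integer>%".
PAIR = re.compile(r"(\w+)\s+(\d+)%")


def tooltip_to_row(title):
    """Map each label in a tooltip string to its integer percentage."""
    return {label: int(value) for label, value in PAIR.findall(title)}


# "html.parser" ships with Python; "lxml" should behave the same for this input.
soup = BeautifulSoup(html, "html.parser")
rows = [tooltip_to_row(span["data-original-title"]) for span in soup.find_all("span")]

# One row per span, one column per label.
df = pd.DataFrame(rows)
print(df)
```

Because the regular expression matches on any whitespace between label and number, the leading spaces and trailing newline inside the attribute do not need separate handling.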

If it's guaranteed that every span has a data-original-title attribute, and the title is always in the format "Happiness<SPACE><PERCENTAGE><NEW LINE>Sadness<SPACE><PERCENTAGE>", then the following should work for you.

>>> import itertools
>>> import re
>>> import pandas as pd
>>> import bs4
>>> html = """<span class="text-border tooltips" data-original-title="Happiness 84%
...  Sadness 80%
...  " data-placement="left" data-toggle="tooltip">More stats</span>,
...  <span class="text-border tooltips" data-original-title="Happiness 70%
...  Sadness 59%
...  " data-placement="left" data-toggle="tooltip">More stats</span>"""
>>> soup = bs4.BeautifulSoup(html, 'lxml')
>>> all_rows = []
>>> for span in soup.find_all('span'):
...     title_eles = re.split(' |\n', span['data-original-title'])
...     title_eles = list(filter(None, title_eles))
...     row = dict(itertools.zip_longest(title_eles[::2], title_eles[1::2], fillvalue=""))
...     all_rows.append(row)
...
>>> pd.DataFrame(all_rows)
  Happiness Sadness
0       84%     80%
1       70%     59%
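
The values in the DataFrame above are strings such as "84%", so they cannot be averaged or sorted numerically as they stand. If numeric columns are wanted, one possible follow-up is sketched below. It rebuilds the same small table by hand so the snippet runs on its own, and the name `numeric` is only illustrative.

```python
import pandas as pd

# Same contents as the DataFrame printed above: percentages stored as strings.
df = pd.DataFrame({"Happiness": ["84%", "70%"], "Sadness": ["80%", "59%"]})

# Strip the trailing "%" from every column, then convert the text to integers.
numeric = df.apply(lambda col: col.str.rstrip("%").astype(int))

print(numeric)
print(numeric["Happiness"].mean())
```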

Also, the reason soup.find_all(class_='data-original-title') returns an empty list is that data-original-title is an attribute in your HTML. It's not a class.
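
To make the difference concrete, here is a small sketch comparing the three lookups. The HTML is a shortened, hypothetical version of the markup in the question, with a second span added that has no tooltip, and the variable names are illustrative. Passing `True` as an attribute value in `attrs` asks BeautifulSoup for tags that have that attribute, whatever its value.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (hypothetical): one tooltip span, one plain span.
html = (
    '<span class="text-border tooltips" data-original-title="Happiness 84%">More stats</span>'
    '<span class="other">No tooltip</span>'
)

soup = BeautifulSoup(html, "html.parser")

# Looks for a CSS class named "data-original-title"; no tag has such a class.
by_wrong_class = soup.find_all(class_="data-original-title")

# Matches on one of the span's actual classes.
by_class = soup.find_all("span", class_="tooltips")

# Matches any span that carries the data-original-title attribute.
by_attr = soup.find_all("span", attrs={"data-original-title": True})

print(len(by_wrong_class), len(by_class), len(by_attr))
```

Filtering on the attribute also covers the case where some spans on the page lack a tooltip, since those spans are skipped rather than raising a KeyError later.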
