
Extracting text from span tag with no classes in BeautifulSoup

I'm trying to extract data from a website for the purposes of finishing a small data analysis project. Here is the HTML source that I'm dealing with (all the divs that I want to extract data from have exactly the same structure).

import requests
from bs4 import BeautifulSoup

url = "https://www.rystadenergy.com/newsevents/news/press-releases/"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")


   <div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
     <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
      <small class="mb-3 d-flex flex-wrap justify-content-between">
       <time datetime="2020-11-30">
        November 30, 2020
       </time>
       <span>
        Oil Markets
       </span>
      </small>
      <h5 class="mb-0">
       Prices at stake: If OPEC+ increases output in January, a 200 million-barrel glut will build through May
      </h5>
     </a>
    </div>

Fortunately, I succeeded in extracting the titles of the articles and their publishing dates. I first created a bs4.element.ResultSet and then wrote a loop to iterate through each date as follows, and it worked properly (the same approach worked for the article titles).

divs = soup.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

dates = []
for container in divs:
    date = container.find('time')
    dates.append(date['datetime'])

However, when I tried to extract the category of each article, which lives between <span></span> (Oil Markets in my case), I got the error 'NoneType' object has no attribute 'text'. The code I used was:

topics = []
for container in divs:
    topic = container.find('span').text
    topics.append(topic)

The weird thing here is that when I print(topics), I get a list containing more elements than there actually are (almost 800 elements, sometimes even more), and the elements are mixed: it holds strings and bs4 element tags at the same time. Here is a snapshot of the list I got:

</span>, <span> E&amp;P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',

My aim is to extract the categories as a list of strings (there should be 207 categories in total) in order to populate them later into a data frame along with the dates and titles.

I've tried the solutions here and here and here but with no success. I was wondering if someone could help me fix this problem.

Your code is fine, you just have to add a try/except to avoid crashing on some articles that have no category.

The snippet below illustrates it:

from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')

divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

for container in divs:
    topic = container.find('span')
    if not topic:
        print(container)

Output:

<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>

As you can see, there is no span element.

So in your case:

topics = []
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:
        # container.find('span') returned None, so .text raised
        topic = ''
    finally:
        topics.append(topic)

Note that this is just one way to do it :)
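Another option is to check find()'s return value before touching .text, which avoids the try/except entirely. Here is a minimal self-contained sketch; the inline HTML is a hypothetical two-item fragment mirroring the page's structure (one item with a category span, one without, like the Gullkronen press releases), not the live site:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first item has a category <span>, the second does not.
html = """
<div class="news-events-list__item">
  <small><time datetime="2020-11-30">November 30, 2020</time><span> Oil Markets </span></small>
</div>
<div class="news-events-list__item">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

topics = []
for container in soup.find_all("div", class_="news-events-list__item"):
    span = container.find("span")
    # find() returns None when no <span> exists, so guard before reading .text
    topics.append(span.text.strip() if span else "")

print(topics)  # ['Oil Markets', '']
```

The conditional expression keeps the "missing span" case explicit and guarantees topics stays aligned one-to-one with the divs, which matters when you later zip it with the dates and titles into a data frame.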
