
Scrape News with BeautifulSoup

I am trying to scrape news articles from a website. I am only interested in articles that contain a <span class="news_headline"> with the text "Transfers". From these articles I want to extract the text inside the spans within <div class="news_text">. The result should end up in a CSV file and look something like this:

R.Wolf; wechselt für 167.000 von Computer zu; Hauke  
Weiner; wechselt für 167.000 von Computer zu; Hauke  
Gonther; wechselt für 770.000 von Computer zu; Hauke 

or

3378; wechselt für 167.000 von Computer zu; 514102  
3605; wechselt für 167.000 von Computer zu; 514102  
1197; wechselt für 770.000 von Computer zu; 514102

I am very new to programming, so I hope someone can help.

<div class="single_news_container">
    <div class="news_body_right single_news_content">
        <div class="cont_news_headlines">
          <span class="wrapper_news_headlines">
            <span class="news_headline">Transfers</span>
          </span>
        </div>
        <div class="news_text">
            <div>
                <p>
                    <span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
                </p>
                <p>
                    <span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
                </p>
                <p>
                    <span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
                </p>
            </div>
        </div>
    </div>
</div>

First, check the nested structure of the HTML. The data you want to scrape is not wrapped in the div you mention; both the headline and the text are wrapped in <div class="news_body_right single_news_content">. So run find_all on that div and then loop over the results, checking whether each div's news headline contains 'Transfers'. Only then extract the data, for example by populating an empty list, loading it into pandas and saving it to CSV:

Since find_all returns a list, you can iterate over it directly.

from bs4 import BeautifulSoup
import pandas as pd

html='''<div class="single_news_container">
<div class="news_body_right single_news_content">
    <div class="cont_news_headlines">
      <span class="wrapper_news_headlines">
        <span class="news_headline">Transfers</span>
      </span>
    </div>
    <div class="news_text">
        <div>
            <p>
                <span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
            </p>
            <p>
                <span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
            </p>
            <p>
                <span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
            </p>
        </div>
    </div>
</div>
</div>'''

soup = BeautifulSoup(html,'html.parser')

data = []

for news in soup.find_all("div", class_="news_body_right single_news_content"):
    # only process articles whose headline contains 'Transfers'
    if 'Transfers' in news.find("span", class_="news_headline").get_text():
        for i in news.find("div", class_="news_text").find_all('span'):
            # first link is the player, second link is the receiving user
            subject = i.find_all('a')[0].get_text()
            # text between 'für' and 'von' is the amount; convert the German number format
            amount = i.get_text().split('für ', 1)[1].split(' von')[0].replace('.', '').replace(',', '.')
            from_player = i.get_text().split('von ', 1)[1].split(' zu')[0]
            to_player = i.find_all('a')[1].get_text()
            data.append({'subject': subject, 'amount': amount, 'from_player': from_player, 'to_player': to_player})

df = pd.DataFrame(data)
df.to_csv('output.csv')

Result:

subject amount from_player to_player
0 R. Wolf 167000 Computer Hauke
1 Weiner 167000 Computer Hauke
2 Gonther 770000 Computer Christoph
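
If you prefer the second output format from your question (numeric IDs instead of names), the IDs are part of the href attributes of the two links, so you can take them from there instead of the link text. Below is a minimal sketch under the assumption that the hrefs keep the pattern from your sample (player links ending in -<id>, user links ending in /<id>); the file name output_ids.csv is just an example, and sep=';' produces the semicolon-separated layout you described:

rows = []

for news in soup.find_all("div", class_="news_body_right single_news_content"):
    if 'Transfers' in news.find("span", class_="news_headline").get_text():
        for i in news.find("div", class_="news_text").find_all('span'):
            links = i.find_all('a')
            player_id = links[0]['href'].rsplit('-', 1)[-1]  # e.g. "3378" (assumes "...-<id>" pattern)
            user_id = links[1]['href'].rsplit('/', 1)[-1]    # e.g. "514102" (assumes ".../<id>" pattern)
            middle = links[0].next_sibling.strip()           # text between the two links, e.g. "wechselt für 167.000 von Computer zu"
            rows.append([player_id, middle, user_id])

pd.DataFrame(rows).to_csv('output_ids.csv', sep=';', index=False, header=False)

index=False and header=False keep the DataFrame index and column names out of the file, so each line contains only the three fields.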
