
Scrape News with BeautifulSoup

I am trying to scrape news articles from a website. I am only interested in articles that contain a <span class="news_headline"> with the text "Transfers". From these articles I want to extract the text inside the spans within <div class="news_text">. The result should end up in a CSV file and look something like this:

R.Wolf; wechselt für 167.000 von Computer zu; Hauke  
Weiner; wechselt für 167.000 von Computer zu; Hauke  
Gonther; wechselt für 770.000 von Computer zu; Hauke 

or

3378; wechselt für 167.000 von Computer zu; 514102  
3605; wechselt für 167.000 von Computer zu; 514102  
1197; wechselt für 770.000 von Computer zu; 514102

I am very new to programming, so I hope someone can help.

<div class="single_news_container">
    <div class="news_body_right single_news_content">
        <div class="cont_news_headlines">
          <span class="wrapper_news_headlines">
            <span class="news_headline">Transfers</span>
          </span>
        </div>
        <div class="news_text">
            <div>
                <p>
                    <span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
                </p>
                <p>
                    <span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
                </p>
                <p>
                    <span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
                </p>
            </div>
        </div>
    </div>
</div>

First, check the nested structure of the HTML. The data you want to scrape is not wrapped in the div you mention; both the headline and the text are wrapped in <div class="news_body_right single_news_content">. So run find_all on that div and then loop over the results, checking whether each div's news headline contains 'Transfers'. Only then extract the data, for example by populating an empty list, loading it into pandas and saving it to CSV:

Since find_all returns a list, you can iterate over it directly.

from bs4 import BeautifulSoup
import pandas as pd

html='''<div class="single_news_container">
<div class="news_body_right single_news_content">
    <div class="cont_news_headlines">
      <span class="wrapper_news_headlines">
        <span class="news_headline">Transfers</span>
      </span>
    </div>
    <div class="news_text">
        <div>
            <p>
                <span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
            </p>
            <p>
                <span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
            </p>
            <p>
                <span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
            </p>
        </div>
    </div>
</div>
</div>'''

soup = BeautifulSoup(html,'html.parser')

data = []

for news in soup.find_all("div", class_="news_body_right single_news_content"):
    # only process articles whose headline contains 'Transfers'
    if 'Transfers' in news.find("span", class_="news_headline").get_text():
        for i in news.find("div", class_="news_text").find_all('span'):
            # first link is the player, second link is the receiving user
            subject = i.find_all('a')[0].get_text()
            # text between 'für' and 'von' is the amount; convert the German number format
            amount = i.get_text().split('für ', 1)[1].split(' von')[0].replace('.', '').replace(',', '.')
            from_player = i.get_text().split('von ', 1)[1].split(' zu')[0]
            to_player = i.find_all('a')[1].get_text()
            data.append({'subject': subject, 'amount': amount, 'from_player': from_player, 'to_player': to_player})

df = pd.DataFrame(data)
df.to_csv('output.csv')

Result:

subject amount from_player to_player
0 R. Wolf 167000 Computer Hauke
1 Weiner 167000 Computer Hauke
2 Gonther 770000 Computer Christoph
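
If you prefer the second output format from your question (numeric IDs instead of names), the IDs are part of the href attributes of the two links, so you can take them from there instead of the link text. Below is a minimal sketch under the assumption that the hrefs keep the pattern from your sample (player links ending in -<id>, user links ending in /<id>); the file name output_ids.csv is just an example, and sep=';' produces the semicolon-separated layout you described:

rows = []

for news in soup.find_all("div", class_="news_body_right single_news_content"):
    if 'Transfers' in news.find("span", class_="news_headline").get_text():
        for i in news.find("div", class_="news_text").find_all('span'):
            links = i.find_all('a')
            player_id = links[0]['href'].rsplit('-', 1)[-1]  # e.g. "3378" (assumes "...-<id>" pattern)
            user_id = links[1]['href'].rsplit('/', 1)[-1]    # e.g. "514102" (assumes ".../<id>" pattern)
            middle = links[0].next_sibling.strip()           # text between the two links, e.g. "wechselt für 167.000 von Computer zu"
            rows.append([player_id, middle, user_id])

pd.DataFrame(rows).to_csv('output_ids.csv', sep=';', index=False, header=False)

index=False and header=False keep the DataFrame index and column names out of the file, so each line contains only the three fields.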
