I am trying to scrape news articles from a website. I am only interested in articles that contain a <span class="news_headline"> with the text "Transfers". From those articles I want to extract the text inside the spans of <div class="news_text">. The result should end up in a CSV file and look something like this:
R.Wolf; wechselt für 167.000 von Computer zu; Hauke
Weiner; wechselt für 167.000 von Computer zu; Hauke
Gonther; wechselt für 770.000 von Computer zu; Hauke
or
3378; wechselt für 167.000 von Computer zu; 514102
3605; wechselt für 167.000 von Computer zu; 514102
1197; wechselt für 770.000 von Computer zu; 514102
I am very new to programming, so I hope someone can help.
<div class="single_news_container">
<div class="news_body_right single_news_content">
<div class="cont_news_headlines">
<span class="wrapper_news_headlines">
<span class="news_headline">Transfers</span>
</span>
</div>
<div class="news_text">
<div>
<p>
<span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
</p>
</div>
</div>
</div>
</div>
First, check the nested structure of the HTML. The headline is not inside the news_text div you mention; the headline and the text are both wrapped in <div class="news_body_right single_news_content">. So run find_all on that div and, since find_all returns a list, loop over the results and check whether the news headline inside each div contains 'Transfers'. Only then extract the data, for example by appending to an initially empty list, loading that list into pandas and saving it to CSV:
from bs4 import BeautifulSoup
import pandas as pd
html='''<div class="single_news_container">
<div class="news_body_right single_news_content">
<div class="cont_news_headlines">
<span class="wrapper_news_headlines">
<span class="news_headline">Transfers</span>
</span>
</div>
<div class="news_text">
<div>
<p>
<span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
</p>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
data = []
# Each article sits in its own content div
for news in soup.find_all("div", class_="news_body_right single_news_content"):
    headline = news.find("span", class_="news_headline")
    # Skip articles that have no headline or a different one
    if headline is None or 'Transfers' not in headline.get_text():
        continue
    for i in news.find("div", class_="news_text").find_all('span'):
        text = i.get_text()
        links = i.find_all('a')
        subject = links[0].get_text()
        # German number format: '.' groups thousands, ',' is the decimal mark
        amount = text.split('für ', 1)[1].split(' von')[0].replace('.', '').replace(',', '.')
        from_player = text.split('von ', 1)[1].split(' zu')[0]
        to_player = links[1].get_text()
        data.append({'subject': subject, 'amount': amount, 'from_player': from_player, 'to_player': to_player})
df = pd.DataFrame(data)
df.to_csv('output.csv')
Result:
| | subject | amount | from_player | to_player |
|---|---|---|---|---|
| 0 | R. Wolf | 167000 | Computer | Hauke |
| 1 | Weiner | 167000 | Computer | Hauke |
| 2 | Gonther | 770000 | Computer | Christoph |
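The question also asks for semicolon-separated lines and, as an alternative, numeric IDs instead of names. Here is a minimal sketch of that variant using only the standard library. It assumes the ID is always the trailing run of digits in each link's href, as in the sample HTML. The helper names id_from_href and write_rows are made up for illustration, and the hrefs and amounts are typed in from the sample rather than scraped.

```python
import csv
import io
import re


def id_from_href(href):
    """Return the trailing digits of an href, or None if there are none."""
    match = re.search(r"(\d+)$", href)
    return match.group(1) if match else None


def write_rows(rows, handle):
    """Write rows as semicolon-separated lines, without a header row."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerows(rows)


# hrefs copied from the sample HTML; inside the scraping loop they would come
# from i.find_all('a')[0]['href'] and i.find_all('a')[1]['href']
links = [
    ("/2.bundesliga/players/R. Wolf-3378", "/users/514102"),
    ("/2.bundesliga/players/Weiner-3605", "/users/514102"),
    ("/2.bundesliga/players/Gonther-1197", "/users/514096"),
]
amounts = ["167.000", "167.000", "770.000"]

rows = [
    (id_from_href(player), f"wechselt für {amount} von Computer zu", id_from_href(user))
    for (player, user), amount in zip(links, amounts)
]

# An in-memory buffer keeps the sketch side-effect free; for a real file use
# open('output.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

If you would rather stay with pandas, DataFrame.to_csv accepts sep=';' and index=False, which gives the same delimiter and drops the index column.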