
Web scraping news articles

I've been learning Python and BeautifulSoup for the past few months, mainly to scrape news articles for my own research.

However, I have been having difficulty getting the article content printed out cleanly as text from the Chinese website.

Which tag should I use to get the content of the article?

<div class="w980 wbnav clear"><a 
href="http://english.peopledaily.com.cn/" 
target="_blank">English</a>&gt;&gt;</div>
<div class="w980 wb_10 clear">
<h1>DPRK launches ballistic missile 'capable of hitting US 
mainland'</h1>
<div> (<a 



</div>
<div class="wb_12 clear">
<p style="text-align: center;">
<img alt="" src="/NMediaFile/2017/1129/FOREIGN201711291331000220555852915.jpg" style="width: 900px; height: 783px;" /></p>
<p>
The Democratic People’s Republic of Korea (DPRK) has confirmed that it successfully tested a ‘Hwasong 15’ intercontinental ballistic missile (ICBM) on Wednesday.</p>
<p>
A Korean Central News Agency (KCNA) statement, which confirms earlier assessments from the United States and the Republic of Korea (ROK), claims the new type of ICBM "is capable of striking the whole mainland of the US."
</p>
<p>
It was Pyongyang's first test launch since a missile was fired in mid-September, days after its sixth-nuclear test.</p>
<p>
The ICBM was launched at 02:48 local time on Wednesday, according to the KCNA statement, and flew to an altitude of 4,475 km and then a distance of 950 km.</p>
<p>
It was launched from Sain Ni in the DPRK and flew for 53 minutes before splashing down into the Sea of Japan, said Pentagon spokesman Robert Manning.</p>

I opened up the website link ( http://en.people.cn/index.html ) and looked at the articles.

If you just want to scrape a particular article, such as http://en.people.cn/n3/2017/1220/c90000-9306707.html , you can use the following code:

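import requests
from bs4 import BeautifulSoup

url = 'http://en.people.cn/n3/2017/1220/c90000-9306707.html'
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser')

# The div with these classes wraps the headline and the article body.
article = soup.find("div", {"class": "d2p3_left wb_left fl"})

d = {}
d["heading"] = article.find("h2").text

# Join the text of every <p> inside the article container.
p = ''
for item in article.find_all("p"):
    p = p + item.text

# str.replace returns a new string, so keep the result.
d["content"] = p.replace("\t", "")

# Write the heading, then the content, to a UTF-8 text file.
with open('article1.txt', 'w', encoding='utf-8') as f:
    for item in d.values():
        f.write(item)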

I checked other articles as well, and they all seem to use the class d2p3_left wb_left fl on the div tag that contains the actual article content.

So I took the content from this particular tag and put it in a dictionary with the keys 'heading' and 'content' so that it can be formatted later if you want.

Then I wrote all the values of the dictionary to a text file.
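If the extracted text still contains stray tabs and line breaks, something like the following could tidy it before writing. This is only a sketch: clean_paragraphs is a hypothetical helper, not part of BeautifulSoup, and it assumes you pass in the text of each p tag as a list of strings.

```python
import re

def clean_paragraphs(paragraphs):
    """Collapse whitespace runs in each paragraph and drop empty ones."""
    cleaned = []
    for text in paragraphs:
        # re.sub and str.strip return new strings, so the result is reassigned.
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append(text)
    # Separate paragraphs with a blank line so the file stays readable.
    return "\n\n".join(cleaned)

# Made-up sample input standing in for [tag.text for tag in article.find_all("p")].
raw = ["\n\tFirst paragraph.\t", "   ", "Second\n paragraph."]
article_text = clean_paragraphs(raw)
print(article_text)
```

Keeping the paragraphs separate until the final join also preserves the paragraph breaks, which plain string concatenation loses.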

If you want to scrape all the articles from the home page, you can collect their links in a list and loop over it, passing each link to requests.get().
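A rough sketch of the link-collecting step is below. It uses only the standard library so it runs without a network connection. The article_urls helper and the ARTICLE_PATH pattern are my own assumptions: the pattern is guessed from the single example URL above, so it may need adjusting for other sections of the site.

```python
import re
from urllib.parse import urljoin

# Assumption: article pages look like /n3/2017/1220/c90000-9306707.html,
# inferred from the one example URL in this answer.
ARTICLE_PATH = re.compile(r"/n3/\d{4}/\d{4}/c\d+-\d+\.html$")

def article_urls(hrefs, base_url):
    """Resolve hrefs against base_url, keep article-like URLs, drop duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href.strip())
        if ARTICLE_PATH.search(url) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# With BeautifulSoup, the hrefs would come from something like:
#   hrefs = [a["href"] for a in soup.find_all("a", href=True)]
sample_hrefs = [
    "/n3/2017/1220/c90000-9306707.html",
    "http://en.people.cn/n3/2017/1220/c90000-9306707.html",  # same page once resolved
    "/index.html",
]
links = article_urls(sample_hrefs, "http://en.people.cn/index.html")
print(links)
```

Each URL in links can then be passed to requests.get() and parsed the same way as the single article above.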

Hope this helps.
