How to extract links from a website and extract their content in web scraping using Python

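I am trying to extract data from a website using BeautifulSoup and the requests package. I want to extract the links and their content.

So far I can extract the list of links that exist on the defined URL, but I don't know how to visit each link and extract its text.

The image below illustrates my problem: [image]

The text and the image are links to the whole article.

Code: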

import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features = "lxml")

print(soup.prettify())

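# Scrape fields such as title, link, and publication date
# NOTE: `news` is never defined in this snippet; it presumably holds the article containers selected from `soup`
for index,new in enumerate(news):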
    published_date = new.find('span',class_="article__time-stamp").text
    title = new.find('h3',class_="article__title").text
    link = new.find('a',class_="article__link").attrs['href']
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")

Result:

publish_date: 
                                                        06-10-2020 | 20:53
                                                    
 title: 

                                                        18 killed in bombing in Turkish-controlled Syrian town
                                                    

 link: https://www.annahar.com/english/section/186-mena/06102020061027020

My question is: how do I continue from here to visit each link and extract its content?

Expected result:

publish_date: 
                                                        06-10-2020 | 20:53
                                                    
 title: 

                                                        18 killed in bombing in Turkish-controlled Syrian town
                                                    

 link: https://www.annahar.com/english/section/186-mena/06102020061027020

description:

ANKARA: An explosives-laden truck ignited Tuesday on a busy street in a northern #Syrian town controlled by #Turkey-backed opposition fighters, killing at least 18 people and wounding dozens, Syrian opposition activists reported.

The blast in the town of al-Bab took place near a bus station where people often gather to travel from one region to another, according to the opposition’s Civil Defense, also known as White Helmets.

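The description is located on the page that the link points to.

You have to collect all the follow links to the articles, then loop over them and grab the parts you are interested in.

Here's how: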

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.annahar.com/english/section/186-mena").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("a", class_="article__link")
    if "#" not in a["href"]
]

for link in follow_links:
    s = BeautifulSoup(requests.get(link).content, "lxml")
    date_published = s.find("span", class_="date").getText(strip=True)
    title = s.find("h1", class_="article-main__title").getText(strip=True)
    article_body = s.find("div", {"id": "bodyToAddTags"}).getText()

    print(f"{date_published} {title}\n\n{article_body}\n", "-" * 80)
    time.sleep(2)

Output (shortened for brevity):

08-10-2020 | 12:35 Iran frees rights activist after more than 8 years in prison

TEHRAN: Iran has released a prominent human rights activist who campaigned against the death penalty, Iranian media reported Thursday.The semiofficial ISNA news agency quoted judiciary official Sadegh Niaraki as saying that Narges Mohammadi was freed late Wednesday after serving 8 1/2 years in prison. She was sentenced to 10 years in 2016 while already incarcerated.Niaraki said Mohammadi was released based on a law that allows a prison sentence to be commutated if the related court agrees.In July, rights group Amnesty International demanded Mohammadi’s immediate release because of serious pre-existing health conditions and showing suspected COVID-19 symptoms. The Thursday report did not refer to her possible illness.Mohammadi was sentenced in Tehran’s Revolutionary Court on charges including planning crimes to harm the security of Iran, spreading propaganda against the government and forming and managing an illegal group.She was in a prison in the northwestern city of Zanjan, some 280 kilometers (174 miles) northwest of the capital Tehran.Mohammadi was close to Iranian Nobel Peace Prize laureate Shirin Ebadi, who founded the banned Defenders of Human Rights Center. Ebadi left Iran after the disputed re-election of then-President Mahmoud Ahmadinejad in 2009, which touched off unprecedented protests and harsh crackdowns by authorities.In 2018, Mohammadi, an engineer and physicist, was awarded the 2018 Andrei Sakharov Prize, which recognizes outstanding leadership or achievements of scientists in upholding human rights.
 --------------------------------------------------------------------------------
...
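
The sentences in the output above run together ("Thursday.The semiofficial") because `getText()` joins the text of adjacent child tags with no separator. A minimal sketch of one way around this, assuming the article body is split into child tags such as `<p>`; the sample HTML is a made-up stand-in, not the site's real markup, and `html.parser` is used only so the sketch needs no extra parser:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for an article page; the real markup may differ.
sample_html = (
    '<div id="bodyToAddTags">'
    "<p>TEHRAN: First sentence.</p>"
    "<p>Second paragraph.</p>"
    "</div>"
)

body = BeautifulSoup(sample_html, "html.parser").find("div", {"id": "bodyToAddTags"})

# Default behaviour: text nodes are concatenated with nothing between them.
joined = body.get_text()

# With a separator, each text node ends up on its own line.
separated = body.get_text(separator="\n", strip=True)

print(joined)
print(separated)
```

If the paragraphs should stay on one line, `separator=" "` works the same way.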

Add an extra request to your loop that goes to the article page and grabs the description there.

page = requests.get(link)
soup = BeautifulSoup(page.content, features = "lxml")

description = soup.select_one('div.articleMainText').get_text()
print(f" description: {description}")

Example

import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features = "lxml")

# print(soup.prettify())

# Scrape fields such as title, link, and publication date
for index,new in enumerate(soup.select('div#listingDiv44083 div.article')):
    published_date = new.find('span',class_="article__time-stamp").get_text(strip=True)
    title = new.find('h3',class_="article__title").get_text(strip=True)
    link = new.find('a',class_="article__link").attrs['href']

    # Fetch the article page itself and read the description from it
    page = requests.get(link)
    article_soup = BeautifulSoup(page.content, features = "lxml")

    description = article_soup.select_one('div.articleMainText').get_text()
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
    print(f" description: {description}", '\n')
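
Both `find` and `select_one` return `None` when nothing matches, so the loops above raise `AttributeError` as soon as one listing entry lacks a field. The following is a rough sketch of a more defensive listing parser, not a definitive implementation. The helper names `text_or_none` and `parse_listing`, the sample HTML, and the `/1` path are all hypothetical; only the class names come from the code above. It also resolves relative links with `urljoin`, on the assumption that the site might emit them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


def parse_listing(html, base_url):
    """Collect date, title, and absolute link for each article block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("div.article"):
        anchor = article.find("a", class_="article__link")
        # Skip entries that have no usable link instead of crashing.
        if anchor is None or not anchor.get("href"):
            continue
        rows.append({
            "published_date": text_or_none(article, "span", "article__time-stamp"),
            "title": text_or_none(article, "h3", "article__title"),
            "link": urljoin(base_url, anchor["href"]),
        })
    return rows


# Made-up listing markup for illustration; the second block has no link.
sample_html = """
<div class="article">
  <span class="article__time-stamp"> 06-10-2020 | 20:53 </span>
  <h3 class="article__title"> Sample headline </h3>
  <a class="article__link" href="/english/section/186-mena/1">read</a>
</div>
<div class="article"><h3 class="article__title">No link here</h3></div>
"""

rows = parse_listing(sample_html, "https://www.annahar.com/english/section/186-mena")
print(rows)
```

Each returned `link` can then be passed to `requests.get` exactly as in the loop above, ideally with a `timeout` argument and a short pause between requests.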

