
Webscraping: simplify the code using a for loop inside a dictionary

I've built a script that scrapes some Brazilian news pages, but I now realize it is probably unnecessarily long. I tried to simplify things with a for loop, but I can't reproduce the same result. There's no need to show me a whole new solution, since I'm still learning. But I'd appreciate it if someone could give me a hint.

Original code:

import re
import requests
from bs4 import BeautifulSoup as bs

#web scraping: Yahoo Finanças
url_yf = "https://br.financas.yahoo.com/"
page_yf = requests.get(url_yf)
page_content_yf = page_yf.content
soup_yf = bs(page_content_yf, "lxml")
news_yf = soup_yf.find_all("h3", {"class" : "Mb(5px)"})
titles_yf = [i.text for i in news_yf]

#web scraping: Estadão 
url_est = "https://www.estadao.com.br/ultimas"
page_est = requests.get(url_est)
page_content_est = page_est.content
soup_est = bs(page_content_est, "lxml")
news_est = soup_est.find_all("section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"})
titles_est = [i.text for i in news_est]

#web scraping: Folha de São Paulo
url_fsp = "https://www1.folha.uol.com.br/ultimas-noticias/"
page_fsp = requests.get(url_fsp)
page_content_fsp = page_fsp.content
soup_fsp = bs(page_content_fsp, "html.parser")
news_fsp = soup_fsp.find_all("div", {"class" : "c-headline__content"})
titles_fsp = str([i.text for i in news_fsp]).strip('[]')
titles_fsp_clean = re.sub(r'\\n', ' ', titles_fsp)

#web scraping: Valor Econômico
url_ve = "https://valor.globo.com/ultimas-noticias/"
page_ve = requests.get(url_ve)
page_content_ve = page_ve.content
soup_ve = bs(page_content_ve, "lxml")
news_ve = soup_ve.find_all("div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"})
titles_ve = [i.text for i in news_ve]

#web scraping: InfoMoney
url_im = "https://www.infomoney.com.br/ultimas-noticias/"
page_im = requests.get(url_im)
page_content_im = page_im.content
soup_im = bs(page_content_im, "lxml")
news_im = soup_im.find_all("span", {"class" : "hl-title hl-title-2"})
titles_im = [i.text for i in news_im]

#cleaning titles
list_of_titles = [titles_yf, titles_est, titles_fsp_clean, titles_ve, titles_im]
titles = [title for title in list_of_titles for title in title]

What I'm trying now:

classes = [
    "h3", {"class" : "Mb(5px)"},
    "section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"},
    "div", {"class" : "c-headline__content"},
    "div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"},
    "span", {"class" : "hl-title hl-title-2"},
]

url = {
    "https://br.financas.yahoo.com/": classes[0],
    "https://www.estadao.com.br/ultimas": classes[1],
    "https://www1.folha.uol.com.br/ultimas-noticias/": classes[2],
    "https://valor.globo.com/ultimas-noticias/": classes[3],
    "https://www.infomoney.com.br/ultimas-noticias/": classes[4],
}

for (key, value) in url.items():
    page = requests.get(key)
    page_content = page.content
    soup = bs(page_content, "html.parser")
    news = soup.find_all(value)
    titles = [i.text for i in news]

cleaning = str(titles).strip('[]')
cleaning2 = re.sub(r'\\n', ' ', cleaning)

My idea was to loop over a dictionary (I had to build a separate list of classes because I had trouble with the quotation marks inside the dictionary) so that each URL is paired with its exact find_all arguments. I'm still a Python noob, but this seems doable to me. Unfortunately, I can't reproduce the same result.

You can use a sublist like ["h3", {"class" : "Mb(5px)"}] in the list classes, or directly in url:

url = {
   "https://br.financas.yahoo.com/": ["h3", {"class" : "Mb(5px)"}], 
   "https://www.estadao.com.br/ultimas": ["section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"}], 
   "https://www1.folha.uol.com.br/ultimas-noticias/": ["div", {"class" : "c-headline__content"}], 
   "https://valor.globo.com/ultimas-noticias/": ["div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"}], 
   "https://www.infomoney.com.br/ultimas-noticias/": ["span", {"class" : "hl-title hl-title-2"}],
}

Then you can use value[0] and value[1]:

news = soup.find_all(value[0], value[1])

Or even use * to unpack value as the find_all() arguments:

news = soup.find_all(*value)
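To see the unpacking idea in action without hitting the network, here is a minimal sketch using made-up HTML snippets in place of the real pages (the class names echo the ones in your dictionary, but the page content is invented for illustration). Note that it also extends one shared titles list inside the loop, instead of overwriting titles on every iteration as in your current version:

```python
from bs4 import BeautifulSoup as bs

# Fake pages standing in for the real URLs: each entry maps a name to
# (html, value), where value is the [tag, attrs] sublist for find_all.
fake_pages = {
    "site-a": ('<h3 class="Mb(5px)">Headline A1</h3>'
               '<h3 class="Mb(5px)">Headline A2</h3>',
               ["h3", {"class": "Mb(5px)"}]),
    "site-b": ('<span class="hl-title hl-title-2">Headline B1</span>',
               ["span", {"class": "hl-title hl-title-2"}]),
}

titles = []  # accumulate across all sites instead of overwriting each loop
for name, (html, value) in fake_pages.items():
    soup = bs(html, "html.parser")
    news = soup.find_all(*value)  # same as soup.find_all(value[0], value[1])
    titles.extend(i.text for i in news)

print(titles)  # ['Headline A1', 'Headline A2', 'Headline B1']
```

With the real dictionary, the loop body would first fetch the page with requests.get(key) and parse page.content, then call find_all(*value) exactly as above.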
