Web scraping: simplify the code using a for loop inside a dictionary
I've built some code to scrape a few Brazilian news pages, but I now realize it's probably unnecessarily long. I tried to simplify things with a for loop, but I can't get the same result. There's no need to show me completely new code, since I'm still learning. But I'd appreciate it if someone could give me a hint.
Original code:
import re
import requests
from bs4 import BeautifulSoup as bs

#web scraping: Yahoo Finanças
url_yf = "https://br.financas.yahoo.com/"
page_yf = requests.get(url_yf)
page_content_yf = page_yf.content
soup_yf = bs(page_content_yf, "lxml")
news_yf = soup_yf.find_all("h3", {"class" : "Mb(5px)"})
titles_yf = [i.text for i in news_yf]
#web scraping: Estadão
url_est = "https://www.estadao.com.br/ultimas"
page_est = requests.get(url_est)
page_content_est = page_est.content
soup_est = bs(page_content_est, "lxml")
news_est = soup_est.find_all("section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"})
titles_est = [i.text for i in news_est]
#web scraping: Folha de São Paulo
url_fsp = "https://www1.folha.uol.com.br/ultimas-noticias/"
page_fsp = requests.get(url_fsp)
page_content_fsp = page_fsp.content
soup_fsp = bs(page_content_fsp, "html.parser")
news_fsp = soup_fsp.find_all("div", {"class" : "c-headline__content"})
titles_fsp = str([i.text for i in news_fsp]).strip('[]')
titles_fsp_clean = re.sub(r'\\n', ' ', titles_fsp)
#web scraping: Valor Econômico
url_ve = "https://valor.globo.com/ultimas-noticias/"
page_ve = requests.get(url_ve)
page_content_ve = page_ve.content
soup_ve = bs(page_content_ve, "lxml")
news_ve = soup_ve.find_all("div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"})
titles_ve = [i.text for i in news_ve]
#web scraping: InfoMoney
url_im = "https://www.infomoney.com.br/ultimas-noticias/"
page_im = requests.get(url_im)
page_content_im = page_im.content
soup_im = bs(page_content_im, "lxml")
news_im = soup_im.find_all("span", {"class" : "hl-title hl-title-2"})
titles_im = [i.text for i in news_im]
#cleaning titles
list_of_titles = [titles_yf, titles_est, titles_fsp_clean, titles_ve, titles_im]
titles = [title for title in list_of_titles for title in title]
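An equivalent, arguably clearer way to flatten a list of lists is itertools.chain.from_iterable, which avoids reusing the same loop variable name twice in the comprehension. A minimal sketch with dummy data in place of the scraped titles:

```python
from itertools import chain

# Dummy stand-ins for titles_yf, titles_est, etc.
list_of_titles = [["a", "b"], ["c"], ["d", "e"]]

# chain.from_iterable concatenates the sub-lists in order
titles = list(chain.from_iterable(list_of_titles))
print(titles)  # ['a', 'b', 'c', 'd', 'e']
```

Note that either way, an entry like titles_fsp_clean is a plain string rather than a list, so flattening iterates over it character by character.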
The one I'm working on now:
classes = ["h3", {"class" : "Mb(5px)"}, "section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"}, "div", {"class" : "c-headline__content"}, "div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"}, "span", {"class" : "hl-title hl-title-2"}]
url = {"https://br.financas.yahoo.com/": classes[0], "https://www.estadao.com.br/ultimas": classes[1], "https://www1.folha.uol.com.br/ultimas-noticias/": classes[2], "https://valor.globo.com/ultimas-noticias/": classes[3], "https://www.infomoney.com.br/ultimas-noticias/": classes[4]}
for (key, value) in url.items():
    page = requests.get(key)
    page_content = page.content
    soup = bs(page_content, "html.parser")
    news = soup.find_all(value)
    titles = [i.text for i in news]
    cleaning = str(titles).strip('[]')
    cleaning2 = re.sub(r'\\n', ' ', cleaning)
My idea was to loop over a dictionary (I had to build a separate list of classes because I was having trouble with the quotes inside the dictionary) so that I could use the exact value associated with each URL. I'm still a Python newbie, but this seemed possible to me. Unfortunately, I can't get the same result.
You can keep each tag/class pair as a sub-list like ["h3", {"class" : "Mb(5px)"}], either in the list classes or directly in url:
url = {
"https://br.financas.yahoo.com/": ["h3", {"class" : "Mb(5px)"}],
"https://www.estadao.com.br/ultimas": ["section", {"class" : "col-md-6 col-sm-6 col-xs-12 col-margin"}],
"https://www1.folha.uol.com.br/ultimas-noticias/": ["div", {"class" : "c-headline__content"}],
"https://valor.globo.com/ultimas-noticias/": ["div", {"class" : "feed-post-body-title gui-color-primary gui-color-hover"}],
"https://www.infomoney.com.br/ultimas-noticias/": ["span", {"class" : "hl-title hl-title-2"}],
}
Then you can use value[0] and value[1]:
news = soup.find_all(value[0], value[1])
or even unpack value into the find_all() arguments using *:
news = soup.find_all(*value)
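To see the unpacking in action without hitting the live sites, here is a self-contained sketch that uses an inline HTML snippet in place of page.content (the tag and class are taken from the Yahoo Finanças selector above):

```python
from bs4 import BeautifulSoup as bs

# Static HTML stands in for a downloaded page so the sketch runs offline
html = ('<h3 class="Mb(5px)">First</h3>'
        '<h3 class="Mb(5px)">Second</h3>'
        '<h3 class="other">Skip</h3>')
soup = bs(html, "html.parser")

value = ["h3", {"class": "Mb(5px)"}]
news = soup.find_all(*value)  # same as soup.find_all("h3", {"class": "Mb(5px)"})
titles = [i.text for i in news]
print(titles)  # ['First', 'Second']
```

find_all(name, attrs) takes the tag name as its first positional argument and the attribute dict as its second, which is exactly what *value expands into.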