newspaper (Python): get all CNN news URLs
For example, take this URL ( https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555 ).
In the HTML document I can find this link (HTML tag):
<div class="cnn-search__result-thumbnail">
  <a href="https://www.cnn.com/2018/03/27/asia/north-korea-kim-jong-un-china-visit/index.html">
    <img src="./Search CNN - Videos, Pictures, and News - CNN.com_files/180328104116china-xi-kim-story-body.jpg">
  </a>
But with this code:
import newspaper

url = "https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555"
cnn_paper = newspaper.build(url, memoize_articles=False)
for article in cnn_paper.articles:
    print(article.url)
I cannot find the news links. These two URLs
https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555
https://edition.cnn.com/search/?q=%20news&size=10&from=5550&page=556
both give me the same links.
Is this what you want?
from bs4 import BeautifulSoup
import urllib.request

url = "https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555"
resp = urllib.request.urlopen(url)
# Decode the page with the charset declared in the HTTP headers, if any
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
# Print the target of every <a> tag that has an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])
Or maybe this?
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555")
# Trust the HTTP encoding only when the Content-Type header declares a charset
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
# Prefer an encoding declared inside the HTML itself (e.g. a <meta> tag)
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, "html.parser", from_encoding=encoding)
# Print the target of every <a> tag that has an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])
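The loops above print every link on the page, including navigation and footer links. A minimal sketch of one way to keep only article links follows. It assumes that CNN article URLs contain a /YYYY/MM/DD/ date segment, as the example link in the question does; that pattern is a guess from one URL, not a documented rule. The names ARTICLE_PATH and filter_article_urls are hypothetical.

```python
import re

# Assumed pattern: a /YYYY/MM/DD/ segment somewhere in the URL, as in
# https://www.cnn.com/2018/03/27/asia/.../index.html from the question.
ARTICLE_PATH = re.compile(r"/\d{4}/\d{2}/\d{2}/")


def filter_article_urls(hrefs):
    """Return the hrefs that look like article URLs, without duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    result = []
    for href in hrefs:
        if ARTICLE_PATH.search(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

With a parsed page this could be called as filter_article_urls(link['href'] for link in soup.find_all('a', href=True)). It only helps if the article links are actually present in the HTML that the server returns.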
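The search results are rendered dynamically from a JSON document that is fetched by a separate request: https://search.api.cnn.io/content?q=news&size=50&from=0
The size parameter can be at most 50.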
import requests

res = requests.get("https://search.api.cnn.io/content?q=news&size=50&from=0")
links = [x['url'] for x in res.json()['result']]
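Since size is capped at 50, collecting more links means repeating the request with a growing from offset. The following is a rough sketch of that idea, not a tested client for this API. It uses only the standard library in place of requests, and the helper names (build_search_url, extract_urls, collect_links) are hypothetical. It assumes the response keeps the 'result' list with a 'url' field as shown above, and that an empty page means there are no more results.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://search.api.cnn.io/content"  # endpoint quoted in the answer
MAX_SIZE = 50  # the answer states that size can be at most 50


def build_search_url(query, offset, size=MAX_SIZE):
    """Return the API URL for one page of results starting at `offset`."""
    params = {"q": query, "size": min(size, MAX_SIZE), "from": offset}
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_urls(payload):
    """Pull the 'url' field out of each entry in the 'result' list."""
    return [item["url"] for item in payload.get("result", []) if "url" in item]


def collect_links(query, max_pages=3):
    """Fetch up to `max_pages` pages and return the article URLs found.

    Stops early on an empty page, which is assumed (not confirmed)
    to mean that the results are exhausted.
    """
    links = []
    for page in range(max_pages):
        url = build_search_url(query, offset=page * MAX_SIZE)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        page_links = extract_urls(payload)
        if not page_links:
            break
        links.extend(page_links)
    return links
```

Calling collect_links("news", max_pages=2) would request offsets 0 and 50. Keeping max_pages small and pausing between requests is advisable when calling an endpoint like this repeatedly.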