
Extract image using Newspaper from HTML

I can't download the article the usual way to instantiate an Article object, like this:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.top_image

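However, I can get the HTML with requests. Can I take this raw HTML and somehow pass it to Newspaper so that it extracts the image from it? (My attempt is below, but it does not work.) Thanks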

from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image

The Python module Newspaper allows the use of proxies, but this feature is not listed in the module's documentation.


Proxy with Newspaper

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
           'http': "http://ip_address:port_number",
           'https': "https://ip_address:port_number"
          }

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
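
The PROXIES mapping above follows the convention used by requests: each key is the scheme of the target URL and each value is the URL of the proxy itself. A minimal sketch of building that mapping and attaching it to a Newspaper configuration might look like the following. `build_proxies` and `make_config` are hypothetical helper names, not part of Newspaper, and the sketch assumes one proxy server handles both HTTP and HTTPS traffic.

```python
def build_proxies(host, port, scheme="http"):
    """Return a requests-style proxies mapping for a single proxy server.

    Assumption: the same proxy URL is used for http and https targets.
    """
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def make_config(proxies):
    """Attach a proxies mapping to a Newspaper Configuration object."""
    # Imported lazily so this module loads even where newspaper3k is absent.
    from newspaper.configuration import Configuration

    config = Configuration()
    config.proxies = proxies
    return config


# Usage (host and port are placeholders):
# config = make_config(build_proxies("ip_address", "port_number"))
# article = Article(url, config=config)
```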

Requests with a proxy, plus Newspaper

import requests
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
# `proxy` is a requests-style proxies mapping, such as the PROXIES dict shown above
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
# hand the already-fetched HTML to download() instead of letting Newspaper fetch the URL
article.download(raw_html.content)
article.parse()
print(article.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
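
The attempt in the question most likely fails for two reasons: it passes the whole response object to set_html rather than the HTML it contains, and it never calls parse(), so top_image stays empty. A small sketch that guards against both follows. `coerce_html` and `top_image_from_html` are hypothetical names, and the sketch assumes the `input_html` keyword of `Article.download` as found in newspaper3k.

```python
def coerce_html(source):
    """Return raw HTML from a str, bytes, or a requests-style response object."""
    if isinstance(source, (str, bytes)):
        return source
    text = getattr(source, "text", None)  # requests.Response exposes .text
    if isinstance(text, str):
        return text
    raise TypeError(
        f"expected HTML or a response object, got {type(source).__name__}"
    )


def top_image_from_html(source, url=""):
    """Parse already-fetched HTML with Newspaper and return its top image URL."""
    # Imported lazily; requires `pip3 install newspaper3k`.
    from newspaper import Article

    article = Article(url)
    # Use the supplied HTML instead of having Newspaper fetch the URL.
    article.download(input_html=coerce_html(source))
    # top_image is only populated once parse() has run.
    article.parse()
    return article.top_image


# Usage (assumes `url` and `proxy` are defined as in the snippet above):
# response = requests.get(url, verify=False, proxies=proxy)
# print(top_image_from_html(response, url=url))
```

Passing the original URL along is optional here; it may help Newspaper resolve image paths that are relative to the page.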

First, make sure you are using python3 and that you have already run pip3 install newspaper3k

Then, if you run into SSL warnings with the first version (like the one below)

/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(

you can disable them by adding

import urllib3
urllib3.disable_warnings()

This should work:

from newspaper import Article
import urllib3
urllib3.disable_warnings()


url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)

Run it with python3 <yourfile>.py
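
Calling urllib3.disable_warnings() with no arguments silences urllib3's HTTP warnings in general. If only the unverified-HTTPS message should be hidden, one alternative is a filter from the standard library's warnings module that matches on the message text. This is a sketch rather than the answer's method; `silence_insecure_request_warning` is a hypothetical name, and it assumes the warning text starts with "Unverified HTTPS request" as in the output shown above.

```python
import warnings


def silence_insecure_request_warning():
    """Ignore only warnings whose message starts with 'Unverified HTTPS request'."""
    # The message argument is a regular expression matched against the
    # beginning of the warning text; other warnings are left untouched.
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


# Usage: call once before making requests with verify=False.
# silence_insecure_request_warning()
```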


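Setting the HTML on the article yourself won't do you much good, because the other fields will not be populated that way. Let me know whether this fixes the problem, or if any other errors come up!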
