
newspaper3k - get articles from HTML instead of URL

Original 2021-07-13 10:34:21 4 1 python/ parsing/ web-scraping/ scrapy/ newspaper3k

I'm using newspaper3k inside a Scrapy parse method. I want to extract links, but I don't want to fetch the website again.

Is it possible to use this:

newspaper.build(..)

with plain HTML so that I can then call .articles?

1 answer

I found this solution:

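import httpx

from newspaper import Article

async def get_article(url):
    # Fetch the page once; AsyncClient must be entered with "async with"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    # Hand the already-downloaded HTML to newspaper instead of calling download()
    article = Article(url)
    article.set_html(response.text)
    article.parse()
    return article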
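
The snippet above parses a single article from HTML that is already in hand, but it does not produce the list of links that .articles would give. As a rough sketch only, and using the standard library rather than newspaper3k's own article detection, candidate links could be collected from the already-fetched HTML like this. The names LinkCollector and extract_same_site_links are hypothetical helpers, and the same-host filter is an assumption about which links count as articles.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute href values from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def extract_same_site_links(html, base_url):
    """Return de-duplicated links on the same host as base_url, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = set()
    result = []
    for link in collector.links:
        # Assumption: only links on the same host are article candidates
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Inside a Scrapy parse method this would presumably be called as extract_same_site_links(response.text, response.url). Each returned URL still needs its own HTML before it can be passed through Article(url), set_html(...) and parse() as in the answer.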

How to access cached articles in newspaper3k

Get more article URLs from a news source with newspaper3k?

Newspaper3k returns 0 articles from archive.org waybackmachine pages whereas the live page works as expected

Web scraping with Newspaper3k, got only 50 articles

How to use Newspaper3k library without downloading articles?

Newspaper3k filter out bad URL while extracting

Shortcomings of Newspaper3k: How to Scrape ONLY Article HTML? Python

How to stop python newspaper3k from returning null values?

Why the python module newspaper3k only return 0 articles for tencent, sina and wallstreetcn?

Newspaper3k: Any way to download multiple web articles to one variable?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.



粤ICP备18138465号 © 2020-2024 STACKOOM.COM