
Web scraping news articles and keyword search

I have code that fetches the titles of news articles from web pages. A for loop goes over a few news websites and collects the titles from each one. I have also implemented a keyword search that counts the articles whose title contains the word "coronavirus". I want the search to report the number of matching articles for each website separately. Right now the output looks like a single count for all the websites put together. Please help, I have to submit this project shortly. The code follows:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
from newspaper import Article
import requests
URL = ["https://www.timesnownews.com/coronavirus", "https://www.indiatoday.in/coronavirus", "https://www.ndtv.com/coronavirus?pfrom=home-mainnavigation"]
for url in URL:
    parser = 'html.parser'  
    resp = requests.get(url)
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
    
    links = []
    for link in soup.find_all('a', href=True):
        if "javascript" in link["href"]:
            continue
        links.append(link['href'])
            
    count = 0
     
            
    for link in links:
        try:
            article = Article(link)
            article.download()
            article.parse()
            print(article.title)
            if "COVID" in article.title or "coronavirus" in article.title or "Coronavirus" in article.title or "Covid-19" in article.title or "COVID-19" in article.title:
                count += 1
    
        except:
            pass
         
        
print(" number of articles with the word COVID:")
print(count)

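Actually, you are only getting the count for the last site: count is reset to 0 at the start of each iteration of the outer loop and is printed once, after the loop has finished. If you want to get them all, append each count to a list; then you can print the count for each site.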
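To see why only the last site's count survives, here is a minimal sketch that reproduces the same loop structure without any network calls. The site names and titles in titles_by_site are made up for illustration, and the lowercase keyword check is a simplification of the condition in the question.

```python
# Hypothetical stand-in data: titles per site, used instead of real downloads.
titles_by_site = {
    "site-a": ["Coronavirus cases rise", "Cricket score update"],
    "site-b": ["COVID-19 vaccine news", "Coronavirus lockdown extended", "Budget 2020"],
    "site-c": ["Weather forecast"],
}

for site, titles in titles_by_site.items():
    count = 0  # reset for every site, as in the question's code
    for title in titles:
        lowered = title.lower()
        if "coronavirus" in lowered or "covid" in lowered:
            count += 1

# This runs once, after the loop, so `count` holds only the last site's value.
print(count)
```

The counts for site-a and site-b are computed but overwritten before anything prints them.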

First, create an empty list and append the final count at the end of each iteration:

URL = ["https://www.timesnownews.com/coronavirus", "https://www.indiatoday.in/coronavirus",
       "https://www.ndtv.com/coronavirus?pfrom=home-mainnavigation"]
Url_count = []

for url in URL:
    parser = 'html.parser'
    ...
    ...
        except:
            pass

    Url_count.append(count)

Then you can use zip to print the results:

for url, count in zip(URL, Url_count):
    print("Site:", url, "Count:", count)
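As an alternative to two parallel lists, the counts could be kept in a dictionary keyed by site. The sketch below is only an illustration under assumptions: mentions_keyword, count_per_site, KEYWORDS and the sample titles are hypothetical names and data, not part of the original code, and the titles are supplied directly instead of being downloaded.

```python
# Assumed keyword list; matching is case-insensitive, so "covid" also
# covers "COVID", "Covid-19" and "COVID-19".
KEYWORDS = ("coronavirus", "covid")


def mentions_keyword(title, keywords=KEYWORDS):
    """Return True if any keyword occurs in the title, ignoring case."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


def count_per_site(titles_by_site, keywords=KEYWORDS):
    """Map each site to the number of its titles that mention a keyword."""
    return {
        site: sum(mentions_keyword(title, keywords) for title in titles)
        for site, titles in titles_by_site.items()
    }


# Hypothetical sample data standing in for the scraped article titles.
sample = {
    "site-a": ["Coronavirus cases rise", "Cricket score update"],
    "site-b": ["COVID-19 vaccine news", "Coronavirus lockdown extended", "Budget 2020"],
    "site-c": ["Weather forecast"],
}

counts = count_per_site(sample)
for site, count in counts.items():
    print("Site:", site, "Count:", count)
```

In the real scraper, the list of titles for each URL would be built inside the outer loop from article.title and stored under that URL before calling count_per_site.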
