
Web scraping from news articles

I have been trying to extract the links from a given news website. I've found code that works really well, but the only issue is that it outputs "javascript:void();" along with all the other links. Please let me know what changes I can make so that "javascript:void();" doesn't appear in the output. The following is the code:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

Those "javascript:void();" entries come from anchor tags the page uses as JavaScript click targets rather than real links. If you don't want them, just filter them out.

Here's how:

import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")

http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding

soup = BeautifulSoup(resp.content, 'html.parser', from_encoding=encoding)

for link in soup.find_all('a', href=True):
    if link['href'] != "javascript:void();":  # skip the JavaScript placeholder anchors
        print(link['href'])
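If the site ever uses other variants such as "javascript:void(0);", an exact string comparison will miss them. A slightly more general sketch filters out anything with a `javascript:` scheme and also resolves relative hrefs against the page URL with `urljoin`; the HTML snippet below is a made-up stand-in for the fetched page, not NDTV's actual markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup standing in for resp.content from the real request.
html = """
<a href="javascript:void();">Menu</a>
<a href="/coronavirus/article-1">Story</a>
<a href="https://www.ndtv.com/live">Live</a>
"""

base_url = "https://www.ndtv.com/coronavirus"
soup = BeautifulSoup(html, "html.parser")

links = [
    urljoin(base_url, a["href"])                 # make relative hrefs absolute
    for a in soup.find_all("a", href=True)
    if not a["href"].startswith("javascript:")   # drop any javascript: pseudo-link
]
print(links)
```

This prints only the two real URLs, with the relative one expanded to an absolute address.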
