
Web Scraping with Python, Beautiful Soup, and Selenium not working

I am doing a Python exercise that requires me to get the top news from the Google News website by web scraping and print it to the console. While doing it, I just used the Beautiful Soup library to retrieve the news. This was my code:

from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss"
URLObject = urllib.request.urlopen(news_url)
xml_page = URLObject.read()
URLObject.close()

soup_page = BeautifulSoup(xml_page, "html.parser")
news_list = soup_page.findAll("item")

for news in news_list:
  print(news.title.text)
  print(news.link.text)
  print(news.pubDate.text)
  print("-" * 60)

But it kept giving me errors: the 'link' and 'pubDate' fields would not print. After some research, I saw some answers here on Stack Overflow saying that, since the website uses JavaScript, one should use the Selenium package in addition to Beautiful Soup. Despite not really understanding how Selenium works, I updated the code as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver")
driver.maximize_window()
driver.get("https://news.google.com/news/rss")
content = driver.page_source.encode("utf-8").strip()
soup = BeautifulSoup(content, "html.parser")
news_list = soup.findAll("item")

print(news_list)

for news in news_list:
  print(news.title.text)
  print(news.link.text)
  print(news.pubDate.text)
  print("-" * 60)

However, when I run it, a blank browser page opens and this is printed to the console:

 raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
  (Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)

I just tried it, and the following code works for me. (The one-line 'items =' expression in my first version was horrible, apologies in advance; the edit below replaces it.)

EDIT: I just updated the snippet; you can use ElementTree.iter('tag') to iterate over all the nodes with that tag:

import urllib.request
import xml.etree.ElementTree

news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Get the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
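For context on why the original html.parser approach failed: Beautiful Soup's html.parser builder treats <link> as a void (self-closing) HTML element, so the link text falls outside the tag, and it lowercases tag names, so news.pubDate finds nothing. Parsing the feed as XML avoids both problems. Here is a self-contained sketch of the same iteration run against an inline RSS snippet (the items are made-up sample data, not real feed output):

```python
import xml.etree.ElementTree as ET

# Tiny hand-written RSS snippet standing in for the live feed
# (titles/links here are made-up sample data).
rss = """<rss version="2.0"><channel>
<item>
  <title>First headline</title>
  <link>https://example.com/a</link>
  <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Second headline</title>
  <link>https://example.com/b</link>
  <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
</item>
</channel></rss>"""

root = ET.fromstring(rss)
for item in root.iter('item'):
    # An XML parser preserves tag case (pubDate) and does not
    # treat <link> as a self-closing HTML tag, so .text survives.
    print(item.find('title').text)
    print(item.find('link').text)
    print(item.find('pubDate').text, '\n')
```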

EDIT2: A discussion of my personal library preferences for scraping.

Personally, for interactive/dynamic pages on which I have to do stuff (click here, fill in a form, obtain results, ...), I use Selenium, and I usually don't need bs4, since you can use Selenium directly to find and parse the specific nodes of the page you are looking for.
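As a rough sketch of that "Selenium directly, no bs4" approach: the helper below collects headline text and links straight from the driver. It is duck-typed so it works with any object exposing Selenium's find_elements API, and the default 'h3 a' CSS selector is purely an assumption about the page's markup, not something verified against Google News:

```python
def extract_headlines(driver, selector="h3 a"):
    """Collect (text, href) pairs straight from the driver, no bs4 needed.

    Works with any object exposing Selenium's find_elements(by, value)
    API; the default CSS selector is a guess at the page's markup.
    """
    headlines = []
    # "css selector" is the string behind Selenium's By.CSS_SELECTOR
    for el in driver.find_elements("css selector", selector):
        headlines.append((el.text, el.get_attribute("href")))
    return headlines

# Typical Selenium usage (not run here, requires a browser):
#   driver = webdriver.Chrome()
#   driver.get("https://news.google.com")
#   for text, href in extract_headlines(driver):
#       print(text, href)
```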

I use bs4 in conjunction with requests (instead of urllib.request) to parse more static webpages, in projects where I don't want a whole webdriver installed.

There is nothing wrong with using urllib.request, but requests (see its documentation) is one of the best Python packages out there (in my opinion) and is a great example of how to create a simple yet powerful API.

Simply use BeautifulSoup with requests .

from bs4 import BeautifulSoup
import requests

r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')

# do whatever you need with news_list
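With that news_list in hand, each item exposes its child tags just like in the original attempt, except that the XML parser keeps link and pubDate intact. A sketch of the printing loop, run against an inline snippet here so it does not depend on the live feed (note that bs4's 'xml' parser requires lxml to be installed):

```python
from bs4 import BeautifulSoup

# Inline stand-in for r.text so the example is self-contained
# (made-up sample data, not real Google News output).
sample = """<rss><channel>
<item><title>First</title><link>https://example.com/a</link>
<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
</channel></rss>"""

soup = BeautifulSoup(sample, 'xml')  # the 'xml' parser requires lxml
news_list = soup.find_all('item')

for news in news_list:
    # With an XML parser, link and pubDate keep their text and case
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print('-' * 60)
```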
