
How to scrape a JavaScript website in Python?

I am trying to scrape a website. I have tried two methods, but neither gives me the full page source that I am looking for. I want to scrape the news titles from the URL below.

URL: "https://www.todayonline.com/"

These are the two methods I have tried, both of which failed.

Method 1: Beautiful Soup

import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page, "html.parser")
soup  # Returns HTML that contains JavaScript text
soup.find_all('h3')

### Returns an empty list []

Method 2: Selenium + BeautifulSoup

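import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"

options = Options()
options.add_argument("--headless")

# Assumes chromedriver is available on PATH
driver = webdriver.Chrome(options=options)

driver.get(tdy_url)
time.sleep(10)  # crude fixed wait for the JavaScript to load content
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
soup.find_all('h3')

### Returns fewer than a quarter of the 'h3' tags found in the original page source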

Please help. I have tried scraping other news websites and it is so much easier. Thank you.

You can access the data via the site's API (check the Network tab in your browser's developer tools).


For example,

import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR, short for XMLHttpRequest). It happens dynamically, while the page is loading or being scrolled, so this data is not included in the page returned by the server.

In the first example, you get only the page returned by the server: without the news, but with the JS that is supposed to fetch it. Neither requests nor BeautifulSoup can execute JS.
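To make the point above concrete, here is a minimal sketch that counts tags in static HTML, the way any non-JS parser sees it. It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages. SHELL_HTML is a made-up stand-in for a JS-driven page, not the actual response from the site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags with a given name in static HTML."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name):
    """Return how many <tag_name> elements appear in the given HTML text."""
    counter = TagCounter(tag_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical page shell: an empty container plus a script that would
# fetch the headlines later. No parser-only tool ever runs that script.
SHELL_HTML = """
<html><body>
  <div id="app"></div>
  <script>fetch("/api/v3/news_feed/7").then(render);</script>
</body></html>
"""

print(count_tags(SHELL_HTML, "h3"))  # prints 0
```

The headlines only exist after a browser executes the script, which is why find_all('h3') comes back empty on the raw response.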

However, you can use Python's requests library to reproduce the requests that fetch the news titles from the server. Follow these steps:

  1. Open your browser's DevTools (usually by pressing F12 or Ctrl + Shift + I) and look at the requests that fetch the news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup.

  2. Copy the request link (right-click -> Copy -> Copy link) and pass it to requests.get(...).

  3. Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I would recommend using pprint instead of plain print. Note that you have to run from pprint import pprint before using it.
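As a small illustration of the pprint tip, the sketch below limits the nesting depth so that a large response stays readable. SAMPLE is a hypothetical payload shaped like the structure used in this answer (a "nodes" list of {"node": {...}} items); the "extra" key is invented purely to show deeper nesting.

```python
from pprint import pformat, pprint

# Hypothetical sample; the real API response will contain many more fields.
SAMPLE = {
    "nodes": [
        {"node": {"title": "First headline", "extra": {"a": 1}}},
        {"node": {"title": "Second headline", "extra": {"a": 2}}},
    ]
}


def summarize(payload, depth=2):
    """Return a shallow, readable view of a nested JSON-like object."""
    # Anything nested deeper than `depth` is replaced with "...".
    return pformat(payload, depth=depth)


# Start shallow to see the top-level keys, then increase depth as needed.
pprint(SAMPLE, depth=2)
print(summarize(SAMPLE, depth=4))
```

With a real response you would pass the result of .json() instead of SAMPLE.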

Here is an example of the code that gets the titles from the main news on the page:

import requests


nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
        .json()["nodes"]
for node in nodes:
    print(node["node"]["title"])

If you want to scrape a group of news items under a particular caption, change the number after news_feed/ in the request URL (to find it, filter the requests by "news_feed" in DevTools and scroll down the news page).
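If you need several groups, the idea can be sketched as a loop over feed numbers. This is only a sketch under assumptions: the response shape is taken from the example above, feed numbers other than 7 are placeholders you would have to look up in DevTools, and the helper names (extract_titles, fetch_json, collect_titles) are made up for illustration.

```python
API_TEMPLATE = "https://www.todayonline.com/api/v3/news_feed/{feed_id}"


def extract_titles(payload):
    """Pull titles out of a payload shaped like {"nodes": [{"node": {"title": ...}}]}."""
    titles = []
    for item in payload.get("nodes", []):
        title = item.get("node", {}).get("title")
        if title:  # skip entries without a title instead of raising KeyError
            titles.append(title)
    return titles


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    # Imported here so the parsing helper above works without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def collect_titles(feed_ids, fetch=fetch_json):
    """Map each feed number to the list of titles found in that feed."""
    return {
        feed_id: extract_titles(fetch(API_TEMPLATE.format(feed_id=feed_id)))
        for feed_id in feed_ids
    }
```

Calling collect_titles([7]) would fetch the main feed from the example; add whichever other numbers you find in the Network tab.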

Sometimes websites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to take additional steps as well.

I would suggest a fairly simple approach:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page, "html.parser")  # name the parser explicitly to avoid a bs4 warning
news = [i.text for i in soup.find_all('news:title')]

print(news)

Output:

['DBS named world’s best bank by New York-based financial publication',
 'Russia has very serious questions to answer on Navalny - UK',
 "Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
 'Three militants killed after fatal attack on policeman in Tunisia',
.....]

Also, you can check the XML page for more information if required.
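For pulling more than the titles out of that XML, here is a rough sketch using the standard library's ElementTree. It assumes the file follows the usual Google News sitemap layout (url / loc / news:news / news:title / news:publication_date) with the standard namespace URIs; check the actual XML to confirm. The function name parse_news_sitemap is made up.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for sitemaps and the Google News extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(xml_data):
    """Return one dict per <url> entry with its link, title and publication date."""
    root = ET.fromstring(xml_data)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS),
            "title": url.findtext("news:news/news:title", default="", namespaces=NS),
            "published": url.findtext(
                "news:news/news:publication_date", default="", namespaces=NS
            ),
        })
    return entries
```

You could feed it the bytes from requests.get('https://www.todayonline.com/googlenews.xml').content; missing fields come back as empty strings rather than raising.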

PS: Always check that scraping is permitted before scraping any website :)

There are different ways of gathering the content of a webpage that relies on JavaScript:

  1. Using Selenium with the Firefox web driver
  2. Using a headless browser with PhantomJS
  3. Making an API call using a REST client or the Python requests library

You have to do your research first.
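For the browser-driven options, one thing to keep in mind is that content loaded on scroll only appears if the page is actually scrolled. Below is a rough sketch of a scroll loop, assuming driver is a Selenium WebDriver (any object with an execute_script method works). The function name and its defaults are made up, and a fixed pause is a crude way to wait.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        # Jump to the bottom so the page's JS requests the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds
```

After driver.get(url), you would call scroll_until_stable(driver) and only then read driver.page_source.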
