
Scraping multiple pages with an unchanging URL using BeautifulSoup

I am using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results from the keyword search. The website is designed so that additional results are accessed through the 'more' button (sort of like infinite scroll, but you have to keep clicking 'more' to get the next set of results). When I click 'more' the URL doesn't change, so I cannot just iterate over a different URL each time.

I would really like some help with two things.

  1. Modifying the code below so that I can get data from all of the pages and not just the first 10 results
  2. Inserting a timer function so that the server doesn't block me

I'm adding a photo of what the 'more' button looks like because it's not in English. It's in blue text at the end of the page. [screenshot of the 'more' button]

import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep

# make a GET request (requests.get("URL")) and store the response in a response object (req)
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')

# read the content of the server’s response
rawPagePA = responsePA.text

soupPA = BeautifulSoup(rawPagePA, 'html.parser')
# take a look
print(soupPA.prettify())

urlsPA = [] #creating empty list to store URLs
for item in soupPA.find_all("div", class_= "customStoryCard9-m__story-data__2qgWb"): #select each story card on the results page
    aTag = item.find("a") #grab the first 'a' tag (the article link) inside the card
    urlsPA.append(aTag.attrs["href"])

print(urlsPA) 

#Below I'm getting the data from each of the urls and storing them in a list
PAlist=[]
for link in urlsPA:
    specificpagePA=requests.get(link) #making a GET request and storing the response in an object
    rawAddPagePA=specificpagePA.text # read the content of the server’s response
    PASoup2=BeautifulSoup(rawAddPagePA, 'html.parser') # parse the response into an HTML tree
    PAcontent=PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"]) 
    #print(PAcontent)
    PAlist.append(PAcontent)

You don't actually need Selenium.


The button sends the following GET request:

https://www.prothomalo.com/api/v1/advanced-search?fields=headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards&offset=10&limit=6&q=ধর্ষণ

The important part is the "offset=10&limit=6" at the end; subsequent clicks on the button only increase that offset by 6.

Getting "data from all of the pages" won't work, because there seem to be quite a lot and I don't see an option to determine how many. So you'd better pick a number and request until you have that many links.

As this request returns JSON, you might also be better off just parsing that instead of feeding the HTML to BeautifulSoup.


Have a look at this:

import requests
import json

s = requests.Session()
term = 'ধর্ষণ'
count = 20

# Make GET-Request
r = s.get(
    'https://www.prothomalo.com/api/v1/advanced-search',
    params={
        'offset': 0,
        'limit': count,
        'q': term
    }
)

# Parse the response text (a JSON document)
info = json.loads(r.text)

# Loop over items
urls = [item['url'] for item in info['items']]

print(urls)

This returns the following list:

['https://www.prothomalo.com/world/asia/পাকিস্তানে-সন্তানদের-সামনে-মাকে-ধর্ষণের-মামলায়-দুজনের-মৃত্যুদণ্ড', 'https://www.prothomalo.com/bangladesh/district/খাবার-দেওয়ার-কথা-বদলে-ধর্ষণ-অবসরপ্রাপ্ত-শিক্ষকের-বিরুদ্ধে-মামলা', 'https://www.prothomalo.com/bangladesh/district/জয়পুরহাটে-অপহরণ-ও-ধর্ষণ-মামলায়-যুবকের-যাবজ্জীবন-কারাদণ্ড', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-ধর্ষণ-মামলায়-যুবক-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/সুবর্ণচরে-এত-ধর্ষণ-কেন', 'https://www.prothomalo.com/bangladesh/district/১২-বছরের-ছেলেকে-ধর্ষণ-মামলায়-একজন-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/ভালো-পাত্রের-সঙ্গে-বিয়ে-দেওয়ার-কথা-বলে-কিশোরীকে-ধর্ষণ-গ্রেপ্তার-১', 'https://www.prothomalo.com/bangladesh/district/সখীপুরে-দুই-শিশুকে-ধর্ষণ-মামলার-আসামিকে-গ্রেপ্তারের-দাবিতে-মানববন্ধন', 'https://www.prothomalo.com/bangladesh/district/বগুড়ায়-ছাত্রী-ধর্ষণ-মামলায়-তুফান-সরকারের-জামিন-বাতিল', 'https://www.prothomalo.com/world/india/ধর্ষণ-নিয়ে-মন্তব্যের-জের-ভারতের-প্রধান-বিচারপতির-পদত্যাগ-দাবি', 'https://www.prothomalo.com/bangladesh/district/ফুলগাজীতে-ধর্ষণ-মামলায়-অভিযুক্ত-ইউপি-চেয়ারম্যান-বরখাস্ত', 'https://www.prothomalo.com/bangladesh/district/ধুনটে-ধর্ষণ-মামলায়-ছাত্রলীগ-নেতা-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/নোয়াখালীতে-কিশোরীকে-ধর্ষণ-ভিডিও-ধারণ-ও-অপহরণের-অভিযোগে-গ্রেপ্তার-২', 'https://www.prothomalo.com/bangladesh/district/বাবার-সঙ্গে-দেখা-করানোর-কথা-বলে-স্কুলছাত্রীকে-ধর্ষণ', 'https://www.prothomalo.com/opinion/column/ধর্ষণ-ঠেকাতে-প্রযুক্তির-ব্যবহার', 'https://www.prothomalo.com/world/asia/পার্লামেন্টের-মধ্যে-ধর্ষণ-প্রধানমন্ত্রীর-ক্ষমা-প্রার্থনা', 'https://www.prothomalo.com/bangladesh/district/তাবিজ-দেওয়ার-কথা-বলে-গৃহবধূকে-ধর্ষণ-কবিরাজ-আটক', 'https://www.prothomalo.com/bangladesh/district/আদালত-প্রাঙ্গণে-বিয়ে-করে-জামিন-পেলেন-ধর্ষণ-মামলার-আসামি', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-দল-বেঁধে-ধর্ষণ-ও-ভিডিও-ধারণ-গ্রেপ্তার-৩', 'https://www.prothomalo.com/bangladesh/district/ধর্ষণ-মামলায়-সহকারী-স্টেশনমাস্টার-গ্রেপ্তার']

By adjusting count you can set the number of URLs (articles) to retrieve; term is the search term.

The requests.Session object is used to keep cookies consistent across requests.
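
If you then want the article content itself, you can feed each of these URLs back into your original BeautifulSoup loop. Here is a minimal sketch that continues from the snippet above (it reuses s and urls) and copies the class names from your question, which may of course change whenever the site updates:

from time import sleep

from bs4 import BeautifulSoup

articles = []
for link in urls:
    page = s.get(link)  # reuse the session from above for each article page
    tree = BeautifulSoup(page.text, 'html.parser')
    # class names copied from the question; they may break if the site changes
    content = tree.find_all(class_=[
        "story-element story-element-text",
        "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX",
        "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S",
        "contributor-name contributor-m__contributor-name__1-593",
    ])
    articles.append(content)
    sleep(2)  # short pause between article requests to go easy on the server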


If you have any questions, feel free to ask.


Edit:

  1. Just in case you are wondering how I found out which GET request was being sent when the button is clicked: I went to the Network Analysis tab in my browser's developer tools (Firefox), clicked the button, observed which requests were being sent and copied that URL:

    [screenshot: the request as shown in the Network Analysis tab]

  2. A note on the params argument of the .get function: it contains (as a Python dictionary) all the parameters that would normally be appended to the URL after the question mark. So

    requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')

    can be written as

    requests.get('https://www.prothomalo.com/search', params={'q': 'ধর্ষণ'})

    which is a lot nicer to look at, and you can actually see what you are searching for, because it's written in Unicode and not already percent-encoded for the URL.
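
    If you want to verify the encoding without sending a request, you can build the request and inspect the URL that requests would produce, for example:

    from requests import Request

    # prepare() builds the final, percent-encoded URL without sending anything
    prepared = Request('GET', 'https://www.prothomalo.com/search',
                       params={'q': 'ধর্ষণ'}).prepare()
    print(prepared.url)
    # https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3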


Edit:
If the script starts returning an empty JSON file and thus no URLs, you probably have to set a User-Agent, like so (I used the one for Firefox, but any browser should be fine):

s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})

Just put that code below the line where the session object is initialized (the s = ... line).
A User-Agent tells the site what kind of program is accessing its data.

Always keep in mind that the server has other things to do as well and that the website has higher priorities than sending thousands of search results to a single person, so try to keep the traffic as low as possible. Scraping 5,000 URLs is a lot, and if you really have to do it multiple times, put a sleep(...) of at least a few seconds before each request (not just to avoid getting blocked, but to be nice to the people who provide the information you are requesting).
Where you put the sleep does not really matter, as the only time you actually contact the server is the s.get(...) line.
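
Putting both points together, here is a minimal sketch of how you could page through the API in chunks with a pause between requests. The chunk size of 10, the 3-second pause and the total of 50 are arbitrary choices, not values the site requires:

from time import sleep

import requests

s = requests.Session()
term = 'ধর্ষণ'
target = 50   # how many links you want in total (arbitrary)
chunk = 10    # how many results to request per call (arbitrary)

urls = []
offset = 0
while len(urls) < target:
    r = s.get(
        'https://www.prothomalo.com/api/v1/advanced-search',
        params={'offset': offset, 'limit': chunk, 'q': term}
    )
    items = r.json().get('items', [])
    if not items:  # no more results, stop early
        break
    urls.extend(item['url'] for item in items)
    offset += chunk
    sleep(3)      # pause between requests to go easy on the server

print(urls[:target])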

This is where you add Selenium to bs4: you automate the click so the site loads the additional results, then get the page content.

You can download the geckodriver from the link.

Mock code will look like this:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3"

driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

# Iterate over this in a loop for however many times you want to click 'more'.
# If the data takes time to load, add time.sleep() to wait for the page.
for _ in range(5):  # 5 clicks is an arbitrary example
    driver.find_element_by_css_selector('{class-name}').click()
    time.sleep(2)

# Then you just get the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')

# now you have the content loaded with BeautifulSoup and can manipulate it
# as you were doing previously
{YOUR CODE}
