I am using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results from the keyword search. The website is designed so that additional results are accessed through the 'more' button (sort of like an infinite scroll, but you have to keep clicking 'more' to get the next set of results). When I click 'more' the URL doesn't change, so I cannot just iterate over a different URL each time.
I would really like some help with two things.
I'm adding a photo of what the 'more' button looks like because it's not in English. It's in blue text at the end of the page.
import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep
# make a GET request (requests.get("URL")) and store the response in a response object
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
# read the content of the server’s response
rawPagePA = responsePA.text
# name the parser explicitly so the result does not depend on what is installed
soupPA = BeautifulSoup(rawPagePA, 'html.parser')
# take a look
print(soupPA.prettify())
urlsPA = []  # empty list to store URLs
# select every story card on the results page
for item in soupPA.find_all("div", class_="customStoryCard9-m__story-data__2qgWb"):
    aTag = item.find("a")  # first 'a' tag inside the card
    urlsPA.append(aTag.attrs["href"])
print(urlsPA)
# Below I'm getting the data from each of the URLs and storing it in a list
PAlist = []
for link in urlsPA:
    specificpagePA = requests.get(link)  # make a GET request and store the response
    rawAddPagePA = specificpagePA.text  # read the content of the server’s response
    PASoup2 = BeautifulSoup(rawAddPagePA, 'html.parser')  # parse the response into an HTML tree
    PAcontent = PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"])
    # print(PAcontent)
    PAlist.append(PAcontent)
You don't actually need Selenium.
The button sends the following GET request:
https://www.prothomalo.com/api/v1/advanced-search?fields=headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards&offset=10&limit=6&q=ধর্ষণ
The important part is the "offset=10&limit=6" near the end; each subsequent click on the button only increases that offset by 6.
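To make the offset pattern concrete, here is a minimal sketch of the values the button would send on successive clicks. The function name click_offsets and its parameters are made up for illustration; the starting offset of 10 and the step of 6 are taken from the request above.

```python
def click_offsets(clicks, first_offset=10, step=6):
    """Return the offset sent by each of the first `clicks` button presses.

    first_offset and step mirror the observed request (offset=10, limit=6);
    the function itself is a hypothetical helper, not part of any library.
    """
    return [first_offset + i * step for i in range(clicks)]


print(click_offsets(4))  # [10, 16, 22, 28]
```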
Getting data from all of the pages won't work, because there seem to be quite a lot of them and I don't see a way to determine how many. So you're better off picking a number and requesting until you have that many links.
As this request returns JSON, you may also be better off parsing that directly instead of feeding the HTML to BeautifulSoup.
Have a look at that:
import requests
import json
s = requests.Session()
term = 'ধর্ষণ'
count = 20
# Make GET request
r = s.get(
    'https://www.prothomalo.com/api/v1/advanced-search',
    params={
        'offset': 0,
        'limit': count,
        'q': term
    }
)
# Read response text (a JSON document)
info = json.loads(r.text)
# Loop over items
urls = [item['url'] for item in info['items']]
print(urls)
This returns the following list:
['https://www.prothomalo.com/world/asia/পাকিস্তানে-সন্তানদের-সামনে-মাকে-ধর্ষণের-মামলায়-দুজনের-মৃত্যুদণ্ড', 'https://www.prothomalo.com/bangladesh/district/খাবার-দেওয়ার-কথা-বদলে-ধর্ষণ-অবসরপ্রাপ্ত-শিক্ষকের-বিরুদ্ধে-মামলা', 'https://www.prothomalo.com/bangladesh/district/জয়পুরহাটে-অপহরণ-ও-ধর্ষণ-মামলায়-যুবকের-যাবজ্জীবন-কারাদণ্ড', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-ধর্ষণ-মামলায়-যুবক-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/সুবর্ণচরে-এত-ধর্ষণ-কেন', 'https://www.prothomalo.com/bangladesh/district/১২-বছরের-ছেলেকে-ধর্ষণ-মামলায়-একজন-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/ভালো-পাত্রের-সঙ্গে-বিয়ে-দেওয়ার-কথা-বলে-কিশোরীকে-ধর্ষণ-গ্রেপ্তার-১', 'https://www.prothomalo.com/bangladesh/district/সখীপুরে-দুই-শিশুকে-ধর্ষণ-মামলার-আসামিকে-গ্রেপ্তারের-দাবিতে-মানববন্ধন', 'https://www.prothomalo.com/bangladesh/district/বগুড়ায়-ছাত্রী-ধর্ষণ-মামলায়-তুফান-সরকারের-জামিন-বাতিল', 'https://www.prothomalo.com/world/india/ধর্ষণ-নিয়ে-মন্তব্যের-জের-ভারতের-প্রধান-বিচারপতির-পদত্যাগ-দাবি', 'https://www.prothomalo.com/bangladesh/district/ফুলগাজীতে-ধর্ষণ-মামলায়-অভিযুক্ত-ইউপি-চেয়ারম্যান-বরখাস্ত', 'https://www.prothomalo.com/bangladesh/district/ধুনটে-ধর্ষণ-মামলায়-ছাত্রলীগ-নেতা-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/নোয়াখালীতে-কিশোরীকে-ধর্ষণ-ভিডিও-ধারণ-ও-অপহরণের-অভিযোগে-গ্রেপ্তার-২', 'https://www.prothomalo.com/bangladesh/district/বাবার-সঙ্গে-দেখা-করানোর-কথা-বলে-স্কুলছাত্রীকে-ধর্ষণ', 'https://www.prothomalo.com/opinion/column/ধর্ষণ-ঠেকাতে-প্রযুক্তির-ব্যবহার', 'https://www.prothomalo.com/world/asia/পার্লামেন্টের-মধ্যে-ধর্ষণ-প্রধানমন্ত্রীর-ক্ষমা-প্রার্থনা', 'https://www.prothomalo.com/bangladesh/district/তাবিজ-দেওয়ার-কথা-বলে-গৃহবধূকে-ধর্ষণ-কবিরাজ-আটক', 'https://www.prothomalo.com/bangladesh/district/আদালত-প্রাঙ্গণে-বিয়ে-করে-জামিন-পেলেন-ধর্ষণ-মামলার-আসামি', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-দল-বেঁধে-ধর্ষণ-ও-ভিডিও-ধারণ-গ্রেপ্তার-৩', 
'https://www.prothomalo.com/bangladesh/district/ধর্ষণ-মামলায়-সহকারী-স্টেশনমাস্টার-গ্রেপ্তার']
By adjusting count you can set the number of URLs (articles) to retrieve; term is the search term.
The requests.Session object is used to keep cookies consistent across requests.
If you have any questions, feel free to ask.
Edit:
Just in case you are wondering how I found out which GET request was being sent by clicking the button: I went to the Network Analysis tab in the developer tools of my browser (Firefox), clicked the button, observed which requests were being sent and copied that URL.
Another explanation, for the params parameter of the .get function: it contains (as a Python dictionary) all the parameters that would normally be appended to the URL after the question mark. So
requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
can be written as
requests.get('https://www.prothomalo.com/search', params={'q': 'ধর্ষণ'})
which makes it a lot nicer to look at and you can actually see what you are searching for, because it's written in unicode and not already encoded for the URL.
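As a quick check of that equivalence, the sketch below uses only the standard library's urllib.parse to percent-encode the same dictionary. It is meant to illustrate the encoding step, under the assumption that requests builds its query string the same way; it does not contact the site.

```python
from urllib.parse import urlencode, unquote

term = 'ধর্ষণ'

# Percent-encode the query the way it appears in the long URL
query = urlencode({'q': term})
print(query)  # q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3

# Decoding gives back the readable search term
print(unquote(query))
```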
Edit:
If the script starts returning an empty JSON-file and thus no URLs, you probably have to set a User-Agent like so (I used the one for Firefox, but any browser should be fine):
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})
Just put that code below the line where the session object is initialized (the s =... line).
A User-Agent tells the site what kind of program is accessing their data.
Always keep in mind that the server has other work to do and that the site has higher priorities than sending thousands of search results to a single person, so try to keep your traffic as low as possible. Scraping 5000 URLs is a lot, and if you really have to do it multiple times, put a sleep(...) of at least a few seconds before each subsequent request (not just to avoid getting blocked, but to be considerate of the people who provide the information you request).
Where you put the sleep does not really matter, as the only time you're actually contacting the server is the s.get(...) line.
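Putting the offset and the sleep together, one possible shape for such a loop is sketched below. The names collect_urls and fetch_page are hypothetical, and fetch_page is assumed to return the list found under 'items' in the JSON response; the three-second default delay is just an example value.

```python
from time import sleep


def collect_urls(fetch_page, wanted, page_size=20, delay=3.0):
    """Request pages of results until `wanted` URLs have been collected.

    fetch_page(offset, limit) is a caller-supplied (hypothetical) function
    that returns a list of result dicts, each containing a 'url' key.
    """
    urls = []
    offset = 0
    while len(urls) < wanted:
        items = fetch_page(offset, page_size)
        if not items:  # no more results: stop instead of looping forever
            break
        urls.extend(item['url'] for item in items)
        offset += len(items)
        if len(urls) < wanted:
            sleep(delay)  # pause before the next request to spare the server
    return urls[:wanted]


# Possible use with the session `s` and `term` from the script above:
# fetch = lambda offset, limit: s.get(
#     'https://www.prothomalo.com/api/v1/advanced-search',
#     params={'offset': offset, 'limit': limit, 'q': term},
# ).json()['items']
# urls = collect_urls(fetch, wanted=100)
```

Passing the fetching function in as an argument keeps the loop itself free of network code, so it can be tried out with a fake function before pointing it at the real site.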
This is where you combine Selenium with bs4: use Selenium to click the button so the site loads more results, then read the page content.
You can download the geckodriver from link
Mock code will look like this:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
url = "https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3"
# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Firefox(service=Service(r'geckodriver.exe'))
driver.get(url)
# Repeat this click in a loop, once for every time you want to press 'more'.
# If the data takes a while to load, add time.sleep() after each click so the page can finish loading.
driver.find_element(By.CSS_SELECTOR, '{class-name}').click()
# Then you just get the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Now the content is loaded into Beautiful Soup and you can work with it as you were doing previously
{YOUR CODE}