I'm trying to scrape all the article links from a site and have been successful in doing so.
The page has a Show more button for loading more articles. I'm using Selenium to click this button, which also works. The problem is that clicking Show more doesn't change the URL of the page, so I'm only able to scrape the initial links displayed by default.
Here is the code snippet:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(executable_path='/home/Downloads/chromedriver_linux64/chromedriver', options=options)

startWebDriver()
count = 0
s = set()
driver.get('https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101')
time.sleep(4)
element = driver.find_element_by_xpath('//*[@id="site-content"]/div/div/div[2]/div[2]/div/button')
while count < 10:
    element.click()
    time.sleep(4)
    count += 1
url = driver.current_url
I expect to get all article links displayed on the page after clicking on Show More 10 times.
It seems like your target resource gives us a nice API for its articles.
It will be much easier to use it instead of Selenium.
You can open that page in Chrome, then open DevTools -> Network. Click on "Show more" and you will see an API request named v2 (it looks like a GraphQL gateway).
Something like
{
  "operationName": "SearchRootQuery",
  "variables": {
    "first": 10,
    "sort": "best",
    "beginDate": "20180101",
    "endDate": "20181231",
    "text": "trump" ...
  }
}
You can mimic that request but ask for as many "first" articles as you want.
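For instance, the payload above can be built programmatically and serialized with `json.dumps`, raising `first` to fetch more results per call. This is only a sketch under the assumption that the endpoint accepts the same variable names shown in the Network tab; whether it honors an arbitrary `first` value needs to be verified against a real response:

```python
import json

def build_search_payload(first=10, query="trump",
                         begin_date="20180101", end_date="20181231"):
    """Build a SearchRootQuery body mirroring the captured request.

    The variable names copy the DevTools capture; accepting a larger
    "first" value is an assumption to verify against the live API.
    """
    return json.dumps({
        "operationName": "SearchRootQuery",
        "variables": {
            "first": first,          # ask for more articles per request
            "sort": "best",
            "beginDate": begin_date,
            "endDate": end_date,
            "text": query,
        },
    })

payload = build_search_payload(first=50)
```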
EDIT:
You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal to see how it works.
After that you can use a library like requests to do the same from your code.
Here is a mimic of the POST request using the API info as seen in the Network tab. I have stripped it back to the headers that seem to be required.
import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}
data = '''
{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}
'''

with requests.Session() as session:
    response = session.post(url, headers=headers, data=data)
    print(response.json())
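Once the JSON comes back, the article URLs still have to be dug out of the response. Since the exact GraphQL response shape isn't shown above, here is a defensive sketch that recursively walks any nested dict/list and collects string values stored under a hypothetical "url" key; inspect one real response in DevTools to confirm the actual key name:

```python
def extract_urls(node, key="url"):
    """Recursively collect string values stored under `key`.

    The "url" key name is an assumption about the GraphQL response
    structure, not something confirmed by the API docs.
    """
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                found.append(v)
            else:
                found.extend(extract_urls(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_urls(item, key))
    return found
```

For example, `extract_urls(response.json())` would return every string found under a "url" key anywhere in the payload.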
To scrape all the article links, i.e. the href attributes, from the URL while clicking the button with text SHOW MORE, you can use the following solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101")
myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]"))))
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        break
for title in titles:
    print(title.get_attribute("href"))
driver.quit()
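If the XPath happens to match the same anchor more than once, the final print loop can emit duplicate hrefs. A small order-preserving de-duplication helper guards against that (this is generic Python, not tied to Selenium):

```python
def unique_in_order(hrefs):
    """Drop duplicate or empty hrefs while keeping first-seen order."""
    # dict preserves insertion order in Python 3.7+, so its keys
    # act as an ordered set here
    return list(dict.fromkeys(h for h in hrefs if h))

links = unique_in_order([
    "https://www.nytimes.com/a",
    "https://www.nytimes.com/b",
    "https://www.nytimes.com/a",
    None,
])
```

In the solution above you would call it as `unique_in_order(title.get_attribute("href") for title in titles)` before printing.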