Scraping data from a site where URL doesn't change on clicking 'Show More'
I'm trying to scrape all the article links from a site, and I have been successful in doing so.
The site page has a "Show more" button for loading more articles.
I'm using Selenium to click on this button, which also works.
The problem is that clicking on "Show more" doesn't change the URL of the page, so I'm only able to scrape the initial links displayed by default.
Here is the code snippet:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(executable_path='/home/Downloads/chromedriver_linux64/chromedriver', options=options)

startWebDriver()
count = 0
s = set()
driver.get('https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101')
time.sleep(4)
element = driver.find_element_by_xpath('//*[@id="site-content"]/div/div/div[2]/div[2]/div/button')
while count < 10:
    element.click()
    time.sleep(4)
    count = count + 1
url = driver.current_url
I expect to get all the article links displayed on the page after clicking on "Show More" 10 times.
It seems like your target resource gives us a nice API for its articles.
It will be much easier to use that instead of Selenium.
You can open that page in Chrome, then open Dev Tools -> Network. Click on "Show more" and you can see an API request named v2 (it looks like a GraphQL gateway).
Something like:
{
    "operationName": "SearchRootQuery",
    "variables": {
        "first": 10,
        "sort": "best",
        "beginDate": "20180101",
        "endDate": "20181231",
        "text": "trump" ...
    }
}
You can mimic that request but ask for as many "first" articles as you want.
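To illustrate, the captured request body can be rebuilt as a Python dict and serialized with a larger "first" value. The field names below are taken from the request shown above; the value of 100 is just an example, and the API's actual upper limit is not known from this page.

```python
import json

# Rebuild the captured GraphQL request body; field names come from the
# request observed in the Network tab, "first" is bumped from 10 to 100.
payload = {
    "operationName": "SearchRootQuery",
    "variables": {
        "first": 100,  # ask for 100 articles in one request instead of 10
        "sort": "best",
        "beginDate": "20180101",
        "endDate": "20181231",
        "text": "trump",
    },
}
body = json.dumps(payload)
print(body)
```

This `body` string could then be sent as the POST data in place of the hard-coded one below.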
EDIT:
You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal. That way you can see how it works.
After that you can use a library like requests to do it from your code.
Here is a mimic of the POST request using the API info as I see it in the Network tab. I have stripped it back to the headers that seem to be required.
import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}
data = '''
{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}
'''
with requests.Session() as s:
    response = s.post(url, headers=headers, data=data)
    print(response.json())
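The exact shape of the JSON response isn't shown here, so rather than guessing the schema, a small recursive helper can pull every value stored under a given key (e.g. `url`) out of the nested payload. The sample data below is purely illustrative, not an actual API response:

```python
def extract_urls(node, key="url"):
    """Recursively collect every string value stored under `key`
    anywhere inside a nested dict/list structure."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                found.append(v)
            else:
                found.extend(extract_urls(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_urls(item, key))
    return found

# Illustrative payload only -- the real SearchRootQuery response shape may differ.
sample = {
    "data": {
        "search": {
            "hits": [
                {"node": {"url": "https://www.nytimes.com/2018/01/01/a.html"}},
                {"node": {"url": "https://www.nytimes.com/2018/06/15/b.html"}},
            ]
        }
    }
}
print(extract_urls(sample))
# → ['https://www.nytimes.com/2018/01/01/a.html', 'https://www.nytimes.com/2018/06/15/b.html']
```

Because it walks the structure blindly, this works even if the nesting changes, as long as article links are stored under a `url` key.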
To scrape all the article links, i.e. the href attributes, from the page while clicking on the button with text SHOW MORE, you can use the following solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")
myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]"))))
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        break
for title in titles:
    print(title.get_attribute("href"))
driver.quit()
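Since several of the matched elements can point at the same article, the printed hrefs may contain repeats. A small helper, independent of Selenium, can de-duplicate the collected links while preserving first-seen order (the sample links below are illustrative):

```python
def dedupe_links(links):
    """Return links with empty values and duplicates removed,
    preserving first-seen order."""
    seen = set()
    unique = []
    for link in links:
        if link and link not in seen:  # skip None/empty hrefs and repeats
            seen.add(link)
            unique.append(link)
    return unique

# Example with a repeated href, as the loop above can produce:
print(dedupe_links([
    "https://www.nytimes.com/2018/01/01/a.html",
    "https://www.nytimes.com/2018/01/01/a.html",
    None,
    "https://www.nytimes.com/2018/06/15/b.html",
]))
# → ['https://www.nytimes.com/2018/01/01/a.html', 'https://www.nytimes.com/2018/06/15/b.html']
```

In the loop above, you would call it as `dedupe_links(title.get_attribute("href") for title in titles)` before printing.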