
Scraping data from a site where URL doesn't change on clicking 'Show More'

I'm trying to scrape all the article links from a site, and I have been successful in doing so.

The site page has a "Show more" button for loading more articles.

I'm using Selenium to click this button, which also works.

The problem is that clicking "Show more" doesn't change the URL of the page, so I'm only able to scrape the initial links displayed by default.

Here is the code snippet:

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(executable_path='/home/Downloads/chromedriver_linux64/chromedriver', options=options)

startWebDriver()
count = 0
s = set()

driver.get('https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101')
time.sleep(4)
element = driver.find_element_by_xpath('//*[@id="site-content"]/div/div/div[2]/div[2]/div/button')

while count < 10:
    element.click()
    time.sleep(4)
    count = count + 1

url = driver.current_url

I expect to get all the article links displayed on the page after clicking "Show More" 10 times.

It seems your target resource provides a nice API for its articles.

It will be much easier to use it instead of Selenium.

You can open that page in Chrome, then open Dev Tools -> Network. Click on "Show more" and you can see an API request named v2 (it looks like a GraphQL gateway).

Something like:

{
    "operationName": "SearchRootQuery",
    "variables": {
        "first": 10,
        "sort": "best",
        "beginDate": "20180101",
        "endDate": "20181231",
        "text": "trump",
        ...
    }
}

You can mimic that request but ask for as many "first" articles as you want.
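As a minimal offline sketch of that idea, the request body with a larger "first" could be rebuilt like this. The field names are the ones visible in the captured request above; the real payload carries additional fields (the elided `...`), so treat this as an illustration, not the full body:

```python
import json

# Sketch only: rebuild the GraphQL "variables" with a larger "first" so a
# single request returns more articles. The real payload has more fields.
variables = {
    "first": 100,        # ask for 100 results instead of the default 10
    "sort": "best",
    "beginDate": "20180101",
    "endDate": "20181231",
    "text": "trump",
}
payload = json.dumps({"operationName": "SearchRootQuery", "variables": variables})
print(payload)
```

The resulting string can then be sent as the POST body in place of the captured one.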

EDIT:

You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal to see how it works.

After that, you can use a library like requests to do the same from your code.

Here is a mimic of the POST request using the API info as I see it in the Network tab. I have stripped it back to the headers that seem to be required.

import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}

data = '''
{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}
'''

with requests.Session() as session:
    resp = session.post(url, headers=headers, data=data)
    print(resp.json())
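Once the response comes back, the article URLs still have to be pulled out of the nested JSON. The response schema isn't shown here, so the path below (`data -> search -> hits -> edges -> node -> url`) is purely an assumed shape for illustration, exercised against a hand-built sample:

```python
# Assumed response shape; the real GraphQL schema may differ, so check the
# actual JSON in DevTools before relying on these field names.
sample_response = {
    "data": {"search": {"hits": {"edges": [
        {"node": {"url": "https://www.nytimes.com/2018/01/01/example-one.html"}},
        {"node": {"url": "https://www.nytimes.com/2018/01/02/example-two.html"}},
    ]}}}
}

def extract_urls(resp):
    # Walk the assumed edges list and collect each node's url.
    edges = resp["data"]["search"]["hits"]["edges"]
    return [edge["node"]["url"] for edge in edges]

print(extract_urls(sample_response))
```

With the real response, `extract_urls(resp.json())` would replace the sample dict.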

To scrape all the article links, i.e. the href attributes, from the URL while clicking on the link with text as SHOW MORE, you can use the following solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101")
titles = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")))
myLength = len(titles)

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        break

for title in titles:
    print(title.get_attribute("href"))
driver.quit()
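One small follow-up: the question's snippet sets up `s = set()` but never uses it, and hrefs gathered across repeated "Show More" clicks can repeat. A set deduplicates them; here is a minimal sketch with stand-in values:

```python
# Stand-in hrefs; in the real script these come from title.get_attribute("href").
collected = [
    "https://www.nytimes.com/article-a",
    "https://www.nytimes.com/article-b",
    "https://www.nytimes.com/article-a",  # re-collected after another click
]
unique_links = set(collected)
for link in sorted(unique_links):
    print(link)
```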

