Scraping data from a site where URL doesn't change on clicking 'Show More'
I'm trying to scrape all the article links from a site, and I have been successful in doing so.
The site page has a "Show more" button for loading more articles.
I'm using Selenium to click on this button, which also works.
The problem is that clicking on "Show more" doesn't change the URL of the page, so I'm only able to scrape the initial links displayed by default.
Here is the code snippet:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(executable_path='/home/Downloads/chromedriver_linux64/chromedriver', options=options)

startWebDriver()
count = 0
s = set()
driver.get('https://www.nytimes.com/search?endDate=20181231&query=trump&sort=best&startDate=20180101')
time.sleep(4)
element = driver.find_element_by_xpath('//*[@id="site-content"]/div/div/div[2]/div[2]/div/button')
while count < 10:
    element.click()
    time.sleep(4)
    count = count + 1
url = driver.current_url
I expect to get all the article links displayed on the page after clicking on "Show More" 10 times.
It seems like your target resource gives us a nice API for its articles.
It will be much easier to use that instead of Selenium.
You can open that page in Chrome, then open Dev Tools -> Network. Click on "Show more" and you can see an API request named v2 (it looks like a GraphQL gateway).
Something like:
{
    "operationName": "SearchRootQuery",
    "variables": {
        "first": 10,
        "sort": "best",
        "beginDate": "20180101",
        "endDate": "20181231",
        "text": "trump" ...
    }
}
You can mimic that request but ask for as many "first" articles as you want.
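To illustrate, the captured request body can be rebuilt as a Python dict and serialized with a larger "first" value. The field names below are taken from the request shown above; the value of 100 is just an example, and the API's actual upper limit is not known from this page.

```python
import json

# Rebuild the captured GraphQL request body; field names come from the
# request observed in the Network tab, "first" is bumped from 10 to 100.
payload = {
    "operationName": "SearchRootQuery",
    "variables": {
        "first": 100,  # ask for 100 articles in one request instead of 10
        "sort": "best",
        "beginDate": "20180101",
        "endDate": "20181231",
        "text": "trump",
    },
}
body = json.dumps(payload)
print(body)
```

This `body` string could then be sent as the POST data in place of the hard-coded one below.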
EDIT:
You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal. That way you can see how it works.
After that you can use a library like requests to do it from your code.
Here is a mimic of the POST request using the API info as I see it in the Network tab. I have stripped it back to the headers that seem to be required.
import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}
data = '''
{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}
'''
with requests.Session() as s:
    response = s.post(url, headers=headers, data=data)
    print(response.json())
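The exact shape of the JSON response isn't shown here, so rather than guessing the schema, a small recursive helper can pull every value stored under a given key (e.g. `url`) out of the nested payload. The sample data below is purely illustrative, not an actual API response:

```python
def extract_urls(node, key="url"):
    """Recursively collect every string value stored under `key`
    anywhere inside a nested dict/list structure."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                found.append(v)
            else:
                found.extend(extract_urls(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_urls(item, key))
    return found

# Illustrative payload only -- the real SearchRootQuery response shape may differ.
sample = {
    "data": {
        "search": {
            "hits": [
                {"node": {"url": "https://www.nytimes.com/2018/01/01/a.html"}},
                {"node": {"url": "https://www.nytimes.com/2018/06/15/b.html"}},
            ]
        }
    }
}
print(extract_urls(sample))
# → ['https://www.nytimes.com/2018/01/01/a.html', 'https://www.nytimes.com/2018/06/15/b.html']
```

Because it walks the structure blindly, this works even if the nesting changes, as long as article links are stored under a `url` key.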
To scrape all the article links, i.e. the href attributes, from the page while clicking on the button with text SHOW MORE, you can use the following solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")
myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]"))))
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        break
for title in titles:
    print(title.get_attribute("href"))
driver.quit()
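Since several of the matched elements can point at the same article, the printed hrefs may contain repeats. A small helper, independent of Selenium, can de-duplicate the collected links while preserving first-seen order (the sample links below are illustrative):

```python
def dedupe_links(links):
    """Return links with empty values and duplicates removed,
    preserving first-seen order."""
    seen = set()
    unique = []
    for link in links:
        if link and link not in seen:  # skip None/empty hrefs and repeats
            seen.add(link)
            unique.append(link)
    return unique

# Example with a repeated href, as the loop above can produce:
print(dedupe_links([
    "https://www.nytimes.com/2018/01/01/a.html",
    "https://www.nytimes.com/2018/01/01/a.html",
    None,
    "https://www.nytimes.com/2018/06/15/b.html",
]))
# → ['https://www.nytimes.com/2018/01/01/a.html', 'https://www.nytimes.com/2018/06/15/b.html']
```

In the loop above, you would call it as `dedupe_links(title.get_attribute("href") for title in titles)` before printing.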