I am trying to build a web scraper. The goal is to download the PDFs that can be accessed via a series of links on a webpage. Currently, I am trying to retrieve the URLs pointing to the PDF files, so that I can feed them to, e.g., node download helper (or maybe wget). Ideally, I would end up with an array of the different links that I can then iterate over.
Currently, the function looks like this:
function scrape() {
    driver.get('https://examplelink.com/pagewheretofindthedifferentlinks')
        .then(function () {
            // all the links share the same pattern, let's say 'ABCD.BLABLA.BLABLA'
            return driver.findElements(By.partialLinkText('ABCD.'));
        })
        .then(function (links) {
            console.log(links[0].getAttribute('href'));
        });
}
For one reason or another, this logs:
Promise { <pending> }
I have tried many different async/await variants, but nothing seems to work.
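For reference, the `Promise { <pending> }` output appears because `getAttribute('href')` itself returns a promise, not a string, so it has to be resolved before logging. Below is a minimal sketch of that resolution pattern; the mock elements (with hypothetical URLs) stand in for Selenium WebElements so it runs without a browser:

```javascript
// Mock objects standing in for Selenium WebElements: getAttribute()
// returns a promise, just like the real WebElement API does.
const mockElements = [
  { getAttribute: (name) => Promise.resolve('https://example.com/ABCD.one.pdf') },
  { getAttribute: (name) => Promise.resolve('https://example.com/ABCD.two.pdf') },
];

// Map every element to the promise of its href, then wait for all of them
// to resolve so we get plain strings instead of pending promises.
function collectHrefs(elements) {
  return Promise.all(elements.map((el) => el.getAttribute('href')));
}

collectHrefs(mockElements).then((hrefs) => {
  hrefs.forEach((href) => console.log('This is link: ' + href));
});
```

With a real driver, `elements` would be the array returned by `driver.findElements(...)`; the `Promise.all` pattern is the same.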
I have also tried clicking the link and then using driver.getCurrentUrl(), but this just returns the URL of the original page ('https://xxx.be/xxx'), not the URL of the tabs that are opened, which would force me to implement a function that switches the driver between the different tabs...
Thank you in advance!
OK, this is how I currently got it working:
// function for finding hrefs
function findHref(array, input) {
    // getAttribute() already returns a promise; driver.wait() resolves it
    var href = driver.wait(array[input].getAttribute('href'));
    return href;
}
//delete all cookies
//driver.manage().deleteAllCookies();
//navigate and find the links
function scrape() {
    console.log('Starting scrape process');
    driver.get('https://blabla.com/blabla')
        .then(function () {
            return driver.findElements(By.tagName('a'));
        })
        .then(function (links) {
            for (var i = 0; i < links.length; i++) {
                findHref(links, i)
                    .then(function (href) {
                        console.log('This is link: ' + href);
                    });
            }
        });
}
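The same flow reads more cleanly with async/await, and it also yields the array of links the question asks for. This is a hedged sketch, not a verified drop-in: `fakeDriver` below mocks the two driver calls used (with a hypothetical URL) so the flow runs without a browser; with a real selenium-webdriver instance the function body should work unchanged, apart from the selector noted in the comment.

```javascript
// async/await version: navigate, find the anchors, and resolve every
// href promise before logging, returning the collected array.
async function scrape(driver) {
  await driver.get('https://blabla.com/blabla');
  // with real selenium-webdriver this would be By.tagName('a')
  const links = await driver.findElements('a');
  const hrefs = await Promise.all(links.map((el) => el.getAttribute('href')));
  for (const href of hrefs) {
    console.log('This is link: ' + href);
  }
  return hrefs; // the array you can later feed to wget or a download helper
}

// Minimal mock of the two driver methods used above (hypothetical URL),
// so the flow can be exercised without a browser.
const fakeDriver = {
  get: async (url) => {},
  findElements: async (selector) => [
    { getAttribute: async (name) => 'https://blabla.com/ABCD.file.pdf' },
  ],
};

scrape(fakeDriver);
```

Because everything is awaited inside one async function, there is no place left for a stray pending promise to leak into console.log.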