
Scraping a website that has certain problems

I want to scrape this website and collect all articles by this author using Python (the requests or Selenium library), then put them in a PDF file.
However, after I click the "Show More" button at the bottom of the page 8 times, it no longer displays more articles, so I can't reach all of them. (My idea was to automate Selenium to click the button until all articles are shown, and then scrape them all.) Is there a workaround? Are there alternative ways to access all the articles chronologically and scrape them?
My idea was to somehow check whether the links come from an alternative source, but I'm clueless. I did, however, successfully scrape the articles that are displayed.
Thanks in advance!

Use findElements to search for <h2 class="css-1j9dxys e1xfvim30">...</h2>, which gives you a list of all titles. Each time you click Show More, the list grows by 10 or so. So the idea is simply to keep clicking the button until the size of the list stops changing. Use a while loop, something like:

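// Assumes `driver` is an initialized Selenium 4 WebDriver. Imports needed:
// java.time.Duration, java.util.List, org.openqa.selenium.By,
// org.openqa.selenium.TimeoutException, org.openqa.selenium.WebElement,
// org.openqa.selenium.support.ui.WebDriverWait
By titles = By.cssSelector("h2.css-1j9dxys.e1xfvim30");
By showMore = By.xpath("//button[text()='Show More']");
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));

while (true) {
    int oldSize = driver.findElements(titles).size();
    List<WebElement> buttons = driver.findElements(showMore);
    if (buttons.isEmpty()) break;   // button is gone: nothing left to load
    buttons.get(0).click();
    try {
        // Wait for the click to add titles; a timeout means the list stopped growing.
        wait.until(d -> d.findElements(titles).size() > oldSize);
    } catch (TimeoutException e) {
        break;
    }
}
List<WebElement> allTitles = driver.findElements(titles);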

I might have some mistakes in the code but the idea is there. Good luck!
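
If it helps, here is the same "click until the count stops changing" idea pulled out of Selenium so it can be tried on its own. This is only a sketch: the class name LoadAllItems, the method clickUntilStable and the maxClicks safety cap are names I made up for illustration, and the "page" in main is simulated with a plain counter rather than a real browser.

```java
import java.util.function.IntSupplier;

class LoadAllItems {
    /**
     * Runs `click` repeatedly until `countItems` stops growing or
     * `maxClicks` is reached. Returns how many clicks actually added items.
     */
    static int clickUntilStable(IntSupplier countItems, Runnable click, int maxClicks) {
        int productiveClicks = 0;
        int previous = countItems.getAsInt();
        for (int i = 0; i < maxClicks; i++) {
            click.run();
            int current = countItems.getAsInt();
            if (current == previous) {
                break;              // the click added nothing: stop
            }
            previous = current;
            productiveClicks++;
        }
        return productiveClicks;
    }

    public static void main(String[] args) {
        // Simulated page (an assumption for the demo): starts with 10 titles,
        // each click adds 10 more, and the page stops growing at 90.
        int[] titles = {10};
        int clicks = clickUntilStable(
                () -> titles[0],
                () -> titles[0] = Math.min(titles[0] + 10, 90),
                50);
        System.out.println(clicks + " productive clicks, " + titles[0] + " titles loaded");
    }
}
```

With Selenium you would pass something like () -> driver.findElements(titles).size() as the counter and () -> driver.findElement(showMore).click() as the click, plus a wait after each click so the new titles have time to load. The cap is just there so the loop cannot run forever if the page keeps serving items.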
