
How to open a list of links and scrape the text with Selenium

I am new to programming in Python and I want to write code to scrape the text of articles on Reuters using Selenium. I'm trying to open the article links and then get the full text of each article, but it doesn't work. I would be glad if somebody could help me.

article_links1 = []

for link in driver.find_elements_by_xpath("/html/body/div[4]/section[2]/div/div[1]/div[4]/div/div[3]/div[*]/div/h3/a"):
    links = link.get_attribute("href")
    article_links1.append(links)
    
article_links = article_links1[:5]

article_links

This is a shortened list of the articles, so scraping doesn't take that long while testing. It contains 5 links; this is the output:

['https://www.reuters.com/article/idUSKCN2DM21B',
 'https://www.reuters.com/article/idUSL2N2NS20U',
 'https://www.reuters.com/article/idUSKCN2DM20N',
 'https://www.reuters.com/article/idUSKCN2DM21W',
 'https://www.reuters.com/article/idUSL3N2NS2F7']

Then I tried to iterate over the links and scrape the text out of the paragraphs, but it doesn't work.


for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
    full_text.append(article_text.text)
        
full_text   

The output is only the empty list:

[]

There are a couple of issues with your current code. The first one is an easy fix: you need to indent your second for loop so that it sits inside the for loop that iterates over the articles. Otherwise, nothing gets added to the full_text list until the driver reaches the last article. It should look like this:

for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
    
    for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
        full_text.append(article_text.text)

The second problem lies within your XPath. XPath expressions can get very long when they are generated automatically by a browser. (I'd suggest learning CSS selectors, which are much more concise. A good place to learn them is the CSS Diner.)

I've changed your find_elements_by_xpath() call to find_elements_by_css_selector(). You can see the example below.

for article_text in driver.find_elements_by_css_selector("article p"):
    full_text.append(article_text.text)
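
Putting both fixes together, a complete run might look like the sketch below. This is only a minimal example: the Chrome driver setup, the hard-coded list of links, and the 5-second sleep are assumptions carried over from your snippets, so adjust them to your own script.

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH

# shortened list of article links from the question
article_links = [
    'https://www.reuters.com/article/idUSKCN2DM21B',
    'https://www.reuters.com/article/idUSL2N2NS20U',
    'https://www.reuters.com/article/idUSKCN2DM20N',
    'https://www.reuters.com/article/idUSKCN2DM21W',
    'https://www.reuters.com/article/idUSL3N2NS2F7',
]

full_text = []

for article in article_links:
    driver.get(article)
    time.sleep(5)  # crude wait for the page to render

    # the inner loop is indented so every article contributes paragraphs
    for article_text in driver.find_elements_by_css_selector("article p"):
        full_text.append(article_text.text)

driver.quit()
print(full_text)

One more note: in newer Selenium 4 releases the find_elements_by_* helpers are no longer available; the equivalent call there is driver.find_elements(By.CSS_SELECTOR, "article p"), with By imported from selenium.webdriver.common.by.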
