
Web Scraping LinkedIn Profiles: Cannot pick all links

Following is the code being used:

from time import sleep

linkedin_urls = driver.find_elements_by_class_name('r')

sub = 'linkedin.com'

for linkedin_url in linkedin_urls:
    tag = linkedin_url.find_element_by_tag_name('a')
    URL = tag.get_attribute('href')

    if sub in URL:
        try:
            driver.get(URL)
            sleep(5)
            driver.back()
            driver.get(URL)
        except:
            pass

Following is the error I get:

Traceback (most recent call last):
  File "", line 25, in <module>
    tag = linkedin_url.find_element_by_tag_name('a')
  File "C:\Users\deepankar.garg\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webelement.py", line 305, in find_element_by_tag_name
    return self.find_element(by=By.TAG_NAME, value=name)
  File "C:\Users\deepankar.garg\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webelement.py", line 659, in find_element
    {"using": by, "value": value})['value']
  File "C:\Users\deepankar.garg\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\deepankar.garg\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\deepankar.garg\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: stale element reference: element is not attached to the page document
  (Session info: chrome=79.0.3945.79)

Following is the output before the IF condition:

https://www.linkedin.com/in/elena-grewal

https://www.quora.com/What-is-the-difference-between-Data-Science-and-Analytics

https://www.edureka.co/blog/what-is-data-science/

Following is the output after the IF condition:

https://www.linkedin.com/in/elena-grewal

https://in.linkedin.com/in/bsatya

https://www.linkedin.com/in/kylemckiou

I know what the error means, but I do not know how to resolve it. I just wish to open each link that passes the "if" (true) condition in a separate browser tab. The links in the "after IF" output above are the ones I wish to open, each in its own tab.

Any help would be really appreciated!

StaleElementReferenceException is raised when the element you hold a reference to is no longer attached to the page's DOM, i.e. it has become stale. In your scenario, navigating to the URL and then coming back reloads the page, so the elements you found earlier are stale by the time you try to access them again.

To resolve it, you need to fetch the element again before accessing it.
You can do it like:

linkedin_urls = driver.find_elements_by_class_name('r')

sub = 'linkedin.com'

i = 0
while i < len(linkedin_urls):
    tag = linkedin_urls[i].find_element_by_tag_name('a')
    URL = tag.get_attribute('href')
    i += 1

    if sub in URL:
        try:
            driver.get(URL)
            sleep(5)
            driver.back()
            driver.get(URL)
            # Fetch the elements again here: navigating away and back
            # invalidates the references found before navigation
            linkedin_urls = driver.find_elements_by_class_name('r')
        except:
            pass
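An alternative that avoids staleness entirely is to copy every href into a plain Python list of strings before navigating anywhere; once the strings are extracted, no WebElement is touched again, so nothing can go stale. A minimal sketch of the filtering step, using the sample URLs from the question (the CSS selector in the comment is an assumption about the page structure):

```python
# Sketch: extract all hrefs into plain strings first, then filter,
# so no stale WebElement is ever accessed after navigation.
def filter_linkedin_urls(hrefs, sub='linkedin.com'):
    """Keep only the URLs that contain the substring `sub`."""
    return [u for u in hrefs if u and sub in u]

# With Selenium the list would come from something like (hypothetical selector):
# hrefs = [e.get_attribute('href') for e in driver.find_elements_by_css_selector('.r a')]
hrefs = [
    'https://www.linkedin.com/in/elena-grewal',
    'https://www.quora.com/What-is-the-difference-between-Data-Science-and-Analytics',
    'https://www.edureka.co/blog/what-is-data-science/',
]
print(filter_linkedin_urls(hrefs))
# prints ['https://www.linkedin.com/in/elena-grewal']
```

Because the filtering works on strings, it is also trivial to unit-test without a browser.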

Finally!

I got the solution. Following is what I tried, and it worked:

all_urls = driver.find_elements_by_css_selector("div > a")

urls = []
for elem in all_urls:
    text = elem.text
    url = elem.get_property('href')
    if "linkedin.com" in text:
        urls.append(url)

# print(urls)
for url in urls:
    driver.get(url)
    sleep(2)
    print(url)
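Since the question asks for each link in a separate tab, a variant of the final loop could open the collected URLs via JavaScript's `window.open` (a common Selenium pattern) instead of reusing the current tab. This is a sketch, assuming `driver` is the same webdriver instance as above:

```python
# Sketch: open each collected URL in its own browser tab via window.open,
# instead of loading them one after another in the same tab.
def open_in_new_tabs(driver, urls):
    for url in urls:
        # arguments[0] is the url passed from Python into the page's JS
        driver.execute_script("window.open(arguments[0]);", url)

# Afterwards, focus the most recently opened tab with:
# driver.switch_to.window(driver.window_handles[-1])
```

The function only depends on `execute_script`, so it can be exercised with a stub driver in tests.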

Thanks everyone for your help!
