
LinkedIn profile scraping using Selenium

I am trying to scrape LinkedIn profiles, but the profile URLs I collect are duplicated, because one URL can appear in several classes or tags. Could you please suggest how to find only one copy of the URL for each profile? Thanks.

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

options = Options()
options.add_argument("--start-maximized")

url = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin"
driver = webdriver.Chrome(r"path", options=options)  # r"path" is the path to chromedriver

driver.get(url)
driver.find_element_by_id('username').send_keys('email')
driver.find_element_by_id('password').send_keys('pass', Keys.ENTER)
sleep(10)
driver.find_element_by_class_name('search-global-typeahead__input').send_keys('CEO', Keys.ENTER)
driver.implicitly_wait(10)
driver.find_element_by_xpath('//button[text()="Люди"]').click()

x = 0
linklist = []
driver.execute_script("window.scrollTo(0, 1300);")
driver.implicitly_wait(10)
links = driver.find_elements_by_xpath('//a[contains(@href, "/in/")]')

for i in links:
    sleep(2)
    link = i.get_attribute('href')
    linklist.append(link)
print(linklist)

If you have duplicated values in your linklist, you can get the unique values by converting it into a set.

linklist = list(set(linklist))
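
Note that a set does not preserve the order in which the links were collected. If you want to keep the original order, a minimal order-preserving alternative (assuming Python 3.7+, where dict keys keep insertion order) is:

# dict.fromkeys() drops duplicates while keeping insertion order
linklist = list(dict.fromkeys(linklist))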

EDIT:

You are getting duplicate links because you are searching the entire website for links and, as you mentioned, these appear in different elements. You can get unique links by first searching for the name title of each member.

options = Options()
options.add_argument("--start-maximized")


url = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin"
driver = webdriver.Chrome(r"path", options=options)

driver.get(url)
driver.find_element_by_id('username').send_keys('email')
driver.find_element_by_id('password').send_keys('pass', Keys.ENTER)
sleep(10)
driver.find_element_by_class_name('search-global-typeahead__input').send_keys('CEO', Keys.ENTER)
driver.implicitly_wait(10)
driver.find_element_by_xpath('//button[text()="Люди"]').click()
sleep(5)  # Wait for the entire page to load
linkedin_members = driver.find_elements_by_xpath('//span[@class="entity-result__title"]')

You can then loop through the name titles and select the href within each element (note the leading . in .//a[@class="app-aware-link"]). You could use a try/except statement to find the hrefs of non-hidden profiles with .//a[contains(@href, "/in/")], but if that element doesn't exist it takes Selenium a while to figure that out. It is faster to select all hrefs and filter the hidden profiles out afterwards.

linklist = [linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('href')
            for linkedin_member in linkedin_members
            if "/in/" in linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('href')]
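
If you prefer the try/except approach mentioned above, a rough sketch (reusing the same linkedin_members list and the old find_element_by_* API from the answer) could look like this; keep in mind that each missing element waits out the implicit timeout before NoSuchElementException is raised, which is why it is slower:

from selenium.common.exceptions import NoSuchElementException

linklist = []
for linkedin_member in linkedin_members:
    try:
        # Only non-hidden profiles have an anchor whose href contains "/in/"
        link = linkedin_member.find_element_by_xpath('.//a[contains(@href, "/in/")]').get_attribute('href')
        linklist.append(link)
    except NoSuchElementException:
        # Hidden "LinkedIn Member" results have no profile link; skip them
        continue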
