LinkedIn profile scraping using Selenium
I am trying to scrape LinkedIn profiles, but when I collect the profile URLs they come back duplicated, because a single URL can appear in several classes or tags. Could you please suggest how to find only one copy of the URL for each profile? Thanks.
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

options = Options()
options.add_argument("--start-maximized")

url = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin"
driver = webdriver.Chrome(r"path", options=options)
driver.get(url)

# Log in
driver.find_element_by_id('username').send_keys('email')
driver.find_element_by_id('password').send_keys('pass', Keys.ENTER)
sleep(10)

# Search for "CEO" and click the "People" tab ("Люди" in a Russian-language UI)
driver.find_element_by_class_name('search-global-typeahead__input').send_keys('CEO', Keys.ENTER)
driver.implicitly_wait(10)
driver.find_element_by_xpath('//button[text()="Люди"]').click()

# Scroll down and collect every link whose href contains "/in/"
linklist = []
driver.execute_script("window.scrollTo(0, 1300);")
driver.implicitly_wait(10)
links = driver.find_elements_by_xpath('//a[contains(@href, "/in/")]')
for i in links:
    sleep(2)
    link = i.get_attribute('href')
    linklist.append(link)
print(linklist)
If you have duplicated values in your linklist, you can get the unique values by converting them into a set:
linklist=list(set(linklist))
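Note that converting to a set does not preserve the order in which the links were scraped. If order matters, a `dict.fromkeys` round-trip (dict keys are unique and keep insertion order since Python 3.7) deduplicates while keeping the first occurrence. The sample URLs below are made up for illustration; in practice they would come from `get_attribute('href')`:

```python
# Hypothetical scraped links, with one duplicate
linklist = [
    "https://www.linkedin.com/in/alice/",
    "https://www.linkedin.com/in/bob/",
    "https://www.linkedin.com/in/alice/",
]

# dict keys are unique and preserve insertion order (Python 3.7+)
linklist = list(dict.fromkeys(linklist))
print(linklist)  # ['https://www.linkedin.com/in/alice/', 'https://www.linkedin.com/in/bob/']
```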
EDIT:
You are getting duplicate links because you are searching the entire page for links and, as you mentioned, these links are present in several different elements. You can get unique links by first searching for the name title of each member.
options = Options()
options.add_argument("--start-maximized")
url = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin"
driver = webdriver.Chrome(r"path", options=options)
driver.get(url)
driver.find_element_by_id('username').send_keys('email')
driver.find_element_by_id('password').send_keys('pass', Keys.ENTER)
sleep(10)
driver.find_element_by_class_name('search-global-typeahead__input').send_keys('CEO', Keys.ENTER)
driver.implicitly_wait(10)
driver.find_element_by_xpath('//button[text()="Люди"]').click()
sleep(5)  # Wait for the entire page to load
linkedin_members = driver.find_elements_by_xpath('//span[@class="entity-result__title"]')
You can then loop through the name titles and select the href within each element (note the leading . in .//a[@class="app-aware-link"], which restricts the search to that element). You could use a try/except statement to find only the hrefs of non-hidden profiles using .//a[contains(@href, "/in/")], but if that element doesn't exist, it takes Selenium a while to figure that out. It is faster to select all hrefs and filter out the hidden profiles afterwards.
linklist = []
for linkedin_member in linkedin_members:
    href = linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('href')
    if "/in/" in href:
        linklist.append(href)
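One more source of duplicates worth guarding against: the same profile can appear with different query strings (e.g. tracking parameters), so two distinct hrefs may point to one profile. A minimal sketch, assuming the canonical part of a profile URL is scheme + host + path, that strips the query string and fragment before deduplicating (the sample URLs are made up):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    """Drop the query string and fragment so tracking parameters don't create duplicates."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Hypothetical hrefs: the same profile, once with a tracking parameter
links = [
    "https://www.linkedin.com/in/alice/?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3A123",
    "https://www.linkedin.com/in/alice/",
]

# Canonicalize, then deduplicate while preserving order
unique = list(dict.fromkeys(canonical(u) for u in links))
print(unique)  # ['https://www.linkedin.com/in/alice/']
```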