Retrieve search results selenium python bs4

I put together a script that retrieves search results from Sales Navigator in LinkedIn. The script below uses Python, Selenium, and bs4.

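from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Written against the Selenium 3 API; newer Selenium 4 releases drop
# executable_path and the find_element_by_* helpers.
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

browser.get(url1)
time.sleep(15)  # wait for the page to load

# Hand the rendered HTML to BeautifulSoup
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')

# Each result name is a link inside a dt with class result-lockup__name
search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))

time.sleep(5)
browser.quit()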

Regardless of the number of results, the printed count was always 10; that is, only 10 results were returned. On further investigation of the page source, I noticed the following:

[image]

The first 10 results sit at a different level of the markup, and the rest are under a div tag whose style class is named deferred area. Although the dt class name is the same for all the search results (result-lockup__name), the change in levels means I am not able to access or retrieve them.

What would be the right way to retrieve all results in such a case?

EDIT 1

An example of how the tag levels are nested within li: [image]

And an example of the HTML of a result that is not being retrieved: [image]

EDIT 2

The page source, as requested:

https://pastebin.com/D11YpHGQ

Many sites do not render all search results on page load. They render more only when needed, e.g. when the visitor keeps scrolling, which signals that they want to see more.
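As a quick check that the selector itself is not the limiting factor, here is a minimal sketch run against made-up markup. The HTML, the company names, and the `deferred-area` class spelling are assumptions for illustration, not taken from the real page. A selector of the form `dt.result-lockup__name a` is a descendant selector, so it matches at any nesting depth; an element it misses is most likely not in the HTML yet.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one result at the top level, one nested inside a
# "deferred" wrapper, and one deferred placeholder that has not loaded yet.
html = """
<ol>
  <li><dl><dt class="result-lockup__name"><a href="/company/1">Alpha</a></dt></dl></li>
  <li><div class="deferred-area"><div><dl>
      <dt class="result-lockup__name"><a href="/company/2">Beta</a></dt>
  </dl></div></div></li>
  <li><div class="deferred-area"></div></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# The descendant selector finds both loaded results, regardless of depth.
names = [a.get_text(strip=True) for a in soup.select("dt.result-lockup__name a")]
print(names)
```

The empty placeholder contributes nothing, which is consistent with a count that stays at 10 until the remaining results are actually rendered.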

We can use JavaScript to scroll to the bottom of the page with window.scrollTo(0,document.body.scrollHeight), forcing the remaining results to load, after which we can grab the HTML. If you expect hundreds of results, you may want to run the scroll in a loop.

The following should do the trick.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Written against the Selenium 3 API, as in the question.
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

browser.get(url1)
time.sleep(15)  # wait for the initial page load
# Scroll to the bottom so the deferred results are rendered
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(15)  # wait for the newly loaded results

parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')

search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
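If a single scroll is not enough, the loop mentioned above could look roughly like the sketch below. The function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical, chosen for illustration. It assumes only that `driver` has an `execute_script` method, as a Selenium WebDriver does, and it stops once the page height no longer grows or the round limit is reached.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed. `pause` and `max_rounds`
    are illustrative defaults, not tuned for any particular site.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give lazily loaded results time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so assume everything is loaded
        last_height = new_height
    return rounds
```

In the script above, a call such as `scroll_until_stable(browser)` would take the place of the single `execute_script` call and the sleep that follows it. Pages that load items only when they enter the viewport may need smaller scroll steps than a jump straight to the bottom.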
