Retrieving search results with Selenium, Python, and bs4

Question

I put together a script that retrieves search results from Sales Navigator in LinkedIn. The script, which uses Python, Selenium, and bs4, is as follows.

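import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium 3 style API (executable_path, find_element_by_tag_name)
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')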
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

browser.get(url1)
time.sleep(15)

parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')

search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))

time.sleep(5)
browser.quit()

Regardless of the number of results, the printed count was always 10; that is, only 10 results were returned. On inspecting the page source, I noticed the following:

The first 10 results sit at a different nesting level, and the rest are under a div tag with a style class named deferred area. Although the dt class name is the same for all search results (result-lockup__name), I am not able to access or retrieve the remaining ones because of the change in levels.

What would be the right way to retrieve all results in such a case?

EDIT 1

An example of how the tag levels are nested within li

And an example of the HTML of a result that is not being retrieved

EDIT 2

The page source, as requested:

https://pastebin.com/D11YpHGQ

Answer 1

Many sites do not display all search results on page load; they display more only when needed, e.g. when the visitor keeps scrolling, which indicates they want to view more.
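To see why the nesting level is probably not what blocks the selector, here is a minimal sketch using made-up markup. The HTML string, the `deferred-area` class spelling, and the company names are assumptions for illustration, not taken from the real page. A descendant selector such as `dt.result-lockup__name a` matches at any depth, so a result that is missing from the output is most likely missing from the HTML that was captured.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one result directly inside its <li>, one nested in a
# "deferred-area" wrapper, and one wrapper whose content has not loaded yet.
html = """
<ol>
  <li><dl><dt class="result-lockup__name"><a href="/company/1">Alpha</a></dt></dl></li>
  <li><div class="deferred-area"><dl>
        <dt class="result-lockup__name"><a href="/company/2">Beta</a></dt>
  </dl></div></li>
  <li><div class="deferred-area"></div></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# The selector finds both populated results, whatever their nesting depth;
# the empty wrapper contributes nothing because its content is not there.
names = [a.get_text(strip=True) for a in soup.select("dt.result-lockup__name a")]
print(names)  # ['Alpha', 'Beta']
```

If the real page behaves like the third `li` here, the fix is to get the browser to load the deferred content before grabbing the HTML, rather than to change the selector.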

We can use JavaScript to scroll to the bottom of the page for us with window.scrollTo(0,document.body.scrollHeight) (you may want to loop this if you expect hundreds of results). This forces all results onto the page, after which we can grab the HTML.

The code below should do the trick.

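import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium 3 style API (executable_path, find_element_by_tag_name)
browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')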
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

browser.get(url1)
time.sleep(15)
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(15)

parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')

search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
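For pages with many results, the single scroll can be repeated until the page stops growing. The following is a rough sketch of that loop, not a tested recipe for Sales Navigator. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical, and it assumes the results live in the main document scroll area rather than in an inner scrollable container.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's execute_script() method.
    Returns the number of scrolls performed (at most `max_rounds`).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded results time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new appeared after this scroll
        last_height = new_height
    return max_rounds  # safety cap reached; the page may still be growing
```

In the script above, `scroll_until_stable(browser)` would stand in for the single `execute_script` call and the second `time.sleep(15)`. If a site only loads items as they come into view, jumping straight to the bottom may skip some of them, in which case scrolling in smaller steps is worth trying.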

Solution 1
2 · Accepted · 2019-03-13 14:29:39
