LinkedIn scraping not getting all data

From a linkedin site like: https://www.linkedin.com/company/10073529?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A10073529%2Cidx%3A1-1-1%2CtarId%3A1461132316737%2Ctas%3Adastrong%20

I am trying to retrieve the link associated with data-li-miniprofile-id, i.e. the href of anchors that look like:

<a class="new-miniprofile-container" href="..." data-li-url="..." data-li-miniprofile-id="...">

which sit several levels deep under a chain of parent elements.

This is what my code looks like so far:

import requests
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/company/10073529"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# Print the href of every anchor on the page
for link in soup.find_all("a"):
    print(link.get('href'))

I initially just looked for a class="new-miniprofile-container", but it returned an empty list. I think the reason is that when I ran soup.prettify() (which prints all of the HTML that was scraped), the output simply doesn't contain any of the nested children content beyond a certain point.
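For reference, a minimal sketch of that class-based lookup with BeautifulSoup (assuming the anchors were actually present in the fetched HTML) would look like:

# Find every anchor with the target class; returns [] if the markup
# is missing from the HTML that LinkedIn actually served
anchors = soup.find_all("a", class_="new-miniprofile-container")
for a in anchors:
    # Pull the attributes described above; .get() returns None if an attribute is absent
    print(a.get("href"), a.get("data-li-url"), a.get("data-li-miniprofile-id"))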

I suspect the problem is related to the security blocks set up by LinkedIn engineers, but I want to know whether there is a way to get those URLs, or whether there are other options for getting them.
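One rough way to check that hypothesis is to inspect the response that was already fetched above and see whether the target markup is being served at all (the checks below are only a sketch; what a blocked or login-gated response actually contains may differ):

# Quick diagnostic on the response fetched earlier
print(response.status_code)                          # 200, or a redirect / block status?
print(len(response.text))                            # a very small body often means a login wall
print("new-miniprofile-container" in response.text)  # is the target markup in the raw HTML at all?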

You should be using the LinkedIn REST API instead. There are relevant company-profile endpoints, and you can experiment with them in the REST API explorer. There is also a python-linkedin client, which has the Company API documented as well.
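As a rough sketch of what that could look like with the python-linkedin client (the key, secret, and return URL are placeholders; the company id is taken from the URL in the question; check the client's documentation for the exact selector names and OAuth flow):

from linkedin import linkedin

API_KEY = "your-api-key"        # placeholder credentials from your LinkedIn app
API_SECRET = "your-api-secret"
RETURN_URL = "http://localhost:8000"

# Standard OAuth flow: open the authorization URL in a browser, then set
# authentication.authorization_code from the redirect before requesting a token
authentication = linkedin.LinkedInAuthentication(
    API_KEY, API_SECRET, RETURN_URL, linkedin.PERMISSIONS.enums.values()
)
print(authentication.authorization_url)
authentication.get_access_token()

application = linkedin.LinkedInApplication(authentication)

# Company API: 10073529 is the company id from the URL in the question
company = application.get_companies(
    company_ids=[10073529],
    selectors=['id', 'name', 'universal-name', 'website-url'],
)
print(company)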
