
Scraping profiles with Python and the “scrape-linkedin” package

I am trying to use the scrape_linkedin package. I followed the section on the GitHub page on how to set up the package and the LinkedIn li_at key (which I paste here for clarity).

Getting LI_AT
Navigate to www.linkedin.com and log in
Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
Find and copy the li_at value

Once I collect the li_at value from my LinkedIn, I run the following code:

from scrape_linkedin import ProfileScraper

with ProfileScraper(cookie='myVeryLong_li_at_Code_which_has_characters_like_AQEDAQNZwYQAC5_etc') as scraper:
    profile = scraper.scrape(url='https://www.linkedin.com/in/justintrudeau/')
print(profile.to_dict())

I have two questions (I am originally an R user).

  1. How can I input a list of profiles:

    https://www.linkedin.com/in/justintrudeau/

    https://www.linkedin.com/in/barackobama/

    https://www.linkedin.com/in/williamhgates/

    https://www.linkedin.com/in/wozniaksteve/

and scrape the profiles? (In R I would use the map function from the purrr package to apply the function to each of the LinkedIn profiles.)

  2. The output (from the original GitHub page) is returned in a JSON-style format. My second question is how I can convert this into a pandas data frame (i.e. it is returned similar to the following).

{'personal_info': {'name': 'Steve Wozniak', 'headline': 'Fellow at Apple', 'company': None, 'school': None, 'location': 'San Francisco Bay Area', 'summary': '', 'image': '', 'followers': '', 'email': None, 'phone': None, 'connected': None, 'websites': [], 'current_company_link': 'https://www.linkedin.com/company/sandisk/'}, 'experiences': {'jobs': [{'title': 'Chief Scientist', 'company': 'Fusion-io', 'date_range': 'Jul 2014 – Present', 'location': 'Primary Data', 'description': "I'm looking into future technologies applicable to servers and storage, and helping this company, which I love, get noticed and get a lead so that the world can discover the new amazing technology they have developed. My role is principally a marketing one at present but that will change over time.", 'li_company_url': 'https://www.linkedin.com/company/sandisk/'}, {'title': 'Fellow', 'company': 'Apple', 'date_range': 'Mar 1976 – Present', 'location': '1 Infinite Loop, Cupertino, CA 94015', 'description': 'Digita l Design engineer.', 'li_company_url': ''}, {'title': 'President & CTO', 'company': 'Wheels of Zeus', 'date_range': '2002 – 2005', 'location': None, 'description': None, 'li_company_url': 'https://www.linkedin.com/company/wheels-of-zeus/'}, {'title': 'diagnostic programmer', 'company': 'TENET Inc.', 'date_range': '1970 – 1971', 'location': None, 'description': None, 'li_company_url': ''}], 'education': [{'name': 'University of California, Berkeley', 'degree': 'BS', 'grades': None, 'field_of_study': 'EE & CS', 'date_range': '1971 – 1986', 'activities': None}, {'name': 'University of Colorado Boulder', 'degree': 'Honorary PhD.', 'grades': None, 'field_of_study': 'Electrical and Electronics Engineering', 'date_range': '1968 – 1969', 'activities': None}], 'volunteering': []}, 'skills': [], 'accomplishments': {'publications': [], 'certifications': [], 'patents': [], 'courses': [], 'projects': [], 'honors': [], 'test_scores': [], 'languages': [], 'organizations': []}, 'interests': 
['Western Digital', 'University of Colorado Boulder', 'Western Digital Data Center Solutions', 'NEW Homebrew Computer Club', 'Wheels of Zeus', 'SanDisk®']}

Firstly, you can create a custom function that scrapes a single profile, then use Python's built-in map function to apply it to each profile link.
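As a rough sketch of that idea: the helper name scrape_profiles below is made up for illustration, and it only assumes the scraper object has the scrape(url=...) method and the result has the to_dict() method shown in the question. I have not verified how the package behaves on failed or rate-limited requests, so treat this as a starting point.

```python
# The profile links from the question
profile_urls = [
    'https://www.linkedin.com/in/justintrudeau/',
    'https://www.linkedin.com/in/barackobama/',
    'https://www.linkedin.com/in/williamhgates/',
    'https://www.linkedin.com/in/wozniaksteve/',
]


def scrape_profiles(urls, scraper):
    """Hypothetical helper: scrape every URL with one already-open scraper.

    `scraper` is assumed to expose scrape(url=...) returning an object
    with to_dict(), as in the question's code.
    """
    def scrape_one(url):
        return scraper.scrape(url=url).to_dict()

    # map applies scrape_one to each URL (similar to purrr::map in R);
    # list() forces the lazy map so scraping happens while the scraper is open
    return list(map(scrape_one, urls))


# Usage sketch (needs scrape_linkedin installed and a valid li_at cookie):
#
# from scrape_linkedin import ProfileScraper
#
# with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
#     profiles = scrape_profiles(profile_urls, scraper)
```

Calling list() inside the helper matters because map is lazy in Python 3; without it, the actual scraping could be deferred until after the with block has already closed the scraper.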

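Secondly, to create a pandas dataframe from a dictionary, you can pass the dictionary to pd.DataFrame. This works directly when the dictionary is flat (each key maps to a column of values); the profile output above is nested, so it will probably need flattening first, for example with pd.json_normalize.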

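Thus, to create a dataframe df from a dictionary (called profile_dict here, since naming it dict would shadow the Python built-in), you can do this:

import pandas as pd

df = pd.DataFrame(profile_dict)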
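For a nested dictionary like the sample output in the question, a plain pd.DataFrame call may not give a useful table. A minimal sketch of one alternative, using a trimmed-down copy of that sample (the variable names are my own, and which tables you actually want is an assumption on my part):

```python
import pandas as pd

# Trimmed-down copy of the sample output from the question
profile = {
    'personal_info': {
        'name': 'Steve Wozniak',
        'headline': 'Fellow at Apple',
        'location': 'San Francisco Bay Area',
    },
    'experiences': {
        'jobs': [
            {'title': 'Chief Scientist', 'company': 'Fusion-io',
             'date_range': 'Jul 2014 – Present'},
            {'title': 'Fellow', 'company': 'Apple',
             'date_range': 'Mar 1976 – Present'},
        ],
    },
}

# One row per profile: nested dict keys become dotted column names
# such as 'personal_info.name'; lists (e.g. the jobs) stay in a single cell
flat_df = pd.json_normalize(profile)

# One row per job: build the frame from the list of job dicts,
# then tag each row with the profile owner's name
jobs_df = pd.DataFrame(profile['experiences']['jobs'])
jobs_df['name'] = profile['personal_info']['name']

print(flat_df.columns.tolist())
print(jobs_df)
```

With several scraped profiles, passing the whole list of dictionaries to pd.json_normalize should give one row per profile, and the per-job frames can be combined with pd.concat.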

