LinkedIn profile name scraping
I have been trying to scrape just the profile name from a batch of LinkedIn URLs I have. I am using bs4 with Python, but no matter what I do, bs4 returns an empty array. What is going wrong?
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
r1 = requests.get("https://www.linkedin.com/in/agazdecki/")
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
name_container = soup1.find_all("li", class_ = "inline t-24 t-black t-normal break-words")
print(name_container)
If you try loading the page with JavaScript disabled, you will see that the element you are looking for does not exist. In other words, the entire LinkedIn page is rendered by JavaScript (like a single-page application). BeautifulSoup is actually working as intended: it parses the page it receives, which contains only the bootstrap JavaScript code below rather than the page you expected.
>>> coverpage = r1.content
>>> coverpage
b'<html><head>\n<script type="text/javascript">\nwindow.onload =
function() {\n // Parse the tracking code from cookies.\n var trk =
"bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split(";
");\n for (var i = 0; i < cookies.length; ++i) {\n if
((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n
trk = cookies[i].substring(8);\n }\n else if
((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n
trkInfo = cookies[i].substring(8);\n }\n }\n\n if
(window.location.protocol == "http:") {\n // If "sl" cookie is set,
redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n
if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n
window.location.href = "https:" +
window.location.href.substring(window.location.protocol.length);\n
return;\n }\n }\n }\n\n // Get the new domain. For international
domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n
var domain = "www.linkedin.com";\n if (domain != location.host) {\n
var subdomainIndex = location.host.indexOf(".linkedin");\n if
(subdomainIndex != -1) {\n domain = "www" +
location.host.substring(subdomainIndex);\n }\n }\n\n
window.location.href = "https://" + domain + "/authwall?trk=" + trk +
"&trkInfo=" + trkInfo +\n "&originalReferer=" +
document.referrer.substr(0, 200) +\n "&sessionRedirect=" +
encodeURIComponent(window.location.href);\n}\n</script>\n</head></html>'
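You can verify this offline: the HTML that requests receives never contains the class the original code searches for, only the redirect script. A minimal sketch, using a shortened stand-in for the response body shown above:

```python
# Shortened stand-in for the response body shown above (assumed shape).
bootstrap_html = (
    '<html><head><script type="text/javascript">'
    'window.onload = function() { /* authwall redirect */ };'
    '</script></head></html>'
)

# The class the original code searches for never appears in the raw response:
print("inline t-24 t-black t-normal break-words" in bootstrap_html)  # False
# All the server sends back is the redirect script:
print("<script" in bootstrap_html)  # True
```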
You could try something like Selenium instead.
First mistake: you are fetching the page with a plain request, but you need to be logged in first, so you should use a session.
Second mistake: you are using a CSS selector to target an element that is generated dynamically by JavaScript and rendered by the browser. If you look at the page source, you will not find that li tag or class, nor the profile name, anywhere except inside a JSON object embedded in a code tag.
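The code-tag approach can be illustrated with only the standard library. The HTML string below is a hypothetical, simplified stand-in for what LinkedIn actually embeds, and the regex is just a sketch of the extraction step:

```python
import json
import re

# Hypothetical, simplified stand-in for LinkedIn's embedded <code> payload.
html = ('<code>{"data":{"firstName":"Andrew","lastName":"Gazdecki",'
        '"occupation":"Chief Revenue Officer"}}</code>')

# Pull the JSON payload out of the <code> tag, then parse it.
match = re.search(r'<code>(\{"data":.*?\})</code>', html)
data = json.loads(match.group(1))
print(data['data']['firstName'], data['data']['lastName'])  # Andrew Gazdecki
```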
Assuming you are using a session:
import requests, re, json
from bs4 import BeautifulSoup

session = requests.Session()  # a Session keeps cookies across requests
r1 = session.get("https://www.linkedin.com/in/agazdecki/",
                 headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"})
soup = BeautifulSoup(r1.content, 'html.parser')
# The profile data is embedded as JSON inside a <code> tag
info_tag = soup.find('code', text=re.compile('"data":{"firstName":'))
data = json.loads(info_tag.text)
first_name = data['data']['firstName']
last_name = data['data']['lastName']
occupation = data['data']['occupation']
print('First Name :', first_name)
print('Last Name :', last_name)
print('occupation :', occupation)
Output:
First Name : Andrew
Last Name : Gazdecki
occupation : Chief Revenue Officer @ Spiff. Inc. 30 under 30 Entrepreneur.
I suggest using Selenium to scrape the data.
Download the Chrome WebDriver from here.
from selenium import webdriver

driver = webdriver.Chrome("Path to your Chrome WebDriver")

# Log in through the WebDriver first
driver.get('https://www.linkedin.com/login?trk=guest_homepage-basic_nav-header-signin')
username = driver.find_element_by_id('username')
username.send_keys('your email_id here')
password = driver.find_element_by_id('password')
password.send_keys('your password here')
sign_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
sign_in_button.click()

# Once logged in, the profile page renders fully, including the name element
driver.get('https://www.linkedin.com/in/agazdecki/')  # change profile_url here
name = driver.find_element_by_xpath('//li[@class = "inline t-24 t-black t-normal break-words"]').text
print(name)