I'm trying to web scrape Forbes business, but when I request the URL it doesn't give me the correct JSON data
I'm using Python. My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.forbes.com/business'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
print(soup)
And this is what it returns:
<!DOCTYPE html>
<html lang="en">
<head>
<meta content="en_US" http-equiv="Content-Language"/>
<script type="text/javascript">
(function () {
function isValidUrl(toURL) {
// Regex taken from welcome ad.
return (toURL || '').match(/^(?:https?:?\/\/)?(?:[^.(){}\\\/]*)?\.?forbes\.com(?:\/|\?|$)/i);
}
function getUrlParameter(name) {
name = name.replace(/[\[]/, '\\[').replace(/[\]]/, '\\]');
var regex = new RegExp('[\\?&]' + name + '=([^&#]*)');
var results = regex.exec(location.search);
return results === null ? '' : decodeURIComponent(results[1].replace(/\+/g, ' '));
};
function consentIsSet(message) {
console.log(message);
var result = JSON.parse(message.data);
if(result.message == "submit_preferences"){
var toURL = getUrlParameter("toURL");
if(!isValidUrl(toURL)){
toURL = "https://www.forbes.com/";
}
location.href=toURL;
}
}
var apiObject = {
PrivacyManagerAPI:
{
action: "getConsent",
timestamp: new Date().getTime(),
self: "forbes.com"
}
};
var json = JSON.stringify(apiObject);
window.top.postMessage(json,"*");
window.addEventListener("message", consentIsSet, false);
})();
</script>
</head>
<body><div id="teconsent">
<script async="async" crossorigin="" src="//consent.truste.com/notice?domain=forbes.com&c=teconsent" type="text/javascript"></script>
</div>
</body></html>
I'm sure I'm doing something very obviously wrong, or I need a header of some sort, but I'm new to this and don't completely understand it, so I would appreciate any help! Thanks
One of the first things to do when you want to scrape a website is to see what happens when you switch off JavaScript. You can do this in Chrome by inspecting the page and opening the settings (the three-dots menu). In this case, it doesn't look like the page depends heavily on JavaScript.
The other thing to consider is whether the request needs headers, cookies, or parameters. In this case, you need to send headers with the HTTP request:
headers = {
'authority': 'www.forbes.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'client_id=4f023ae61ab633d8e7e2410a838a6ef93b8; notice_preferences=2:1a8b5228dd7ff0717196863a5d28ce6c; notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c; _cb_ls=1; _cb=DV5nM5D9GHBDCzpjkB; _ga=GA1.2.611301459.1592235761; __tbc=%7Bjzx%7DIFcj-ZhxuNCMjI4-mDfH1HGM-3PFKcN8Miwl1Jhx9eZNEmuQGlLmxXFL-9qM-F_OBO51AtKdJ3qgOfi3P9vM0qBHA3PyvmasSB5xaCbWibdU2meZrLoZ92gJ8xiw07mk3E9l5ifC0NcYbET3aSZxuA; xbc=%7Bjzx%7DGUDHEU3rvhv6-gySw5OY32YdbGDIZI_hJ7AHN4OvkbydVClZ3QNjNrlQVyHGl3ynSJzzGsKf0w3VfH3le6pYqMAfTQAzgDTJbUHa-cJS7p3ITwLt3PmPKKvsIVyFnHji; __gads=ID=a5ac1829fa387f90:T=1592235777:S=ALNI_MZfOqlh-TglrQCWbFNtjcjgFfkMGQ; _fbp=fb.1.1592235779290.60238202; __qca=P0-59264648-1592235777617; xdibx=N4Ig-mBGAeDGCuAnRIBcoAOGAuBnNAjAKwCcATGQMxEDsAHHQAyUkA0IGAbrAHbaHtc-VMXJVaDZmw6dcvfiPaIkAGzQgAFtmwZcqAPT6A7iYB0AMwD2iSAFNcp2JYC2-3AEts9.c.cBrW0sALwBDHncw.TJGaP1GADZ9Yn0eWyNYENxsFVsAWnhwrwATXNwQnNyQrERLTnLK3PMVdwxcy3Nc7A08p3cQdhVVdTdPb18A4LCIniiYxjjE5NT0zOy8gtGSsoqqjBq6lQamlraOrp7Ldx5SkIBPXFy9219bRFyckIBzeDz-kBU8IRSBRqPQmCwAL7sCAwJ6cNCgIp3YQAbVEIIkTAALDQALpQ8BQaC2Ti2PjCUDRBKUSgIkDw9AgWACEAKNHA8T0EiMEiUfGCOnM1CMdhs.kgFCMoUi1loFHioqCtAysUEoWgaWiuX4gnRMhYxiMOkMjUstnozl0ch0IjiilM5Va1DygmS03Cp0u9iKqWO2XO8Xqh0e.0uiEEmFwdw-kAhLHRLEkIodWyQEKwXJYrHxeK5SBEG2Z2A0EJFSCUMsJDMW0F0GhY4ggCFAA__; _chartbeat2=.1592235763483.1592235790833.1.CZMImKDrkr6iBB9QkcCHzJBoDWn8ZI.3',}
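The cookie entry in the headers above is one long string, which is hard to read and edit. As a rough sketch (the helper name `split_cookie_header` and the shortened sample string are made up for illustration, not part of the original answer), it could be split into a dict like this:

```python
def split_cookie_header(cookie_header):
    """Split a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, since values may contain '=' themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Shortened, made-up sample in the same shape as the copied cookie string.
sample = "notice_preferences=2:abc; notice_gdpr_prefs=0,1,2:abc; _cb_ls=1"
cookies = split_cookie_header(sample)
print(cookies)
```

If you go this way, the resulting dict can be passed as `requests.get(url, headers=headers, cookies=cookies)`, with the 'cookie' key left out of `headers`.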
You can find these headers by right-clicking the page, choosing Inspect, opening the Network tab, and selecting the document ("Doc") request. Copy that request as cURL (bash) and paste it into curl.trillworks.com, which converts it to Python and gives you nicely formatted headers. Putting it together:
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'www.forbes.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'client_id=4f023ae61ab633d8e7e2410a838a6ef93b8; notice_preferences=2:1a8b5228dd7ff0717196863a5d28ce6c; notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c; _cb_ls=1; _cb=DV5nM5D9GHBDCzpjkB; _ga=GA1.2.611301459.1592235761; __tbc=%7Bjzx%7DIFcj-ZhxuNCMjI4-mDfH1HGM-3PFKcN8Miwl1Jhx9eZNEmuQGlLmxXFL-9qM-F_OBO51AtKdJ3qgOfi3P9vM0qBHA3PyvmasSB5xaCbWibdU2meZrLoZ92gJ8xiw07mk3E9l5ifC0NcYbET3aSZxuA; xbc=%7Bjzx%7DGUDHEU3rvhv6-gySw5OY32YdbGDIZI_hJ7AHN4OvkbydVClZ3QNjNrlQVyHGl3ynSJzzGsKf0w3VfH3le6pYqMAfTQAzgDTJbUHa-cJS7p3ITwLt3PmPKKvsIVyFnHji; __gads=ID=a5ac1829fa387f90:T=1592235777:S=ALNI_MZfOqlh-TglrQCWbFNtjcjgFfkMGQ; _fbp=fb.1.1592235779290.60238202; __qca=P0-59264648-1592235777617; xdibx=N4Ig-mBGAeDGCuAnRIBcoAOGAuBnNAjAKwCcATGQMxEDsAHHQAyUkA0IGAbrAHbaHtc-VMXJVaDZmw6dcvfiPaIkAGzQgAFtmwZcqAPT6A7iYB0AMwD2iSAFNcp2JYC2-3AEts9.c.cBrW0sALwBDHncw.TJGaP1GADZ9Yn0eWyNYENxsFVsAWnhwrwATXNwQnNyQrERLTnLK3PMVdwxcy3Nc7A08p3cQdhVVdTdPb18A4LCIniiYxjjE5NT0zOy8gtGSsoqqjBq6lQamlraOrp7Ldx5SkIBPXFy9219bRFyckIBzeDz-kBU8IRSBRqPQmCwAL7sCAwJ6cNCgIp3YQAbVEIIkTAALDQALpQ8BQaC2Ti2PjCUDRBKUSgIkDw9AgWACEAKNHA8T0EiMEiUfGCOnM1CMdhs.kgFCMoUi1loFHioqCtAysUEoWgaWiuX4gnRMhYxiMOkMjUstnozl0ch0IjiilM5Va1DygmS03Cp0u9iKqWO2XO8Xqh0e.0uiEEmFwdw-kAhLHRLEkIodWyQEKwXJYrHxeK5SBEG2Z2A0EJFSCUMsJDMW0F0GhY4ggCFAA__; _chartbeat2=.1592235763483.1592235790833.1.CZMImKDrkr6iBB9QkcCHzJBoDWn8ZI.3',
}
response = requests.get('https://www.forbes.com/business/', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
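To check whether the headers actually got you past the consent page, one option is to test the returned HTML before parsing it further. The sketch below is a heuristic, not a definitive test: it assumes the interstitial is recognisable by the `teconsent` div seen in the question's output combined with almost no visible text. The names `is_consent_page` and `max_visible_chars`, and the 200-character threshold, are hypothetical choices. It uses only the standard library so it runs without a network connection.

```python
from html.parser import HTMLParser


class _PageProbe(HTMLParser):
    """Records whether a 'teconsent' element exists and counts visible text."""

    def __init__(self):
        super().__init__()
        self.has_consent_div = False
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if dict(attrs).get("id") == "teconsent":
            self.has_consent_div = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a browser would render.
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def is_consent_page(html, max_visible_chars=200):
    """Heuristic: consent div present and (almost) no visible text."""
    probe = _PageProbe()
    probe.feed(html)
    probe.close()
    return probe.has_consent_div and probe.visible_chars <= max_visible_chars


# Made-up samples: one shaped like the interstitial, one like a real page.
consent_html = (
    '<html><head><script>var a = 1;</script></head><body>'
    '<div id="teconsent"><script async="async" '
    'src="//consent.truste.com/notice?domain=forbes.com"></script></div>'
    '</body></html>'
)
article_html = (
    '<html><body><h1>Business</h1><p>' + 'Some headline text. ' * 20 +
    '</p><div id="teconsent"></div></body></html>'
)
print(is_consent_page(consent_html), is_consent_page(article_html))
```

In the request code above you could call `is_consent_page(response.text)` and, if it returns True, re-copy fresh headers and cookies from the browser, since the copied cookie values may expire.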