
I'm trying to web-scrape Forbes business, but when I request the URL it doesn't give me the correct JSON data

I'm using Python; my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.forbes.com/business'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)

and this is what it returns:

<!DOCTYPE html>
<html lang="en">
<head>
<meta content="en_US" http-equiv="Content-Language"/>
<script type="text/javascript">
                        (function () {
                                function isValidUrl(toURL) {
                                        // Regex taken from welcome ad.
                                        return (toURL || '').match(/^(?:https?:?\/\/)?(?:[^.(){}\\\/]*)?\.?forbes\.com(?:\/|\?|$)/i);
                                }

                                function getUrlParameter(name) {
                                        name = name.replace(/[\[]/, '\\[').replace(/[\]]/, '\\]');
                                        var regex = new RegExp('[\\?&]' + name + '=([^&#]*)');
                                        var results = regex.exec(location.search);
                                        return results === null ? '' : decodeURIComponent(results[1].replace(/\+/g, ' '));
                                };

                                function consentIsSet(message) {
                                        console.log(message);
                                        var result = JSON.parse(message.data);
                                        if(result.message == "submit_preferences"){
                                                var toURL = getUrlParameter("toURL");
                                                if(!isValidUrl(toURL)){
                                                        toURL = "https://www.forbes.com/";
                                                }
                                                location.href=toURL;
                                        }
                                }

                                var apiObject = {
                                        PrivacyManagerAPI:
                                        {
                                                action: "getConsent",
                                                timestamp: new Date().getTime(),
                                                self: "forbes.com"
                                        }
                                };
                                var json = JSON.stringify(apiObject);
                                window.top.postMessage(json,"*");
                                window.addEventListener("message", consentIsSet, false);
                        })();
                </script>
</head>
<body><div id="teconsent">
<script async="async" crossorigin="" src="//consent.truste.com/notice?domain=forbes.com&amp;c=teconsent" type="text/javascript"></script>
</div>
</body></html>

I'm sure I'm doing something very obviously wrong, or I need a header of some sort, but I'm new to this and don't completely understand it, so I would appreciate any help! Thanks

One of the first steps when you want to scrape a website is to see what happens when you switch off JavaScript. You can do this in Chrome by inspecting the page and opening the DevTools settings (the three dots). In this case it doesn't look like JavaScript is being used heavily.
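You can also check this programmatically: if a JavaScript-free fetch (which is what `requests` does) returns the consent interstitial shown in the question rather than the article listing, plain HTML parsing alone won't work. A minimal sketch follows; the `teconsent` marker comes from the HTML pasted above, and treating it as a reliable indicator is an assumption:

```python
from bs4 import BeautifulSoup

def looks_like_consent_page(html):
    """Return True if the HTML looks like the consent interstitial
    rather than the real Forbes page."""
    soup = BeautifulSoup(html, 'html.parser')
    # The interstitial in the question contains <div id="teconsent">.
    return soup.find(id='teconsent') is not None
```

After a `res = requests.get(url)` call, `looks_like_consent_page(res.text)` tells you whether you got the real page or were bounced to the consent screen.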

The other thing to think about is whether you need headers, cookies, and parameters. In this case, you need to send headers with the HTTP request.

You can get these by right-clicking the page and choosing Inspect, then going to the Network tab and clicking the document request. Copy the request as cURL (bash) and paste it into curl.trillworks.com, which converts it to Python and gives you nicely formatted headers.

Code Example

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'www.forbes.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'client_id=4f023ae61ab633d8e7e2410a838a6ef93b8; notice_preferences=2:1a8b5228dd7ff0717196863a5d28ce6c; notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c; _cb_ls=1; _cb=DV5nM5D9GHBDCzpjkB; _ga=GA1.2.611301459.1592235761; __tbc=%7Bjzx%7DIFcj-ZhxuNCMjI4-mDfH1HGM-3PFKcN8Miwl1Jhx9eZNEmuQGlLmxXFL-9qM-F_OBO51AtKdJ3qgOfi3P9vM0qBHA3PyvmasSB5xaCbWibdU2meZrLoZ92gJ8xiw07mk3E9l5ifC0NcYbET3aSZxuA; xbc=%7Bjzx%7DGUDHEU3rvhv6-gySw5OY32YdbGDIZI_hJ7AHN4OvkbydVClZ3QNjNrlQVyHGl3ynSJzzGsKf0w3VfH3le6pYqMAfTQAzgDTJbUHa-cJS7p3ITwLt3PmPKKvsIVyFnHji; __gads=ID=a5ac1829fa387f90:T=1592235777:S=ALNI_MZfOqlh-TglrQCWbFNtjcjgFfkMGQ; _fbp=fb.1.1592235779290.60238202; __qca=P0-59264648-1592235777617; xdibx=N4Ig-mBGAeDGCuAnRIBcoAOGAuBnNAjAKwCcATGQMxEDsAHHQAyUkA0IGAbrAHbaHtc-VMXJVaDZmw6dcvfiPaIkAGzQgAFtmwZcqAPT6A7iYB0AMwD2iSAFNcp2JYC2-3AEts9.c.cBrW0sALwBDHncw.TJGaP1GADZ9Yn0eWyNYENxsFVsAWnhwrwATXNwQnNyQrERLTnLK3PMVdwxcy3Nc7A08p3cQdhVVdTdPb18A4LCIniiYxjjE5NT0zOy8gtGSsoqqjBq6lQamlraOrp7Ldx5SkIBPXFy9219bRFyckIBzeDz-kBU8IRSBRqPQmCwAL7sCAwJ6cNCgIp3YQAbVEIIkTAALDQALpQ8BQaC2Ti2PjCUDRBKUSgIkDw9AgWACEAKNHA8T0EiMEiUfGCOnM1CMdhs.kgFCMoUi1loFHioqCtAysUEoWgaWiuX4gnRMhYxiMOkMjUstnozl0ch0IjiilM5Va1DygmS03Cp0u9iKqWO2XO8Xqh0e.0uiEEmFwdw-kAhLHRLEkIodWyQEKwXJYrHxeK5SBEG2Z2A0EJFSCUMsJDMW0F0GhY4ggCFAA__; _chartbeat2=.1592235763483.1592235790833.1.CZMImKDrkr6iBB9QkcCHzJBoDWn8ZI.3',
}

response = requests.get('https://www.forbes.com/business/', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
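Once the full page comes back, you parse it like any other soup. The sketch below is a placeholder (inspect Forbes' real markup before relying on any particular selector); it simply collects the text and target of every link on the page:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Collect (text, href) pairs for every anchor that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)]

# Example on a small hypothetical snippet:
sample = '<a href="/business/article-1">Headline one</a>'
print(extract_links(sample))  # [('Headline one', '/business/article-1')]
```

Running `extract_links(response.text)` on the headers-enabled response should give you something to filter down to the articles you actually want.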
