
How to correctly parse XML URLs with requests in Python?

I would like to parse an XML file from a URL.

By doing the following:

req = requests.get('https://www.forbes.com/news_sitemap.xml')

Instead of getting the proper XML file, I get:

<!doctype html>
<html lang="en">
        <head>
                <meta http-equiv="Content-Language" content="en_US">

                <script type="text/javascript">
                        (function () {
                                function isValidUrl(toURL) {
                                        // Regex taken from welcome ad.
                                        return (toURL || '').match(/^(?:https?:?\/\/)?(?:[^.(){}\\\/]*)?\.?forbes\.com(?:\/|\?|$)/i);
                                }

                                function getUrlParameter(name) {
                                        name = name.replace(/[\[]/, '\\[').replace(/[\]]/, '\\]');
                                        var regex = new RegExp('[\\?&]' + name + '=([^&#]*)');
                                        var results = regex.exec(location.search);
                                        return results === null ? '' : decodeURIComponent(results[1].replace(/\+/g, ' '));
                                };

                                function consentIsSet(message) {
                                        console.log(message);
                                        var result = JSON.parse(message.data);
                                        if(result.message == "submit_preferences"){
                                                var toURL = getUrlParameter("toURL");
                                                if(!isValidUrl(toURL)){
                                                        toURL = "https://www.forbes.com/";
                                                }
                                                location.href=toURL;
                                        }
                                }

                                var apiObject = {
                                        PrivacyManagerAPI:
                                        {
                                                action: "getConsent",
                                                timestamp: new Date().getTime(),
                                                self: "forbes.com"
                                        }
                                };
                                var json = JSON.stringify(apiObject);
                                window.top.postMessage(json,"*");
                                window.addEventListener("message", consentIsSet, false);
                        })();
                </script>
        </head>
        <div id='teconsent'>
                <script async="async" type="text/javascript" crossorigin src='//consent.truste.com/notice?domain=forbes.com&c=teconsent'></script>
        </div>
        <body>
        </body>
</html>

Is there also a better way to handle the XML file (for example, if it is compressed, or by parsing it recursively if the file is too big...)? Thanks!

Using the requests module, I get the XML file. You can then use an XML parser library to do what you want.

import requests
url = "https://www.forbes.com/news_sitemap.xml"
x = requests.get(url)
print(x.text)
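
As a rough illustration of the "XML parser library" step, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the sitemap follows the sitemaps.org schema (hence the namespace below); the helper name extract_locs and the inline sample document are made up for the example, and in practice the bytes would come from x.content.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol; assumed to apply to this sitemap.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical stand-in for the bytes returned by requests (x.content).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(extract_locs(sample))
```

If the response is the HTML consent page rather than XML, ET.fromstring will most likely raise a ParseError, which is a quick way to notice that the wrong document came back.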

This site checks for a GDPR consent cookie. If you send that cookie with the request, you get the XML file. Try this code; it works fine for me.

import requests
url = "https://www.forbes.com/news_sitemap.xml"
news_sitemap = requests.get(url, headers={"Cookie": "notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c"})

print(news_sitemap.text)
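
For the second part of the question (compressed or very large files), one possible approach is to parse incrementally with ElementTree's iterparse and to decompress only when the payload starts with the gzip magic bytes. This is a sketch under assumptions: the helper names open_maybe_gzip and iter_locs are hypothetical, the sitemaps.org namespace is assumed, and the inline sample stands in for news_sitemap.content.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Fully qualified tag name for <loc> under the (assumed) sitemaps.org namespace.
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def open_maybe_gzip(raw):
    """Wrap raw bytes in a file object, decompressing if they look like gzip."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    return io.BytesIO(raw)


def iter_locs(fileobj):
    """Yield <loc> values one at a time instead of building the whole tree."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == LOC_TAG and elem.text:
            yield elem.text.strip()
        elem.clear()  # drop the element's text and children once it is finished


# Hypothetical stand-in for news_sitemap.content.
sample = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"<url><loc>https://example.com/b</loc></url>"
    b"</urlset>"
)

for loc in iter_locs(open_maybe_gzip(gzip.compress(sample))):
    print(loc)
```

Note that requests already decodes responses sent with Content-Encoding: gzip, so the magic-byte check mainly matters for files that are themselves gzip archives, such as a .xml.gz sitemap.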


 