I would like to parse an XML file from a URL.
I am fetching it as follows:
import requests
req = requests.get('https://www.forbes.com/news_sitemap.xml')
Instead of getting the proper XML file, I get:
<!doctype html>
<html lang="en">
<head>
<meta http-equiv="Content-Language" content="en_US">
<script type="text/javascript">
(function () {
function isValidUrl(toURL) {
// Regex taken from welcome ad.
return (toURL || '').match(/^(?:https?:?\/\/)?(?:[^.(){}\\\/]*)?\.?forbes\.com(?:\/|\?|$)/i);
}
function getUrlParameter(name) {
name = name.replace(/[\[]/, '\\[').replace(/[\]]/, '\\]');
var regex = new RegExp('[\\?&]' + name + '=([^&#]*)');
var results = regex.exec(location.search);
return results === null ? '' : decodeURIComponent(results[1].replace(/\+/g, ' '));
};
function consentIsSet(message) {
console.log(message);
var result = JSON.parse(message.data);
if(result.message == "submit_preferences"){
var toURL = getUrlParameter("toURL");
if(!isValidUrl(toURL)){
toURL = "https://www.forbes.com/";
}
location.href=toURL;
}
}
var apiObject = {
PrivacyManagerAPI:
{
action: "getConsent",
timestamp: new Date().getTime(),
self: "forbes.com"
}
};
var json = JSON.stringify(apiObject);
window.top.postMessage(json,"*");
window.addEventListener("message", consentIsSet, false);
})();
</script>
</head>
<div id='teconsent'>
<script async="async" type="text/javascript" crossorigin src='//consent.truste.com/notice?domain=forbes.com&c=teconsent'></script>
</div>
<body>
</body>
</html>
Is there also a better way to handle the XML file (for example, if it is compressed, or by parsing it recursively if the file is too big...)? Thanks!
Using the requests module, I do get the XML file. You can then use an XML parser library to process it however you need.
import requests
url = "https://www.forbes.com/news_sitemap.xml"
x = requests.get(url)
print(x.text)
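To illustrate the "use an XML parser" step, here is a minimal sketch with the standard library's xml.etree.ElementTree. It assumes the file follows the sitemaps.org format (a urlset of url/loc elements in the usual namespace), which I have not verified for this particular feed. The helper name extract_locs and the sample document are made up for the example; in practice you would pass it x.text or x.content.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to apply to this feed).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample standing in for the downloaded response body.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/a</loc></url>
  <url><loc>https://www.example.com/b</loc></url>
</urlset>"""

print(extract_locs(sample))
```

If the server returns the HTML consent page instead of XML, ET.fromstring will most likely raise xml.etree.ElementTree.ParseError, which is a cheap way to notice that you did not get the sitemap.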
This site checks for a GDPR consent cookie. If you send that cookie with the request, you get the XML file. Try this code; it works fine for me.
import requests
url = "https://www.forbes.com/news_sitemap.xml"
news_sitemap = requests.get(url, headers={"Cookie": "notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c"})
print(news_sitemap.text)
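On the second part of the question (compressed or very large files), one possible approach is to detect gzip by its magic bytes and parse the stream incrementally with ElementTree's iterparse, so the whole document never has to be held in memory as a string. This is only a sketch: open_maybe_gzip and iter_locs are names I made up, the sitemaps.org namespace is assumed, and the demo uses in-memory data rather than the real site.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Fully qualified tag name for <loc> in the sitemaps.org namespace (assumed).
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def open_maybe_gzip(raw):
    """Wrap a binary stream; decompress it if it starts with the gzip magic bytes."""
    buffered = io.BufferedReader(raw)
    # peek() looks at the buffered bytes without consuming them.
    if buffered.peek(2)[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=buffered, mode="rb")
    return buffered


def iter_locs(stream):
    """Yield each <loc> value while parsing the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == LOC_TAG and elem.text:
            yield elem.text.strip()
        # Drop the element's text and children once it has been handled.
        elem.clear()


# Hypothetical sample data, once plain and once gzip-compressed.
xml_bytes = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://www.example.com/a</loc></url>"
    b"<url><loc>https://www.example.com/b</loc></url>"
    b"</urlset>"
)

for payload in (xml_bytes, gzip.compress(xml_bytes)):
    print(list(iter_locs(open_maybe_gzip(io.BytesIO(payload)))))
```

With requests, you could probably pass stream=True to requests.get and hand the response's raw attribute to open_maybe_gzip instead of the BytesIO object, but I have not tested that against this site.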