I'm trying to scrape the content of a webpage with Python, and I'm able to get all the content I need, but the returned HTML also includes the cookie advisor. I want to remove it, but I don't know how to exclude it from the XPath query or from the HTML content. The advisor appears in the footer of the page (webpage here).
#!C:/Python27/python
from lxml import etree
import requests
import cgi
fs = cgi.FieldStorage()
q = fs.getfirst("URL")
page = requests.get(q)
if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')
content = etree.tostring(element[0])
print "Content-type: text\n\n"
print content.strip()
For the page you specified, the cookie advisor lives in a div with id=cookiesAlert. You can call xpath() on the element you already selected to find that div and then remove it from its parent, like so:
if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')
cookies_alert = element[0].xpath('.//*[@id="cookiesAlert"]')
for ca in cookies_alert:
    ca.getparent().remove(ca)
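The same idea can be tried offline on a small inline document. This is only a sketch: the helper name strip_by_id and the SAMPLE markup are made up for illustration and are not taken from the real page. One thing to watch for is that lxml's remove() also discards any text that directly follows the removed element (its tail), so the sketch moves that text onto the previous sibling or the parent first.

```python
from lxml import etree


def strip_by_id(root, element_id):
    """Remove every descendant of `root` whose id attribute equals `element_id`.

    Hypothetical helper for illustration. Text that follows a removed
    element (its tail) is kept by reattaching it before the removal.
    """
    # $eid is an XPath variable; lxml fills it from the keyword argument.
    for node in root.xpath('.//*[@id=$eid]', eid=element_id):
        parent = node.getparent()
        if parent is None:
            continue
        if node.tail:
            previous = node.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or '') + node.tail
            else:
                parent.text = (parent.text or '') + node.tail
        parent.remove(node)
    return root


# Made-up markup standing in for the scraped page.
SAMPLE = (
    '<html><body><div id="content">'
    '<p>Article text</p>'
    '<div id="cookiesAlert">We use cookies</div> trailing note'
    '</div></body></html>'
)

tree = etree.HTML(SAMPLE)
content = tree.xpath('.//*[@id="content"]')[0]
strip_by_id(content, 'cookiesAlert')
result = etree.tostring(content, encoding='unicode')
```

In the original script, the equivalent call would go right before content = etree.tostring(element[0]), so that the serialized output no longer contains the advisor.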