Python/HTML How to scrape webpage content without cookie advisor?

I'm trying to scrape the content of a webpage with Python, and I'm able to get all the content I need, but the returned HTML also includes the cookie advisor. I want to remove it, but I don't know how to exclude it from the XPath query or the HTML content. The advisor appears in the footer of the page. Webpage here

#!C:/Python27/python
from lxml import etree
import requests
import cgi

fs = cgi.FieldStorage()
q = fs.getfirst("URL")

page = requests.get(q)

if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')

content = etree.tostring(element[0])

print "Content-type: text\n\n"
print content.strip()

For the page you specified, the cookie advisor lives in a div with id="cookiesAlert". You can use lxml's xpath() to find that div and remove it from its parent, like so:

if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')
    cookies_alert = element[0].xpath('.//*[@id="cookiesAlert"]')
    for ca in cookies_alert:
        ca.getparent().remove(ca)
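The same remove-from-parent pattern can be demonstrated with only the standard library. This is a minimal sketch using xml.etree.ElementTree instead of lxml (the markup string and ids are made up for illustration); note that ElementTree has no getparent(), so the parent of each node has to be looked up explicitly:

```python
# Minimal stdlib sketch: parse markup, drop the element with
# id="cookiesAlert", then serialize the cleaned content back to a string.
import xml.etree.ElementTree as ET

# Hypothetical markup mimicking the page structure described above.
html = (
    '<body><div id="content">'
    '<p>Article text</p>'
    '<div id="cookiesAlert">We use cookies...</div>'
    '</div></body>'
)

tree = ET.fromstring(html)
content = tree.find('.//*[@id="content"]')

# ElementTree lacks getparent(), so build a child -> parent map first.
parent_map = {child: parent for parent in content.iter() for child in parent}

for alert in content.findall('.//*[@id="cookiesAlert"]'):
    parent_map[alert].remove(alert)

cleaned = ET.tostring(content, encoding="unicode")
print(cleaned)  # the cookiesAlert div is gone, the article text remains
```

With lxml the parent map is unnecessary, since every element carries a getparent() reference, as the answer's snippet shows.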
