Python/HTML How to scrape webpage content without cookie advisor?

I'm trying to scrape the content of a webpage with Python, and I'm able to get all the content I need, but the returned HTML also includes the cookie advisor. I want to remove it, but I don't know how to exclude it from the XPath query or the HTML content. The advisor appears in the footer of the page. Webpage here

#!C:/Python27/python
from lxml import etree
import requests
import cgi

fs = cgi.FieldStorage()
q = fs.getfirst("URL")

page = requests.get(q)

if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')

content = etree.tostring(element[0])

print "Content-type: text\n\n"
print content.strip()

For the page you specified, the cookie advisor lives in a div with id="cookiesAlert". You can use lxml's xpath() to find that div and remove it from its parent, like so:

if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')
    cookies_alert = element[0].xpath('.//*[@id="cookiesAlert"]')
    for ca in cookies_alert:
        ca.getparent().remove(ca)
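The same remove-from-parent pattern can be demonstrated with only the standard library. This is a minimal sketch using xml.etree.ElementTree instead of lxml (the markup string and ids are made up for illustration); note that ElementTree has no getparent(), so the parent of each node has to be looked up explicitly:

```python
# Minimal stdlib sketch: parse markup, drop the element with
# id="cookiesAlert", then serialize the cleaned content back to a string.
import xml.etree.ElementTree as ET

# Hypothetical markup mimicking the page structure described above.
html = (
    '<body><div id="content">'
    '<p>Article text</p>'
    '<div id="cookiesAlert">We use cookies...</div>'
    '</div></body>'
)

tree = ET.fromstring(html)
content = tree.find('.//*[@id="content"]')

# ElementTree lacks getparent(), so build a child -> parent map first.
parent_map = {child: parent for parent in content.iter() for child in parent}

for alert in content.findall('.//*[@id="cookiesAlert"]'):
    parent_map[alert].remove(alert)

cleaned = ET.tostring(content, encoding="unicode")
print(cleaned)  # the cookiesAlert div is gone, the article text remains
```

With lxml the parent map is unnecessary, since every element carries a getparent() reference, as the answer's snippet shows.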
